Назад към всички

zhipu-asr

// Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.

$ git log --oneline --stat
stars:1,933
forks:367
updated:March 4, 2026
SKILL.mdreadonly
SKILL.md Frontmatter
namezhipu-asr
descriptionAutomatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio transcription with context prompts, custom hotwords, and multiple audio formats.
metadata[object Object]

Zhipu AI Automatic Speech Recognition (ASR)

Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.

Setup

1. Get your API Key: Get a key from Zhipu AI Console

2. Set it in your environment:

export ZHIPU_API_KEY="your-key-here"

Supported Audio Formats

  • WAV - Recommended, best quality
  • MP3 - Widely supported
  • OGG - Auto-converted to MP3
  • M4A - Auto-converted to MP3
  • AAC - Auto-converted to MP3
  • FLAC - Auto-converted to MP3
  • WMA - Auto-converted to MP3

Note: The script automatically converts unsupported formats to MP3 using ffmpeg. Only WAV and MP3 are accepted by the API, but you can use any format that ffmpeg supports.

File Constraints

  • Maximum file size: 25 MB
  • Maximum duration: 30 seconds
  • Recommended sample rate: 16000 Hz or higher
  • Audio channels: Mono or stereo

Usage

Basic Transcription

Transcribe an audio file with default settings:

bash scripts/speech_to_text.sh recording.wav

Transcription with Context

Provide previous transcription or context for better accuracy:

bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容,有助于提高准确性"

Transcription with Hotwords

Use custom vocabulary to improve recognition of specific terms:

bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"

Full Options

Combine context and hotwords:

bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"

Parameters:

  • audio_file (required): Path to audio file (.wav or .mp3)
  • prompt (optional): Previous transcription or context text (max 8000 chars)
  • hotwords (optional): Comma-separated list of specific terms (max 100 words)

Features

Context Prompts

Why use context prompts:

  • Improves accuracy in long conversations
  • Helps with domain-specific terminology
  • Maintains consistency across multiple segments

When to use:

  • Multi-part conversations or meetings
  • Technical or specialized content
  • Continuing from previous transcriptions

Example:

bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容:讨论了项目进展和下一步计划"

Hotwords

What are hotwords: Custom vocabulary list that boosts recognition accuracy for specific terms.

Best use cases:

  • Proper names (people, places)
  • Domain-specific terminology
  • Company names and products
  • Technical jargon
  • Industry-specific terms

Examples:

# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"

# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"

# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"

Workflow Examples

Transcribe a Meeting

# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav

# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"

# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"

Transcribe a Lecture

bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"

Process Multiple Files

for file in recording_*.wav; do
    bash scripts/speech_to_text.sh "$file"
done

Audio Quality Tips

Best practices for accurate transcription:

  1. Clear audio source

    • Minimize background noise
    • Use good quality microphone
    • Speak clearly and at moderate pace
  2. Optimal audio settings

    • Sample rate: 16000 Hz or higher
    • Bit depth: 16-bit or higher
    • Single channel (mono) is sufficient
  3. File preparation

    • Remove silence from beginning/end
    • Normalize audio levels
    • Ensure consistent volume

Output Format

The script outputs JSON with:

  • id: Task ID
  • created: Request timestamp (Unix timestamp)
  • request_id: Unique request identifier
  • model: Model name used
  • text: Transcribed text

Example output:

{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好,这是转录的文本内容"
}

Troubleshooting

File Size Issues:

  • Split audio files larger than 25 MB
  • Reduce sample rate or bit depth
  • Use compression (MP3) for smaller files

Duration Issues:

  • Split recordings longer than 30 seconds
  • Process segments separately
  • Use context prompts to maintain continuity

Poor Accuracy:

  • Improve audio quality
  • Use hotwords for specific terms
  • Provide context prompts
  • Ensure clear speech and minimal noise

Format Issues:

  • Ensure file is .wav or .mp3
  • Check file is not corrupted
  • Verify audio can be played by standard players

Limitations

  • Maximum audio duration: 30 seconds per request
  • File size limit: 25 MB
  • Maximum hotwords: 100 terms
  • Context prompt limit: 8000 characters
  • Best performance with Chinese language audio

Performance Notes

  • Typical transcription time: 1-3 seconds
  • Real-time or faster for most audio
  • Processing time scales with audio quality and length