
salute-speech

Transcribe audio files using Sber Salute Speech async API. Russian-first STT with support for ru-RU, en-US, kk-KZ, ky-KG, uz-UZ.

Updated: March 4, 2026
SKILL.md Frontmatter
name: salute-speech
description: Transcribe audio files using Sber Salute Speech async API. Russian-first STT with support for ru-RU, en-US, kk-KZ, ky-KG, uz-UZ.

Audio Transcription with Sber Salute Speech

Transcribe audio/video files to text with timestamps via Salute Speech async REST API.

Requirements

  • API key: the environment variable SALUTE_AUTH_DATA must be set to either the Base64-encoded client_id:client_secret pair or the raw authorization key from https://developers.sber.ru/studio/.
  • SSL note: The script disables SSL verification by default (verify_ssl=False) because Sber's certificate chain is non-standard. This is expected.
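The credential requirement above can be sketched in a few lines. This is a minimal illustration, not the code in salute_transcribe.py: the helper names `encode_credentials` and `load_auth_data` are hypothetical, and only the variable name SALUTE_AUTH_DATA and the `client_id:client_secret` format come from this document.

```python
import base64
import os
from typing import Mapping


def encode_credentials(client_id: str, client_secret: str) -> str:
    """Return the Base64 encoding of 'client_id:client_secret'."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def load_auth_data(env: Mapping[str, str] = os.environ) -> str:
    """Read SALUTE_AUTH_DATA and fail early with a clear message if it is missing.

    Hypothetical helper: the real script may read the variable differently.
    """
    value = env.get("SALUTE_AUTH_DATA", "").strip()
    if not value:
        raise RuntimeError("SALUTE_AUTH_DATA is not set")
    return value
```

Checking the variable before any network call keeps a missing key from surfacing later as an opaque authorization error.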

Supported formats & encodings

| Audio encoding | Content-Type | Typical extensions |
|---|---|---|
| MP3 | audio/mpeg | .mp3 |
| PCM_S16LE | audio/wav | .wav |
| OPUS | audio/ogg | .ogg, .opus |
| FLAC | audio/flac | .flac |
| ALAW | audio/alaw | .alaw |
| MULAW | audio/mulaw | .mulaw |

Supported languages

ru-RU, en-US, kk-KZ (Kazakh), ky-KG (Kyrgyz), uz-UZ (Uzbek).

Workflow

  1. Identify input files — from user request.
  2. Read API key from host environment.
  3. Run transcription — execute salute_transcribe.py with uv and appropriate arguments.
  4. Deliver results — present a human-readable transcript with timestamps to the user and give direct links to the output files.

Usage

uv run --with requests {baseDir}/salute_transcribe.py \
  --file /path/to/audio.mp3 \
  --output_dir ~/.openclaw/workspace/transcriptions \
  --lang ru-RU

Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| --file | Yes | | Path to audio/video file |
| --output_dir | No | ~/.openclaw/workspace/transcribations | Output directory for results |
| --lang | No | ru-RU | Language code: ru-RU, en-US, kk-KZ, ky-KG, uz-UZ |
| --audio-encoding | No | MP3 | Codec: MP3, PCM_S16LE, OPUS, FLAC, ALAW, MULAW |
| --model | No | general | Recognition model: general or callcenter |
| --hyp-count | No | 1 | Number of alternative hypotheses: 1 or 2 |
| --max-wait-time | No | 300 | Max seconds to wait for async result |
| --print | No | off | Also print transcription to stdout |
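For orientation, the arguments listed above could be declared with argparse roughly as follows. This is a sketch built only from the table, not the script's actual parser: the function name `build_parser`, the `print_result` destination, and the assumption that --print is a value-less flag are all illustrative.

```python
import argparse

LANGS = ["ru-RU", "en-US", "kk-KZ", "ky-KG", "uz-UZ"]
ENCODINGS = ["MP3", "PCM_S16LE", "OPUS", "FLAC", "ALAW", "MULAW"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    p = argparse.ArgumentParser(description="Transcribe audio with Salute Speech")
    p.add_argument("--file", required=True, help="Path to audio/video file")
    p.add_argument("--output_dir", default="~/.openclaw/workspace/transcribations",
                   help="Output directory for results")
    p.add_argument("--lang", default="ru-RU", choices=LANGS)
    p.add_argument("--audio-encoding", default="MP3", choices=ENCODINGS)
    p.add_argument("--model", default="general", choices=["general", "callcenter"])
    p.add_argument("--hyp-count", type=int, default=1, choices=[1, 2])
    p.add_argument("--max-wait-time", type=int, default=300,
                   help="Max seconds to wait for async result")
    # Assumption: --print is a boolean switch ("off" unless passed).
    p.add_argument("--print", dest="print_result", action="store_true")
    return p
```

With this layout, `build_parser().parse_args(["--file", "audio.mp3"])` yields the defaults from the table, and argparse exposes hyphenated options with underscores (for example `args.audio_encoding`).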

Content-Type mapping

The script currently sends Content-Type audio/mpeg (MP3) by default. For any other format, adjust content_type in the script or add logic that selects it from the file extension, for example audio/wav for .wav files.
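One way to add that logic is a lookup keyed on the file extension. The sketch below uses only the pairs from the formats table in this document; the names `FORMATS` and `guess_format` are hypothetical, and it assumes that falling back to MP3 for unknown extensions matches the script's current default.

```python
from pathlib import Path

# Extension -> (audio encoding, Content-Type), taken from the formats table.
FORMATS = {
    ".mp3": ("MP3", "audio/mpeg"),
    ".wav": ("PCM_S16LE", "audio/wav"),
    ".ogg": ("OPUS", "audio/ogg"),
    ".opus": ("OPUS", "audio/ogg"),
    ".flac": ("FLAC", "audio/flac"),
    ".alaw": ("ALAW", "audio/alaw"),
    ".mulaw": ("MULAW", "audio/mulaw"),
}


def guess_format(path, default=("MP3", "audio/mpeg")):
    """Return (encoding, content_type) for a file, falling back to the default."""
    return FORMATS.get(Path(path).suffix.lower(), default)
```

The extension only hints at the codec: a .wav file that is not 16-bit PCM, for instance, would still be labelled PCM_S16LE here, so an explicit --audio-encoding should take precedence when given.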

Output files

For input file meetingABC.mp3 the script produces:

| File | Description |
|---|---|
| meetingABC_recognition_orig.json | Raw API response (full JSON with all hypotheses, timing, confidence) |
| meetingABC_pretty.txt | Formatted human-readable transcript with timestamps |
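The naming scheme above can be derived from the input path as shown in this sketch. The function name `output_paths` is hypothetical, and the sketch assumes the script takes the input file's stem (name without extension) and appends the two suffixes.

```python
from pathlib import Path


def output_paths(input_file, output_dir):
    """Return (raw_json_path, pretty_txt_path) for a given input file."""
    stem = Path(input_file).stem            # "meetingABC" for "meetingABC.mp3"
    out = Path(output_dir).expanduser()     # resolve a leading "~"
    return out / f"{stem}_recognition_orig.json", out / f"{stem}_pretty.txt"
```

Computing both paths up front makes it easy to give the user direct links to the files once transcription finishes.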

Output text format

[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.

[00:20 - 00:45]:
Следующий фрагмент текста здесь.
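A formatter producing this layout might look like the sketch below. It assumes each segment is available as (start_seconds, end_seconds, text) and that timestamps are rendered as MM:SS with minutes continuing past 59; the function names are hypothetical, and the structure of the raw API response is not shown in this document.

```python
def fmt_time(seconds: float) -> str:
    """Render seconds as MM:SS, truncating fractions of a second."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_segment(start: float, end: float, text: str) -> str:
    """Format one segment as a '[MM:SS - MM:SS]:' header followed by its text."""
    return f"[{fmt_time(start)} - {fmt_time(end)}]:\n{text.strip()}"


def format_transcript(segments) -> str:
    """Join segments with a blank line between them, as in the sample above."""
    return "\n\n".join(format_segment(*s) for s in segments) + "\n"
```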

Notes

  • Token is valid for ~30 minutes; the script fetches a new one each run.
  • Large files (>1 hour) may need --max-wait-time increased beyond 300s.
  • The callcenter model is optimized for telephony audio (8kHz, mono).
  • Profanity filter is disabled by default (enable_profanity_filter=False).
  • The script uses normalized text by default (numbers as digits, abbreviations expanded). Raw text is also available in the JSON output.
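The role of --max-wait-time can be illustrated with a generic polling loop. This is a sketch under stated assumptions rather than the script's implementation: `check_status` is a hypothetical callable that returns the task status, the status strings "DONE" and "ERROR" are assumed, and the 2-second interval is arbitrary.

```python
import time


def wait_for_result(check_status, max_wait_time=300, poll_interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the task finishes, fails, or max_wait_time seconds elapse.

    check_status: hypothetical callable returning a status string.
    sleep/clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + max_wait_time
    while True:
        status = check_status()
        if status == "DONE":        # assumed terminal success status
            return True
        if status == "ERROR":       # assumed terminal failure status
            raise RuntimeError("recognition task failed")
        if clock() >= deadline:
            raise TimeoutError(f"no result after {max_wait_time} s")
        sleep(poll_interval)
```

Using a monotonic clock keeps the deadline correct even if the system time changes during a long transcription.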