qwen3-tts-local-inference

stars: 1,933 · forks: 367 · updated: March 4, 2026
SKILL.md frontmatter (read-only):

name: qwen3-tts-local-inference
description: Generate speech from text using Qwen3-TTS via direct Python inference — no server required. Use when: (1) converting text to speech / synthesising audio, (2) creating voiceovers or spoken content, (3) cloning a voice from reference audio, (4) generating TTS with built-in speakers or custom voice descriptions. Supports custom-voice (9 speakers), voice-design (natural language), and voice-clone (~3 s reference). Outputs .wav files. Both 0.6B (small, default) and 1.7B (large) models available. Runs entirely offline after model download.

Qwen3-TTS — Local Inference (No Server)

Run Qwen3-TTS directly in Python — no HTTP server, no REST API. Call a script or import the engine in your own code.

Quick reference

| Mode | What it does | Key args |
|------|--------------|----------|
| custom-voice | 9 built-in speakers, optional emotion/style | --speaker, --instruct |
| voice-design | Describe the voice in natural language | --instruct (required) |
| voice-clone | Clone from ~3 s reference audio | --ref-audio, --ref-text |

Available Speakers

The CustomVoice model includes 9 premium voices:

| Speaker | Language | Description |
|---------|----------|-------------|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light nimble |
| Sohee | Korean | Warm female, rich emotion |

Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, Auto
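The speaker table above can be captured as a small lookup, handy for validating a `--speaker` value before committing to a slow model call. A minimal sketch — the names and languages mirror the table, but the helper itself is not part of the skill's API:

```python
# Built-in CustomVoice speakers, keyed by name (mirrors the table above).
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}

def check_speaker(name: str) -> str:
    """Return the speaker's language, or raise listing the valid choices."""
    try:
        return SPEAKERS[name]
    except KeyError:
        raise ValueError(
            f"Unknown speaker {name!r}; choose one of {sorted(SPEAKERS)}"
        ) from None

print(check_speaker("Ryan"))  # English
```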


1 — Setup

First-time setup (run once, from the skill directory):

bash scripts/setup.sh

Custom download location:

python scripts/download_models.py --model-dir /path/to/models

Models are stored under {baseDir}/models/ by default. Override with QWEN_TTS_MODEL_DIR env var or --model-dir flag.


2 — Generate speech (CLI)

Custom Voice (default)

cd {baseDir}
python scripts/tts.py "Hello, how are you today?" --speaker Ryan --language English

With emotion/style instruction:

python scripts/tts.py "Great news everyone!" --speaker Aiden --instruct "cheerful and energetic"

Voice Design

Describe the voice in natural language:

python scripts/tts.py "Welcome to our show!" \
  --mode voice-design \
  --language English \
  --instruct "Warm, confident female voice in her 30s with a slight British accent"

Voice Clone

Clone a voice from a short (~3 s) reference audio clip:

python scripts/tts.py "This is spoken in the cloned voice." \
  --mode voice-clone \
  --language English \
  --ref-audio path/to/reference.wav \
  --ref-text "Transcript of the reference audio."

Common options

| Flag | Purpose |
|------|---------|
| -o output.wav | Save to an exact file path instead of an auto-named file |
| --output-dir DIR | Override the output directory (default: tts_output/) |
| --model-dir DIR | Override the model directory |
| --json | Print the result as JSON |
| -v | Verbose logging |
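When driving `tts.py` from another script (e.g. via `subprocess`), the flags above can be assembled programmatically. A sketch of one way to build the argument list — the flag names come from the examples in this section, but the helper itself is hypothetical:

```python
import sys

def build_tts_argv(text, mode="custom-voice", **opts):
    """Assemble an argv list for scripts/tts.py (sketch only)."""
    argv = [sys.executable, "scripts/tts.py", text]
    if mode != "custom-voice":  # custom-voice is the default, so no --mode needed
        argv += ["--mode", mode]
    # Map keyword options onto the CLI flags documented above.
    for key, flag in [
        ("speaker", "--speaker"),
        ("language", "--language"),
        ("instruct", "--instruct"),
        ("ref_audio", "--ref-audio"),
        ("ref_text", "--ref-text"),
        ("output", "-o"),
    ]:
        if key in opts:
            argv += [flag, str(opts[key])]
    return argv

argv = build_tts_argv("Hello!", speaker="Ryan", language="English")
# Run it with: subprocess.run(argv, check=True)
```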

3 — Python API

Use the engine directly in code:

import sys
sys.path.insert(0, "{baseDir}/scripts")

from inference import TTSInferenceEngine

engine = TTSInferenceEngine(
    model_dir="{baseDir}/models",   # optional, uses default if omitted
    output_dir="./tts_output",       # optional
)

result = engine.generate_custom_voice(
    text="Hello world!",
    language="English",
    speaker="Ryan",
    instruct="calm and professional",
)
print(result)
# {"file": "tts_output/custom_voice_20260218_...wav", "duration_s": 1.23, "inference_s": 4.56}

Available methods:

  • engine.generate_custom_voice(text, language, speaker, instruct)
  • engine.generate_voice_design(text, language, instruct)
  • engine.generate_voice_clone(text, language, ref_audio, ref_text)
  • engine.status() — returns loaded variant, device, paths
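The result dict shown above reports both the audio duration and the wall-clock inference time, which is enough to compute a real-time factor (RTF). A small sketch — the dict keys match the example output above, but `real_time_factor` is not part of the engine:

```python
def real_time_factor(result: dict) -> float:
    """RTF = inference seconds per second of generated audio.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return result["inference_s"] / result["duration_s"]

sample = {"file": "tts_output/out.wav", "duration_s": 1.23, "inference_s": 4.56}
print(f"RTF: {real_time_factor(sample):.2f}")  # RTF: 3.71
```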

4 — Configuration

All settings are controlled via environment variables. Set them before running.

| Variable | Default | Description |
|----------|---------|-------------|
| QWEN_TTS_MODEL_SIZE | small | small (0.6B) or large (1.7B) |
| QWEN_TTS_MODEL_DIR | {baseDir}/models | Where model weights are stored |
| QWEN_TTS_DEVICE | auto (cuda:0 or cpu) | Inference device |
| QWEN_TTS_DTYPE | auto (bfloat16 / float32) | Model precision |
| QWEN_TTS_OUTPUT_DIR | ./tts_output | Where generated .wav files are saved |

Switch to the 1.7B model (Windows `set` shown; use `export VAR=value` on Linux/macOS):

set QWEN_TTS_MODEL_SIZE=large
python scripts/tts.py "Hello world"

Use a custom model directory:

set QWEN_TTS_MODEL_DIR=D:\my-models\qwen-tts
python scripts/tts.py "Hello world"

Important notes

  • Small model (0.6B) is the default. It uses less RAM and is faster. Switch to large (1.7B) for higher quality.
  • CPU inference is slow. Expect 30-120 s per sentence for the 1.7B model. The 0.6B model is roughly 2x faster.
  • Only one model variant is loaded at a time. Switching modes (e.g. custom-voice to voice-clone) triggers a model swap.
  • Output .wav files land in tts_output/ by default.
  • Models are downloaded to {baseDir}/models/ by default. Run download_models.py --size all to pre-download both sizes for offline use.
  • Voice Design mode has no 0.6B variant — it always uses the 1.7B model regardless of QWEN_TTS_MODEL_SIZE.
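The last note means the effective model size depends on the mode as well as on `QWEN_TTS_MODEL_SIZE`. A sketch of that selection rule, as an illustration of the behaviour described above rather than the skill's own code:

```python
def effective_model_size(mode: str, requested: str = "small") -> str:
    """voice-design has no 0.6B variant, so it always uses 'large';
    the other modes honour the requested size."""
    if mode == "voice-design":
        return "large"
    return requested
```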