qwen3-tts-mlx

Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

1,933 stars · 367 forks · Updated March 4, 2026
SKILL.md frontmatter:

name: qwen3-tts-mlx
description: Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

  • Generate speech fully offline on a Mac
  • Produce narration, audiobooks, podcasts, or video voiceovers
  • Create multilingual TTS with controllable style and emotion
  • Clone any voice from a short audio sample
  • Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav

Model Variants

| Variant | Model | Size | Memory | Use Case |
|---|---|---|---|---|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1 GB | ~4 GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2 GB | ~5 GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1 GB | ~4 GB | Voice cloning from reference audio |

Supported Languages

| Language | Code | Notes |
|---|---|---|
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |

Built-in Voices

| Voice | Language | Character |
|---|---|---|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
|---|---|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
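
The selection guide above can be encoded as a small lookup for scripted pipelines. This is an illustrative helper, not part of the skill's API; the scenario keys are made up for the example, while the voice names come from the built-in voice table.

```python
# Hypothetical mapping from (language, scenario) to a built-in voice name.
# Voice names are from the skill's built-in voice table; scenario keys are
# illustrative only.
VOICE_FOR = {
    ("chinese", "news"): "Uncle_Fu",
    ("chinese", "casual"): "Eric",
    ("chinese", "professional"): "Vivian",
    ("chinese", "storytelling"): "Serena",
    ("english", "energetic"): "Ryan",
    ("english", "neutral"): "Aiden",
    ("japanese", "general"): "Ono_Anna",
    ("korean", "general"): "Sohee",
}

def pick_voice(language, scenario="general", default="Vivian"):
    """Return the recommended voice, falling back to a default."""
    return VOICE_FOR.get((language.lower(), scenario), default)

print(pick_voice("English", "neutral"))  # Aiden
```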

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

  • "calm and warm" - Soft, friendly delivery
  • "news anchor, authoritative" - Professional broadcast style
  • "excited and energetic" - High energy, enthusiastic
  • "sad and melancholic" - Emotional, somber tone
  • "whispering, intimate" - Quiet, close-mic feel
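
To render the same line in several styles, one approach is to build one CLI invocation per style and run them in a loop. The helper below is a sketch assuming the `scripts/run_tts.py` interface shown above; it only constructs the commands (printing them here) rather than executing them.

```python
import shlex

# Hypothetical helper: build the run_tts.py invocation for one style variant.
def style_command(text, voice, instruct, output):
    return [
        "python", "scripts/run_tts.py", "custom-voice",
        "--text", text,
        "--voice", voice,
        "--instruct", instruct,
        "--output", output,
    ]

# Style strings taken from the examples above.
styles = {
    "calm": "calm and warm",
    "news": "news anchor, authoritative",
    "excited": "excited and energetic",
}

for name, instruct in styles.items():
    cmd = style_command("Same line, different delivery.", "Vivian",
                        instruct, f"{name}.wav")
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run() to execute
```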

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

  • "young cheerful female with high pitch"
  • "elderly wise male with deep resonant voice"
  • "professional female news anchor, clear articulation"
  • "friendly young male, casual and relaxed"

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

  • Use clean audio without background noise
  • 5-10 seconds of speech works best
  • Provide accurate transcript of the reference
  • Reference and output language should match
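
Since clip length matters, it can be worth checking the reference before synthesis. The helper below is an illustrative sketch (not part of the skill) using Python's standard `wave` module; it only handles uncompressed WAV files, and the 5-10 s window comes from the tips above.

```python
import wave

def reference_duration(path):
    """Duration in seconds of an uncompressed WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def check_reference(path, lo=5.0, hi=10.0):
    """Flag reference clips outside the recommended 5-10 s range."""
    dur = reference_duration(path)
    if dur < lo:
        return f"too short ({dur:.1f}s); aim for {lo:.0f}-{hi:.0f}s"
    if dur > hi:
        return f"longer than recommended ({dur:.1f}s)"
    return f"ok ({dur:.1f}s)"
```

For example, `check_reference("reference.wav")` before running the voice-clone command catches clips that are too short to clone well.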

CLI Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    if hasattr(chunk, 'audio') and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.
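
Independent of the bundled script, a batch loop over `generate_audio` can be sketched as below. The JSON shape here (a list of objects with `text`, `voice`, and `file_prefix` fields) is an illustration only, not the schema from references/dubbing_format.md; the `generate_audio` call is commented out so the sketch runs without a model download.

```python
import json

# Illustrative line format; see references/dubbing_format.md for the real schema.
LINES = json.loads("""
[
  {"text": "First line.",  "voice": "Ryan",  "file_prefix": "line_001"},
  {"text": "Second line.", "voice": "Aiden", "file_prefix": "line_002"}
]
""")

def synthesis_jobs(lines,
                   model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
                   out_dir="outputs"):
    """Turn dubbing entries into generate_audio keyword arguments."""
    for entry in lines:
        yield dict(
            text=entry["text"],
            model=model,
            voice=entry.get("voice", "Vivian"),
            output_path=out_dir,
            file_prefix=entry["file_prefix"],
            join_audio=True,
        )

for job in synthesis_jobs(LINES):
    # from mlx_audio.tts.generate import generate_audio
    # generate_audio(**job)  # uncomment on a Mac with mlx-audio installed
    print(job["file_prefix"], "->", job["voice"])
```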

Performance

| Metric | Value |
|---|---|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2 GB) |
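
Given the real-time factor above, generation time scales roughly linearly with output length. A back-of-the-envelope estimate, using the reported ~0.7x RTF and 24 kHz sample rate:

```python
SAMPLE_RATE = 24_000  # Hz, from the performance table

def estimated_generation_seconds(audio_seconds, rtf=0.7):
    """Approximate wall-clock time to synthesize a clip of the given length."""
    return audio_seconds * rtf

def num_samples(audio_seconds, sample_rate=SAMPLE_RATE):
    """Sample count of the resulting WAV."""
    return int(audio_seconds * sample_rate)

print(estimated_generation_seconds(60))  # ~42 s for one minute of audio
print(num_samples(60))                   # 1,440,000 samples
```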

Troubleshooting

| Issue | Solution |
|---|---|
| Slow generation | Use the 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Poor voice-cloning quality | Use cleaner reference audio and an accurate transcript |
| Tokenizer warnings | Harmless; can be ignored |
| Out of memory | Close other apps, use a 4-bit model |