Qwen3-TTS is a series of speech generation models from Qwen that supports voice cloning, voice design, high-quality human-like speech synthesis, and natural language-based voice control. It covers 10 major languages and multiple dialects, and it uses contextual understanding to adapt tone, speaking rate, and emotional expression to both the instructions it is given and the semantics of the text.

The models are built on Qwen's in-house Qwen3-TTS-Tokenizer-12Hz, which handles acoustic compression and semantic modeling, and they use a universal end-to-end architecture with a discrete multi-codebook language model. For streaming, a Dual-Track hybrid architecture enables very low-latency generation, with synthesis latency as low as 97ms.
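To give a rough sense of what a 12Hz tokenizer with multiple codebooks implies, the sketch below estimates how many discrete tokens represent a clip of a given length. It assumes the "12Hz" in the tokenizer's name means a nominal 12 frames per second, and the number of codebooks is a hypothetical parameter, since the description above does not state it. Treat the figures as back-of-the-envelope arithmetic, not as the tokenizer's documented behavior.

```python
import math

# Assumption: "12Hz" in the tokenizer name is read as a nominal 12 frames per second.
ASSUMED_FRAME_RATE_HZ = 12.0


def frames_for_duration(seconds: float, frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> int:
    """Estimate the number of tokenizer frames covering `seconds` of audio."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # Round up so a partial trailing frame is still counted.
    return math.ceil(seconds * frame_rate_hz)


def tokens_for_duration(
    seconds: float,
    num_codebooks: int,
    frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ,
) -> int:
    """Estimate total discrete tokens when each frame carries one token per codebook.

    `num_codebooks` is hypothetical: the source only says "multi-codebook".
    """
    if num_codebooks < 1:
        raise ValueError("num_codebooks must be at least 1")
    return frames_for_duration(seconds, frame_rate_hz) * num_codebooks
```

Under these assumptions, a 3-second reference clip such as the one used for rapid voice cloning would span `frames_for_duration(3.0)` frames, and the total token count scales linearly with whatever codebook count the model actually uses.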

Key features include:

Voice cloning (3-second rapid voice clone)
Voice design
Ultra-high-quality human-like speech generation
Natural language-based voice control
Multi-language and dialect support
Low-latency streaming generation

Models are available for download from Hugging Face and ModelScope. The project provides a Python package for custom voice generation, voice design, and voice cloning, along with examples covering environment setup, tokenizer encoding and decoding, and a local web UI demo.
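As one way to fetch the weights from Hugging Face, the sketch below uses `snapshot_download` from the `huggingface_hub` library. The repository id is an assumption and should be checked against the official model page before use. `local_dir_for` and `download_model` are hypothetical helper names introduced for illustration and are not part of the Qwen3-TTS package.

```python
from pathlib import Path

# Assumed repository id; confirm the exact name on the Hugging Face model page.
ASSUMED_REPO_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def local_dir_for(repo_id: str, root: str = "models") -> Path:
    """Map an 'org/name' repository id to a local directory such as models/name."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str = ASSUMED_REPO_ID, root: str = "models") -> Path:
    """Download every file in the repository and return the local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (requires network access and `pip install huggingface_hub`):
#     model_dir = download_model()
```

Keeping the repository id as a parameter makes it easy to switch to a different checkpoint once the exact names are confirmed.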
