10 — Text-to-Speech (TTS) — Local Setup
Local TTS on Apple Silicon: Orpheus TTS via Ollama + Qwen3-TTS 0.6B direct. Works through a corporate proxy via hf-mirror.com.
Last updated: 2026-02-21
Overview
Two TTS engines for local speech generation — both run fully offline after initial setup.
| Engine | Model | Size | How It Runs | Quality | Speed |
|---|---|---|---|---|---|
| Orpheus TTS | sematre/orpheus:en | 4 GB | Via Ollama (Metal GPU) | Great: expressive, 8 voices, emotion tags | ~11s for short sentences |
| Qwen3-TTS | Qwen3-TTS-12Hz-0.6B-CustomVoice | 1.2 GB | Direct Python (MPS/CPU) | Excellent: 10 languages, voice design | ~10-20s on MPS |
Architecture
Text → Ollama (Orpheus 3B) → Audio Tokens → SNAC Decoder → WAV file
Text → Qwen3-TTS 0.6B (PyTorch MPS) → WAV file
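A minimal sketch of the middle of the Orpheus path (audio tokens to SNAC decoder input). It assumes the model emits audio as `<custom_token_N>` markers and that the constants below (offset 10, 4096 codes per slot, 7 tokens per frame) match the upstream Orpheus reference decoder. The constants and function names are assumptions for illustration and are not taken from test_orpheus_tts.py, so check them against the script before relying on them.

```python
import re

# Assumed values, based on the upstream Orpheus reference decoder.
TOKEN_OFFSET = 10      # first audio code starts at <custom_token_10>
CODEBOOK_SIZE = 4096   # codes per slot within a frame
FRAME_LEN = 7          # audio tokens per frame


def extract_codes(generated_text: str) -> list[int]:
    """Convert '<custom_token_N>' markers into per-slot codebook indices.

    Markers that fall outside the expected range for the current slot
    (for example the control tokens used in the prompt) are skipped.
    """
    codes: list[int] = []
    for match in re.finditer(r"<custom_token_(\d+)>", generated_text):
        slot = len(codes) % FRAME_LEN
        code = int(match.group(1)) - TOKEN_OFFSET - slot * CODEBOOK_SIZE
        if 0 <= code < CODEBOOK_SIZE:
            codes.append(code)
    return codes


def split_layers(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Regroup 7-token frames into three layers (1, 2 and 4 codes per frame).

    Any trailing partial frame is dropped.
    """
    layer0: list[int] = []
    layer1: list[int] = []
    layer2: list[int] = []
    usable = len(codes) - len(codes) % FRAME_LEN
    for start in range(0, usable, FRAME_LEN):
        f = codes[start:start + FRAME_LEN]
        layer0.append(f[0])
        layer1.extend([f[1], f[4]])
        layer2.extend([f[2], f[3], f[5], f[6]])
    return layer0, layer1, layer2
```

The three lists would then be converted to tensors and handed to the SNAC decoder, which produces the 24 kHz waveform that is written to the WAV file.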
Quick Start (Fresh Laptop)
The one-shot setup script handles everything — works on any Apple Silicon Mac, including through corporate proxy:
cd __LOCAL_LLMs
bash setup-tts.sh
This installs: Python 3.12, venv, pip packages, Orpheus model (Ollama), SNAC decoder (hf-mirror.com), and optionally Qwen3-TTS 0.6B.
After setup:
.venv-qwen-tts/bin/python test_orpheus_tts.py
afplay test_orpheus_tara.wav
Prerequisites
| Component | How to Install | Notes |
|---|---|---|
| macOS + Apple Silicon | — | M1/M2/M3/M4 (MPS acceleration) |
| Homebrew | /bin/bash -c "$(curl -fsSL ...)" | Package manager |
| Ollama | brew install ollama | Local LLM server |
| Python 3.12 | brew install python@3.12 | TTS packages need 3.12 |
All of the above are installed automatically by setup-tts.sh.
Manual Setup (step by step)
If you prefer to run each step yourself instead of setup-tts.sh:
1. Python Environment
cd __LOCAL_LLMs
# Install Python 3.12
brew install python@3.12
# Create isolated venv
/opt/homebrew/bin/python3.12 -m venv .venv-qwen-tts
# Install packages
.venv-qwen-tts/bin/pip install -U snac qwen-tts
2. Orpheus TTS Model (via Ollama)
ollama serve & # start Ollama if not running
ollama pull sematre/orpheus:en # 4 GB, via Ollama registry (works through proxy)
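To confirm the pull succeeded without reading `ollama list` by eye, a small check against Ollama's HTTP API may help. This is a sketch that assumes Ollama listens on its default address (localhost:11434) and that `/api/tags` returns a `models` list with a `name` field per entry. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address (assumed unchanged)


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str = "sematre/orpheus:en") -> bool:
    """Return True if the wanted model tag appears in the payload."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Query the local Ollama server; raises URLError if it is not running."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `has_model(fetch_tags())` from the venv's Python should return `True` once the model is pulled; a connection error usually means `ollama serve` is not running.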
3. SNAC Audio Decoder
Downloads via hf-mirror.com — works through corporate proxy:
bash download-tts-models.sh snac # just SNAC (~76 MB)
Or manually:
mkdir -p models/snac_24khz
curl -k -sL -o models/snac_24khz/config.json \
"https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L --progress-bar -o models/snac_24khz/pytorch_model.bin \
"https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"
4. Qwen3-TTS 0.6B (optional)
bash download-tts-models.sh qwen # tokenizer + model (~1.7 GB)
After the downloads finish, everything runs fully offline.
Usage
Orpheus TTS (via Ollama)
# Make sure Ollama is running
ollama serve &
# Run test
.venv-qwen-tts/bin/python test_orpheus_tts.py
# Play output
afplay test_orpheus_tara.wav
Voices: tara, leah, jess, leo, dan, mia, zac, zoe
Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
# Example prompt format
voice = "tara"
text = "<laugh> That's hilarious! Tell me more."
prompt = f"<custom_token_3><|begin_of_text|>{voice}: {text}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
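The prompt format above can be wrapped in a helper that also builds a request body for Ollama's `/api/generate` endpoint. This is a sketch: the function names are hypothetical, and the use of `raw` (to bypass Ollama's prompt template) and `stream: False` is an assumption about how the prompt should be sent, not a description of test_orpheus_tts.py.

```python
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_orpheus_prompt(voice: str, text: str) -> str:
    """Wrap voice and text in the Orpheus control tokens shown above."""
    return (
        "<custom_token_3><|begin_of_text|>"
        f"{voice}: {text}"
        "<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
    )


def build_generate_request(voice: str, text: str,
                           model: str = "sematre/orpheus:en") -> dict:
    """Build a JSON-serializable body for POST /api/generate.

    Assumption: the prompt is sent raw so the control tokens reach the
    model unchanged, and the full response is returned in one piece.
    """
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    return {
        "model": model,
        "prompt": build_orpheus_prompt(voice, text),
        "raw": True,
        "stream": False,
    }
```

Validating the voice name up front gives a clear error instead of silently generating audio with an unintended speaker.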
Qwen3-TTS (direct Python)
.venv-qwen-tts/bin/python test_qwen_tts.py
afplay test_output_english.wav
Features:
- 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- Built-in speaker voices (Chelsie, Vivian, Ryan, etc.)
- Natural language emotion control: instruct="Speak with excitement"
- Voice cloning from a short audio sample (with Base model variant)
File Inventory
__LOCAL_LLMs/
├── setup-tts.sh # ← START HERE — one-shot setup for fresh laptop
├── download-tts-models.sh # Download model weights (uses hf-mirror.com)
├── test_orpheus_tts.py # Orpheus TTS test (Ollama + SNAC)
├── test_qwen_tts.py # Qwen3-TTS test (direct Python)
├── .venv-qwen-tts/ # Python 3.12 venv (gitignored, created by setup)
├── models/ # Downloaded model weights (gitignored)
│ ├── snac_24khz/ # SNAC audio decoder (~76 MB)
│ ├── Qwen3-TTS-Tokenizer-12Hz/ # Qwen3-TTS tokenizer (optional)
│ └── Qwen3-TTS-12Hz-0.6B-CustomVoice/ # Qwen3-TTS model (~1.2 GB, optional)
└── *.wav # Generated audio output (gitignored)
OSS Speech Model Landscape (as of Feb 2026)
Speech-to-Text (STT)
| Model | By | Notes |
|---|---|---|
| Whisper / whisper-cpp | OpenAI / ggerganov | Gold standard, already installed, Metal-accelerated |
| Faster Whisper | SYSTRAN | 4× faster via CTranslate2 |
| Distil-Whisper | Hugging Face | 6× faster, 49% fewer params |
Text-to-Speech (TTS)
| Model | By | Size | Notes |
|---|---|---|---|
| Qwen3-TTS ⭐ | Alibaba | 0.6B–1.7B | Best quality, 10 languages, voice cloning, Jan 2026 |
| Orpheus TTS | Canopy AI | 3B | Expressive, 8 voices, emotion tags, available on Ollama |
| Kokoro | HF Community | 82M | Very fast, near-commercial quality, Apache 2.0 |
| Piper | Rhasspy | ONNX | Lightweight, runs on Raspberry Pi |
| F5-TTS | SWivid | — | Zero-shot voice cloning, flow matching |
| StyleTTS 2 | Columbia U | — | Human-level quality, style diffusion |
| OuteTTS | Community | — | Pure LLM-based TTS, runs via llama.cpp |
| Bark | Suno | — | Speech + music + sound effects |
Corporate Proxy Notes
| Source | Status | Workaround |
|---|---|---|
| Ollama registry (registry.ollama.ai) | ✅ Works | Ollama pull uses its own CDN |
| PyPI (via artifact.it.att.com) | ✅ Works | Corporate Artifactory mirror |
| GitHub releases | ✅ Works | Direct download |
| HuggingFace (huggingface.co) | ❌ Blocked | Use hf-mirror.com as mirror (works through proxy) |
| hf-mirror.com (HF mirror) | ✅ Works | Chinese HF mirror, not blocked by Forcepoint |
Forcepoint CSO intercepts HTTPS and serves a block page for HuggingFace. No SSL workaround works for huggingface.co. However, hf-mirror.com (a Chinese mirror of HuggingFace) is not blocked and can be used to download model weights:
# Download SNAC config + weights via mirror
curl -k -L -o models/snac_24khz/config.json "https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L -o models/snac_24khz/pytorch_model.bin "https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"
All other sources (Ollama, pip, GitHub) also work fine through the proxy.
Troubleshooting
| Problem | Fix |
|---|---|
| OSError: couldn't connect to huggingface.co | Use hf-mirror.com or run bash setup-tts.sh |
| SNAC decoder not found | Run bash setup-tts.sh or bash download-tts-models.sh snac |
| Model not found at models/Qwen3-TTS-* | Run bash setup-tts.sh or bash download-tts-models.sh qwen |
| Orpheus generates no audio tokens | Ensure ollama serve is running and ollama list shows sematre/orpheus:en |
| MPS out of memory for Qwen3-TTS | Close other apps (Windsurf uses ~18 GB), or set device="cpu" in the test script |
| Slow generation on CPU | Expected for 0.6B model. MPS should be ~2-3× faster |
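For the MPS and CPU rows above, device selection can be made explicit with a small helper. This is a sketch with a hypothetical function name; it assumes PyTorch is the backend and falls back to CPU when PyTorch is missing or MPS is unavailable.

```python
def pick_device(prefer_mps: bool = True) -> str:
    """Return "mps" when requested and available, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    if prefer_mps and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```

Passing `prefer_mps=False` forces CPU, which is the workaround for the out-of-memory case at the cost of slower generation.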