learning_ai_common_plat/__LOCAL_LLMs/docs/10-text-to-speech.md
2026-02-21 14:13:07 -08:00


10 — Text-to-Speech (TTS) — Local Setup

Local TTS on Apple Silicon: Orpheus TTS via Ollama + Qwen3-TTS 0.6B direct. Works through corporate proxy via hf-mirror.com. Last updated: 2026-02-21


Overview

Two TTS engines for local speech generation — both run fully offline after initial setup.

| Engine | Model | Size | How It Runs | Quality | Speed |
|---|---|---|---|---|---|
| Orpheus TTS | sematre/orpheus:en | 4 GB | Via Ollama (Metal GPU) | Great: expressive, 8 voices, emotion tags | ~11s for short sentences |
| Qwen3-TTS | Qwen3-TTS-12Hz-0.6B-CustomVoice | 1.2 GB | Direct Python (MPS/CPU) | Excellent: 10 languages, voice design | ~10-20s on MPS |

Architecture

Text → Ollama (Orpheus 3B) → Audio Tokens → SNAC Decoder → WAV file
Text → Qwen3-TTS 0.6B (PyTorch MPS) → WAV file
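
The step between "Audio Tokens" and "SNAC Decoder" in the Orpheus path can be sketched in a few lines. This is a minimal sketch, assuming the layout used by the public Orpheus reference decoder: the model emits `<custom_token_N>` strings, seven per audio frame, and each frame is spread across SNAC's three codebook layers in a 1/2/4 pattern. The function names and the two constants are illustrative; test_orpheus_tts.py remains the authority for what this repo actually does.

```python
import re

TOKEN_RE = re.compile(r"<custom_token_(\d+)>")
CODEBOOK_SIZE = 4096  # assumed per-slot offset in the Orpheus vocabulary
TOKEN_OFFSET = 10     # assumed number of reserved custom tokens before audio codes


def extract_codes(generated_text):
    """Turn '<custom_token_N>' strings into SNAC code values.

    Assumes the text contains only audio tokens, so position i in the
    stream corresponds to slot i % 7 inside a frame.
    """
    ids = [int(n) for n in TOKEN_RE.findall(generated_text)]
    return [n - TOKEN_OFFSET - (i % 7) * CODEBOOK_SIZE for i, n in enumerate(ids)]


def split_layers(codes):
    """De-interleave 7-code frames into the three SNAC layers (1 + 2 + 4)."""
    layer0, layer1, layer2 = [], [], []
    usable = len(codes) - len(codes) % 7  # drop a trailing partial frame
    for i in range(0, usable, 7):
        frame = codes[i:i + 7]
        layer0.append(frame[0])
        layer1.extend([frame[1], frame[4]])
        layer2.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer0, layer1, layer2
```

The three lists would then be converted to tensors and handed to the SNAC decoder, which produces the waveform that is written to the WAV file.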

Quick Start (Fresh Laptop)

The one-shot setup script handles everything — works on any Apple Silicon Mac, including through corporate proxy:

cd __LOCAL_LLMs
bash setup-tts.sh

This installs: Python 3.12, venv, pip packages, Orpheus model (Ollama), SNAC decoder (hf-mirror.com), and optionally Qwen3-TTS 0.6B.

After setup:

.venv-qwen-tts/bin/python test_orpheus_tts.py
afplay test_orpheus_tara.wav

Prerequisites

| Component | How to Install | Notes |
|---|---|---|
| macOS + Apple Silicon | | M1/M2/M3/M4 (MPS acceleration) |
| Homebrew | /bin/bash -c "$(curl -fsSL ...)" | Package manager |
| Ollama | brew install ollama | Local LLM server |
| Python 3.12 | brew install python@3.12 | TTS packages need 3.12 |

All of the above are installed automatically by setup-tts.sh.


Manual Setup (step by step)

If you prefer to run each step yourself instead of setup-tts.sh:

1. Python Environment

cd __LOCAL_LLMs

# Install Python 3.12
brew install python@3.12

# Create isolated venv
/opt/homebrew/bin/python3.12 -m venv .venv-qwen-tts

# Install packages
.venv-qwen-tts/bin/pip install -U snac qwen-tts

2. Orpheus TTS Model (via Ollama)

ollama serve &                          # start Ollama if not running
ollama pull sematre/orpheus:en          # 4 GB, via Ollama registry (works through proxy)
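
To confirm from Python that the server is up and the model was pulled, something like the following may help. It is a sketch that assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/tags` endpoint, which lists local models; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address (assumed)


def model_is_listed(tags_payload, name):
    """Return True if `name` appears in a parsed /api/tags response."""
    return any(m.get("name") == name for m in tags_payload.get("models", []))


def orpheus_ready(name="sematre/orpheus:en", timeout=5):
    """Ask the local Ollama server whether the model has been pulled.

    Returns False when the server is unreachable or the reply is not
    valid JSON, instead of raising.
    """
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=timeout) as resp:
            return model_is_listed(json.load(resp), name)
    except (OSError, ValueError):
        return False
```

If proxy environment variables are set, localhost may need to be excluded (for example via `no_proxy`) so the request is not routed through the corporate proxy.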

3. SNAC Audio Decoder

Downloads via hf-mirror.com, which works through the corporate proxy:

bash download-tts-models.sh snac       # just SNAC (~76 MB)

Or manually:

mkdir -p models/snac_24khz
curl -k -sL -o models/snac_24khz/config.json \
    "https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L --progress-bar -o models/snac_24khz/pytorch_model.bin \
    "https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"

4. Qwen3-TTS 0.6B (optional)

bash download-tts-models.sh qwen       # tokenizer + model (~1.7 GB)

After download everything runs fully offline.


Usage

Orpheus TTS (via Ollama)

# Make sure Ollama is running
ollama serve &

# Run test
.venv-qwen-tts/bin/python test_orpheus_tts.py

# Play output
afplay test_orpheus_tara.wav

Voices: tara, leah, jess, leo, dan, mia, zac, zoe

Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>

# Example prompt format
voice = "tara"
text = "<laugh> That's hilarious! Tell me more."
prompt = f"<custom_token_3><|begin_of_text|>{voice}: {text}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
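
The prompt format above can be wrapped in a helper and sent to Ollama directly. The sketch below assumes Ollama's default port and its `/api/generate` endpoint with `raw` enabled, so that the prompt is passed through without a chat template; whether test_orpheus_tts.py does the same is not confirmed here, and both function names are illustrative.

```python
import json
import urllib.request

VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_prompt(voice, text):
    """Wrap text in the Orpheus prompt format shown above."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    return (
        f"<custom_token_3><|begin_of_text|>{voice}: {text}"
        "<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
    )


def generate_tokens(prompt, model="sematre/orpheus:en",
                    url="http://localhost:11434/api/generate"):
    """Request a completion from Ollama and return the raw generated text."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "raw": True,      # assumption: skip the model's prompt template
        "stream": False,  # return one JSON object instead of a stream
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

With `ollama serve` running, `generate_tokens(build_prompt("tara", "Hello there."))` should return the generated audio-token text, which still has to be decoded with SNAC before it is playable.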

Qwen3-TTS (direct Python)

.venv-qwen-tts/bin/python test_qwen_tts.py
afplay test_output_english.wav

Features:

  • 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
  • Built-in speaker voices (Chelsie, Vivian, Ryan, etc.)
  • Natural language emotion control: instruct="Speak with excitement"
  • Voice cloning from a short audio sample (with the Base model variant)

File Inventory

__LOCAL_LLMs/
├── setup-tts.sh                    # ← START HERE — one-shot setup for fresh laptop
├── download-tts-models.sh          # Download model weights (uses hf-mirror.com)
├── test_orpheus_tts.py             # Orpheus TTS test (Ollama + SNAC)
├── test_qwen_tts.py                # Qwen3-TTS test (direct Python)
├── .venv-qwen-tts/                 # Python 3.12 venv (gitignored, created by setup)
├── models/                         # Downloaded model weights (gitignored)
│   ├── snac_24khz/                 # SNAC audio decoder (~76 MB)
│   ├── Qwen3-TTS-Tokenizer-12Hz/  # Qwen3-TTS tokenizer (optional)
│   └── Qwen3-TTS-12Hz-0.6B-CustomVoice/  # Qwen3-TTS model (~1.2 GB, optional)
└── *.wav                           # Generated audio output (gitignored)
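
A quick way to see which model files are still missing is to check the layout above from Python. This is a sketch based only on the paths listed in the inventory; the `missing_paths` helper and the split into required and optional entries are illustrative.

```python
from pathlib import Path

# Paths taken from the file inventory, relative to __LOCAL_LLMs/
REQUIRED = [
    "models/snac_24khz/config.json",
    "models/snac_24khz/pytorch_model.bin",
]
OPTIONAL = [
    "models/Qwen3-TTS-Tokenizer-12Hz",
    "models/Qwen3-TTS-12Hz-0.6B-CustomVoice",
]


def missing_paths(root, include_optional=False):
    """Return the expected model paths that do not exist under root."""
    wanted = REQUIRED + (OPTIONAL if include_optional else [])
    return [p for p in wanted if not (Path(root) / p).exists()]
```

Calling `missing_paths(".", include_optional=True)` from inside __LOCAL_LLMs returns an empty list once both engines are fully downloaded.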

OSS TTS Landscape (as of Feb 2026)

Speech-to-Text (STT)

| Model | By | Notes |
|---|---|---|
| Whisper / whisper-cpp | OpenAI / ggerganov | Gold standard, already installed, Metal-accelerated |
| Faster Whisper | SYSTRAN | 4× faster via CTranslate2 |
| Distil-Whisper | Hugging Face | 6× faster, 49% fewer params |

Text-to-Speech (TTS)

| Model | By | Size | Notes |
|---|---|---|---|
| Qwen3-TTS | Alibaba | 0.6B / 1.7B | Best quality, 10 languages, voice cloning, Jan 2026 |
| Orpheus TTS | Canopy AI | 3B | Expressive, 8 voices, emotion tags, available on Ollama |
| Kokoro | HF Community | 82M | Very fast, near-commercial quality, Apache 2.0 |
| Piper | Rhasspy | ONNX | Lightweight, runs on Raspberry Pi |
| F5-TTS | SWivid | | Zero-shot voice cloning, flow matching |
| StyleTTS 2 | Columbia U | | Human-level quality, style diffusion |
| OuteTTS | Community | | Pure LLM-based TTS, runs via llama.cpp |
| Bark | Suno | | Speech + music + sound effects |

Corporate Proxy Notes

| Source | Status | Workaround |
|---|---|---|
| Ollama registry (registry.ollama.ai) | Works | Ollama pull uses its own CDN |
| PyPI (via artifact.it.att.com) | Works | Corporate Artifactory mirror |
| GitHub releases | Works | Direct download |
| HuggingFace (huggingface.co) | Blocked | Use hf-mirror.com as mirror (works through proxy) |
| hf-mirror.com (HF mirror) | Works | Chinese HF mirror, not blocked by Forcepoint |

Forcepoint CSO intercepts HTTPS and serves a block page for HuggingFace. No SSL workaround works for huggingface.co. However, hf-mirror.com (a Chinese mirror of HuggingFace) is not blocked and can be used to download model weights:

# Download SNAC config + weights via mirror (-k skips TLS certificate verification)
mkdir -p models/snac_24khz
curl -k -L -o models/snac_24khz/config.json "https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L -o models/snac_24khz/pytorch_model.bin "https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"

All other sources (Ollama, pip, GitHub) also work fine through the proxy.
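
The same mirror download can be scripted in Python when curl is not convenient. This is a sketch under two assumptions: the mirror serves every file under the `resolve/<revision>/` path (the curl commands use `raw/` for config.json), and the `mirror_url` and `download` helpers are hypothetical names, not part of download-tts-models.sh.

```python
import urllib.request
from pathlib import Path

MIRROR = "https://hf-mirror.com"


def mirror_url(repo, filename, revision="main"):
    """Build the download URL for one file in a model repo on the mirror."""
    return f"{MIRROR}/{repo}/resolve/{revision}/{filename}"


def download(repo, filenames, dest_dir):
    """Fetch each file into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        target = dest / name
        if target.exists():
            continue  # already downloaded
        urllib.request.urlretrieve(mirror_url(repo, name), target)
    return dest
```

For SNAC this would be `download("hubertsiuzdak/snac_24khz", ["config.json", "pytorch_model.bin"], "models/snac_24khz")`. Unlike `curl -k`, urllib verifies TLS certificates, so behind an HTTPS-intercepting proxy this may fail unless the corporate CA is trusted by Python.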


Troubleshooting

| Problem | Fix |
|---|---|
| OSError: couldn't connect to huggingface.co | Use hf-mirror.com or run bash setup-tts.sh |
| SNAC decoder not found | Run bash setup-tts.sh or bash download-tts-models.sh snac |
| Model not found at models/Qwen3-TTS-* | Run bash setup-tts.sh or bash download-tts-models.sh qwen |
| Orpheus generates no audio tokens | Ensure ollama serve is running and ollama list shows sematre/orpheus:en |
| MPS out of memory for Qwen3-TTS | Close other apps (Windsurf uses ~18 GB). Or use device="cpu" in test script |
| Slow generation on CPU | Expected for 0.6B model. MPS should be ~2-3× faster |
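
For the MPS and CPU rows, the device choice can be made explicit rather than hard-coded. The helper below is a sketch: `pick_device` is a hypothetical name, it relies on PyTorch's `torch.backends.mps.is_available()`, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device(prefer_mps=True):
    """Return "mps" when PyTorch can use the Apple GPU, otherwise "cpu"."""
    if not prefer_mps:
        return "cpu"
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch in this environment
    return "mps" if torch.backends.mps.is_available() else "cpu"
```

Passing `prefer_mps=False` gives the same effect as setting device="cpu" in the test script when MPS runs out of memory.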