saravanakumardb1 f85b455eb5 ci: update CI/CD configuration

2026-02-21 14:13:07 -08:00

10 — Text-to-Speech (TTS) — Local Setup

Local TTS on Apple Silicon using Orpheus TTS (via Ollama) and Qwen3-TTS 0.6B (direct Python). Model downloads work through the corporate proxy via hf-mirror.com. Last updated: 2026-02-21

Overview

Two TTS engines for local speech generation — both run fully offline after initial setup.

Engine	Model	Size	How It Runs	Quality	Speed
Orpheus TTS	`sematre/orpheus:en`	4 GB	Via Ollama (Metal GPU)	Great — expressive, 8 voices, emotion tags	~11s for short sentences
Qwen3-TTS	`Qwen3-TTS-12Hz-0.6B-CustomVoice`	1.2 GB	Direct Python (MPS/CPU)	Excellent — 10 languages, voice design	~10-20s on MPS

Architecture

Text → Ollama (Orpheus 3B) → Audio Tokens → SNAC Decoder → WAV file
Text → Qwen3-TTS 0.6B (PyTorch MPS) → WAV file
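The "Audio Tokens → SNAC Decoder" step of the Orpheus pipeline can be sketched in plain Python. This is a minimal sketch, not the contents of test_orpheus_tts.py. It assumes the token layout used by the public Orpheus reference decoder: audio codes arrive as `<custom_token_N>` strings, 7 codes per frame, each shifted by 10 plus 4096 times its position in the frame. The function names `extract_codes` and `split_layers` are hypothetical.

```python
import re

# Assumptions taken from the public Orpheus reference decoder, not from this repo:
CODES_PER_FRAME = 7   # one SNAC frame is encoded as 7 consecutive audio codes
CODEBOOK_SIZE = 4096  # each position in the frame has its own 4096-entry range
TOKEN_OFFSET = 10     # audio codes start at <custom_token_10>

TOKEN_RE = re.compile(r"<custom_token_(\d+)>")


def extract_codes(generated: str) -> list[int]:
    """Convert '<custom_token_N>' strings to codebook indices, skipping invalid ones."""
    codes: list[int] = []
    for match in TOKEN_RE.finditer(generated):
        position = len(codes) % CODES_PER_FRAME  # slot within the current frame
        code = int(match.group(1)) - TOKEN_OFFSET - position * CODEBOOK_SIZE
        if 0 <= code < CODEBOOK_SIZE:
            codes.append(code)
    return codes


def split_layers(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Regroup flat 7-code frames into three SNAC layers (1, 2 and 4 codes per frame)."""
    usable = len(codes) - len(codes) % CODES_PER_FRAME  # drop a trailing partial frame
    layer1: list[int] = []
    layer2: list[int] = []
    layer3: list[int] = []
    for start in range(0, usable, CODES_PER_FRAME):
        frame = codes[start:start + CODES_PER_FRAME]
        layer1.append(frame[0])
        layer2.extend([frame[1], frame[4]])
        layer3.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer1, layer2, layer3
```

The three lists would then be turned into tensors and passed to the SNAC decoder to produce the 24 kHz waveform; that step needs the `snac` package and the downloaded weights, so it is left out here.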

Quick Start (Fresh Laptop)

The one-shot setup script handles everything and works on any Apple Silicon Mac, including behind a corporate proxy:

cd __LOCAL_LLMs
bash setup-tts.sh

This installs: Python 3.12, venv, pip packages, Orpheus model (Ollama), SNAC decoder (hf-mirror.com), and optionally Qwen3-TTS 0.6B.

After setup:

.venv-qwen-tts/bin/python test_orpheus_tts.py
afplay test_orpheus_tara.wav
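Besides listening with afplay, the generated file can be checked programmatically. The sketch below uses only the standard library and assumes the test script writes an uncompressed PCM WAV; the `wave` module raises `wave.Error` for other encodings. The helper name `wav_summary` is hypothetical.

```python
import wave


def wav_summary(path: str) -> dict:
    """Return channel count, sample rate and duration (seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

Calling `wav_summary("test_orpheus_tara.wav")` after the test run should report a non-zero duration; a duration of zero suggests that no audio tokens were decoded.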

Prerequisites

Component	How to Install	Notes
macOS + Apple Silicon	—	M1/M2/M3/M4 (MPS acceleration)
Homebrew	`/bin/bash -c "$(curl -fsSL ...)"`	Package manager
Ollama	`brew install ollama`	Local LLM server
Python 3.12	`brew install python@3.12`	TTS packages need 3.12

The software components above are installed automatically by setup-tts.sh; only the Apple Silicon Mac itself is a hard prerequisite.

Manual Setup (step by step)

If you prefer to run each step yourself instead of setup-tts.sh:

1. Python Environment

cd __LOCAL_LLMs

# Install Python 3.12
brew install python@3.12

# Create isolated venv
/opt/homebrew/bin/python3.12 -m venv .venv-qwen-tts

# Install packages
.venv-qwen-tts/bin/pip install -U snac qwen-tts

2. Orpheus TTS Model (via Ollama)

ollama serve &                          # start Ollama if not running
ollama pull sematre/orpheus:en          # 4 GB, via Ollama registry (works through proxy)

3. SNAC Audio Decoder

Downloads via hf-mirror.com — works through corporate proxy:

bash download-tts-models.sh snac       # just SNAC (~76 MB)

Or manually:

mkdir -p models/snac_24khz
curl -k -sL -o models/snac_24khz/config.json \
    "https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L --progress-bar -o models/snac_24khz/pytorch_model.bin \
    "https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"

4. Qwen3-TTS 0.6B (optional)

bash download-tts-models.sh qwen       # tokenizer + model (~1.7 GB)

After the downloads complete, everything runs fully offline.

Usage

Orpheus TTS (via Ollama)

# Make sure Ollama is running
ollama serve &

# Run test
.venv-qwen-tts/bin/python test_orpheus_tts.py

# Play output
afplay test_orpheus_tara.wav
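Before running the test it can help to confirm that Ollama is reachable and that the Orpheus model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists local models. It assumes the default address `http://localhost:11434`; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default listen address


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def has_model(tags: dict, wanted: str) -> bool:
    """Return True if the /api/tags payload contains a model with this exact name."""
    return any(model.get("name") == wanted for model in tags.get("models", []))
```

A typical check would be `has_model(fetch_tags(), "sematre/orpheus:en")`. A connection error from `fetch_tags` means `ollama serve` is not running.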

Voices: tara, leah, jess, leo, dan, mia, zac, zoe

Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>

# Example prompt format
voice = "tara"
text = "<laugh> That's hilarious! Tell me more."
prompt = f"<custom_token_3><|begin_of_text|>{voice}: {text}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
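The voice list and prompt template above can be wrapped in a small helper that rejects unknown voices before anything is sent to Ollama. This is a sketch: `build_prompt` and `generate_payload` are hypothetical names, and the use of raw mode (`"raw": True`, which tells Ollama to skip its own prompt template) is an assumption about how the request should be sent, not something confirmed by the test script.

```python
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_prompt(voice: str, text: str) -> str:
    """Wrap the voice and text in the Orpheus prompt format shown above."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    return (
        "<custom_token_3><|begin_of_text|>"
        f"{voice}: {text}"
        "<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
    )


def generate_payload(voice: str, text: str, model: str = "sematre/orpheus:en") -> dict:
    """Build a JSON body for Ollama's POST /api/generate (raw mode is an assumption)."""
    return {
        "model": model,
        "prompt": build_prompt(voice, text),
        "raw": True,      # assumed: send the prompt verbatim, without a chat template
        "stream": False,  # return the whole response in one JSON object
    }
```

Emotion tags such as `<laugh>` go inside `text` unchanged, for example `build_prompt("tara", "<laugh> That's hilarious!")`.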

Qwen3-TTS (direct Python)

.venv-qwen-tts/bin/python test_qwen_tts.py
afplay test_output_english.wav

Features:

10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
Built-in speaker voices (Chelsie, Vivian, Ryan, etc.)
Natural language emotion control: instruct="Speak with excitement"
Voice cloning from a short audio sample (requires the Base model variant)

File Inventory

__LOCAL_LLMs/
├── setup-tts.sh                    # ← START HERE — one-shot setup for fresh laptop
├── download-tts-models.sh          # Download model weights (uses hf-mirror.com)
├── test_orpheus_tts.py             # Orpheus TTS test (Ollama + SNAC)
├── test_qwen_tts.py                # Qwen3-TTS test (direct Python)
├── .venv-qwen-tts/                 # Python 3.12 venv (gitignored, created by setup)
├── models/                         # Downloaded model weights (gitignored)
│   ├── snac_24khz/                 # SNAC audio decoder (~76 MB)
│   ├── Qwen3-TTS-Tokenizer-12Hz/  # Qwen3-TTS tokenizer (optional)
│   └── Qwen3-TTS-12Hz-0.6B-CustomVoice/  # Qwen3-TTS model (~1.2 GB, optional)
└── *.wav                           # Generated audio output (gitignored)
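The inventory above can double as a preflight check. The sketch below reports which of the SNAC files are missing under a given root directory. The file list mirrors the manual download step, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Files the Orpheus path needs, per the manual SNAC download step.
REQUIRED = (
    "models/snac_24khz/config.json",
    "models/snac_24khz/pytorch_model.bin",
)


def missing_files(root: str, required: tuple[str, ...] = REQUIRED) -> list[str]:
    """Return the relative paths from `required` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]
```

An empty list from `missing_files("__LOCAL_LLMs")` means the SNAC decoder is in place; otherwise rerun `bash download-tts-models.sh snac`.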

OSS Speech Landscape (as of Feb 2026)

Speech-to-Text (STT)

Model	By	Notes
Whisper / whisper-cpp	OpenAI / ggerganov	Gold standard, already installed, Metal-accelerated
Faster Whisper	SYSTRAN	4× faster via CTranslate2
Distil-Whisper	Hugging Face	6× faster, 49% fewer params

Text-to-Speech (TTS)

Model	By	Size	Notes
Qwen3-TTS ⭐	Alibaba	0.6B–1.7B	Best quality, 10 languages, voice cloning, Jan 2026
Orpheus TTS	Canopy Labs	3B	Expressive, 8 voices, emotion tags, available on Ollama
Kokoro	HF Community	82M	Very fast, near-commercial quality, Apache 2.0
Piper	Rhasspy	ONNX	Lightweight, runs on Raspberry Pi
F5-TTS	SWivid	—	Zero-shot voice cloning, flow matching
StyleTTS 2	Columbia U	—	Human-level quality, style diffusion
OuteTTS	Community	—	Pure LLM-based TTS, runs via llama.cpp
Bark	Suno	—	Speech + music + sound effects

Corporate Proxy Notes

Source	Status	Workaround
Ollama registry (`registry.ollama.ai`)	✅ Works	Ollama pull uses its own CDN
PyPI (via `artifact.it.att.com`)	✅ Works	Corporate Artifactory mirror
GitHub releases	✅ Works	Direct download
HuggingFace (`huggingface.co`)	❌ Blocked	Use `hf-mirror.com` as mirror (works through proxy)
hf-mirror.com (HF mirror)	✅ Works	Chinese HF mirror, not blocked by Forcepoint

Forcepoint CSO intercepts HTTPS and serves a block page for HuggingFace. No SSL workaround works for huggingface.co. However, hf-mirror.com (a Chinese mirror of HuggingFace) is not blocked and can be used to download model weights:

# Download SNAC config + weights via mirror
curl -k -L -o models/snac_24khz/config.json "https://hf-mirror.com/hubertsiuzdak/snac_24khz/raw/main/config.json"
curl -k -L -o models/snac_24khz/pytorch_model.bin "https://hf-mirror.com/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"
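The same mirror download can be sketched in Python. The helper names `mirror_url` and `fetch` are hypothetical, and the sketch uses the `resolve/main` path for every file. Unlike `curl -k`, `urllib` verifies TLS certificates by default, so this assumes the proxy's certificate is trusted by the Python installation.

```python
import urllib.request
from pathlib import Path

MIRROR = "https://hf-mirror.com"


def mirror_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the mirror download URL for one file of a Hugging Face repository."""
    return f"{MIRROR}/{repo_id}/resolve/{revision}/{filename}"


def fetch(repo_id: str, filename: str, dest_dir: str) -> Path:
    """Download one file from the mirror into dest_dir and return its local path."""
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(mirror_url(repo_id, filename), dest)
    return dest
```

For example, `fetch("hubertsiuzdak/snac_24khz", "config.json", "models/snac_24khz")` would correspond to the first curl command above.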

All other sources (Ollama, pip, GitHub) also work fine through the proxy.

Troubleshooting

Problem	Fix
`OSError: couldn't connect to huggingface.co`	Use `hf-mirror.com` or run `bash setup-tts.sh`
`SNAC decoder not found`	Run `bash setup-tts.sh` or `bash download-tts-models.sh snac`
`Model not found at models/Qwen3-TTS-*`	Run `bash setup-tts.sh` or `bash download-tts-models.sh qwen`
Orpheus generates no audio tokens	Ensure `ollama serve` is running and `ollama list` shows `sematre/orpheus:en`
MPS out of memory for Qwen3-TTS	Close other apps (Windsurf uses ~18 GB), or set `device="cpu"` in the test script
Slow generation on CPU	Expected for 0.6B model. MPS should be ~2-3× faster
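The MPS-or-CPU choice from the last two rows can be made explicit in code. This is a sketch under the assumption that the test script accepts a device string; `pick_device` is a hypothetical helper, and it falls back to CPU when PyTorch is not importable.

```python
def pick_device(force_cpu: bool = False) -> str:
    """Return "mps" when PyTorch reports Metal support, otherwise "cpu"."""
    if force_cpu:
        return "cpu"  # explicit override, e.g. after an MPS out-of-memory error
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch in this environment
    return "mps" if torch.backends.mps.is_available() else "cpu"
```

Passing `force_cpu=True` trades speed for memory headroom, which matches the out-of-memory fix in the table.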
