Voicebox — Local Voice Cloning Studio
Repo: github.com/jamiepine/voicebox
Version: 0.1.12
Stack: Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
Local clone: __LOCAL_LLMs/APPS/Voice/voicebox/ (gitignored)
What Is Voicebox?
Voicebox is an open-source, local-first voice cloning studio powered by Qwen3-TTS. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.
┌──────────────────────────────────────────────────────────────────────┐
│ Voicebox Architecture │
│ │
│ ┌───────────────────────┐ ┌──────────────────────────┐ │
│ │ Web UI (Vite + React) │ │ Tauri Desktop App │ │
│ │ http://localhost:5173 │ OR │ (Rust + native window) │ │
│ └──────────┬────────────┘ └──────────┬───────────────┘ │
│ │ │ │
│ └──────────┐ ┌────────────────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ FastAPI Backend │ │
│ │ http://localhost:17493│ │
│ │ │ │
│ │ • Qwen3-TTS model │ │
│ │ • Voice profiles │ │
│ │ • Audio generation │ │
│ │ • Story editor │ │
│ │ • SQLite database │ │
│ │ • REST API │ │
│ └─────────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ GPU Acceleration│ │
│ │ MPS (Mac) or │ │
│ │ CUDA (Windows) │ │
│ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
Key Features
| Feature | Description |
|---|---|
| Voice cloning | Record or upload a few seconds of audio → create a voice profile |
| Text-to-speech | Type text, pick a voice, generate speech with Qwen3-TTS |
| Story editor | Multi-voice timeline for podcasts, narratives, audiobooks |
| Multi-track audio | DAW-like editing with multiple voices/tracks |
| REST API | Full API for integration (port 17493) |
| 100% local | No cloud, no data leaves your machine |
| Cross-platform | macOS (MLX Metal), Windows/Linux (PyTorch CUDA) |
Prerequisites
| Component | Required | Check Command |
|---|---|---|
| Python | 3.12 or 3.13 | python3.12 --version |
| Bun | ≥ 1.0 | bun --version |
| Rust | Latest stable (for Tauri desktop only) | rustc --version |
| Git | Any | git --version |
Platform-Specific
| Platform | GPU Backend | Extra Requirements |
|---|---|---|
| macOS (Apple Silicon) | MLX (Metal) | Xcode Command Line Tools |
| macOS (Intel) | CPU only | — |
| Windows/WSL2 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
| Linux | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
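The two tables above can be checked in one pass with a small script. This is a minimal sketch, not part of Voicebox: the function names `python_ok` and `missing_tools` are made up for illustration, and it only confirms that the tools are on PATH, not that their versions are new enough.

```python
import shutil
import sys

# Tools named in the prerequisites table. rustc is only needed for the Tauri desktop app.
REQUIRED_TOOLS = ("bun", "git")
OPTIONAL_TOOLS = ("rustc",)


def python_ok(version=sys.version_info):
    """True when the interpreter is Python 3.12 or 3.13, as the table requires."""
    return (version[0], version[1]) in {(3, 12), (3, 13)}


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python 3.12/3.13:", python_ok())
    print("Missing required tools:", missing_tools(REQUIRED_TOOLS))
    print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS))
```

Run it with the interpreter you plan to use for the venv (for example `python3.12`), since `python_ok()` reports on whichever interpreter executes the script.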
Installation (Step-by-Step)
Step 1: Install Prerequisites
macOS
# Install Bun
brew install oven-sh/bun/bun
# Install Python 3.12 (if not present)
brew install python@3.12
# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Windows (WSL2)
# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc
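# Install Python 3.12 (in the default repos on Ubuntu 24.04; older releases
# need an extra package source such as the deadsnakes PPA)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev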
# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Step 2: Clone the Repository
# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
Current location on this Mac:
/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/
Step 3: Install JavaScript Dependencies
# Root workspace dependencies
bun install
# Web frontend dependencies (separate)
cd web && bun install && cd ..
Step 4: Install Python Dependencies
# Option A: Use the Makefile (recommended)
make setup-python
# Option B: Manual
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt
# Apple Silicon only — MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt
# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git
Step 5: Verify Installation
# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"
# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Check MLX (Apple Silicon only); the version string lives on mlx.core
backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
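The separate one-liners above can be folded into a single probe that does not crash when a package is missing. This is a sketch only: the priority order (MLX, then CUDA, then MPS, then CPU) is an assumption based on the platform table, not a description of how Voicebox itself selects its backend, and `pick_backend` and `probe` are illustrative names.

```python
import importlib.util
import platform


def pick_backend(cuda_available, mps_available, mlx_installed):
    """Pick a label in the assumed order: MLX, then CUDA, then MPS, then CPU."""
    if mlx_installed:
        return "mlx"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def probe():
    """Collect availability flags; a missing package counts as 'not available'."""
    mlx_installed = importlib.util.find_spec("mlx") is not None
    cuda = mps = False
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the script also runs without PyTorch

        cuda = torch.cuda.is_available()
        mps = torch.backends.mps.is_available()
    return cuda, mps, mlx_installed


if __name__ == "__main__":
    print(platform.system(), platform.machine(), "->", pick_backend(*probe()))
```

Run it with `backend/venv/bin/python` so that it inspects the packages installed in the Voicebox virtual environment rather than the system interpreter.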
Running Voicebox
Option A: Web Frontend + Backend (Recommended for Development)
Open two terminals:
Terminal 1 — Backend:
cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493
Expected output:
INFO: Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO: Application startup complete.
INFO: GPU available: MPS (Apple Silicon)
Terminal 2 — Web Frontend:
cd /path/to/voicebox/web
bun run dev
Expected output:
VITE v5.4.21 ready in 2536 ms
➜ Local: http://localhost:5173/
➜ Network: use --host to expose
Open in browser: http://localhost:5173/
Option B: Tauri Desktop App
make dev
# This starts both backend + Tauri desktop window
Requires Rust toolchain installed.
Option C: Backend Only (API Mode)
make dev-backend
# API docs at: http://localhost:17493/docs
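When the backend runs on its own, a script can confirm it is reachable before sending any requests. The sketch below only relies on the `/docs` URL listed above; `backend_is_up` is an illustrative name, and the check says nothing about whether a model has been loaded.

```python
import http.client
import urllib.request


def backend_is_up(base_url="http://localhost:17493", timeout=2.0):
    """Return True when GET <base_url>/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("Voicebox backend reachable:", backend_is_up())
```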
First Use
1. Download a Model
On first launch, Voicebox prompts you to download the Qwen3-TTS model (about 1.7 GB).
If the automatic download fails (corporate proxy, etc.):
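# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k disables TLS certificate verification; drop it if the mirror's
# certificate validates on your network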
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
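Behind an intercepting proxy, `curl` can save an HTML block page under the expected file name without reporting an error. A quick sanity check of the download directory catches that case. This is a sketch under assumptions: the file list is taken from the three `curl` commands above, `check_model_dir` is an illustrative name, and the check does not verify that the weights themselves are intact.

```python
import json
from pathlib import Path

# File names taken from the manual download commands above.
EXPECTED_FILES = ("config.json", "model.safetensors", "tokenizer.json")


def check_model_dir(model_dir):
    """Return a list of problems; an empty list means the directory looks complete."""
    root = Path(model_dir)
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {name}")
    for name in ("config.json", "tokenizer.json"):
        path = root / name
        if path.is_file() and path.stat().st_size > 0:
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except ValueError:
                # Covers invalid JSON and undecodable bytes, e.g. a saved proxy block page.
                problems.append(f"not valid JSON: {name}")
    return problems


if __name__ == "__main__":
    print(check_model_dir("models/Qwen3-TTS") or "model directory looks complete")
```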
2. Create a Voice Profile
- Click "Voice Profiles" in the sidebar
- Click "New Profile"
- Record a few seconds of your voice — or upload an audio file (.wav, .mp3)
- Give it a name → Save
3. Generate Speech
- Click "Generate" in the sidebar
- Type your text in the input box
- Select a voice profile
- Click Generate
- Listen to the output, download as .wav
4. Story Editor
- Click "Stories" in the sidebar
- Create a new story
- Add segments with different voices
- Generate the full story as a single audio file
- Export for podcasts, audiobooks, etc.
Ports & URLs
| Service | URL | Purpose |
|---|---|---|
| Backend API | http://localhost:17493 | FastAPI server |
| API Docs | http://localhost:17493/docs | Swagger/OpenAPI docs |
| Web Frontend | http://localhost:5173 | Vite dev server (web mode) |
| Tauri App | Native window | Desktop app (if using Tauri) |
Make Commands Reference
| Command | Description |
|---|---|
| make setup | Full setup (JS + Python + MLX if Apple Silicon) |
| make setup-js | Install JavaScript dependencies only |
| make setup-python | Install Python dependencies + venv |
| make dev | Start backend + Tauri desktop app |
| make dev-backend | Start FastAPI backend only (port 17493) |
| make dev-web | Start backend + web frontend |
| make kill-dev | Kill all dev processes |
| make build | Build server binary + Tauri app |
| make build-web | Build web frontend to web/dist/ |
| make db-init | Initialize SQLite database |
| make db-reset | Reset database (delete + reinitialize) |
| make generate-api | Generate TypeScript API client from OpenAPI |
| make lint | Run Biome linter |
| make format | Format code with Biome |
| make test | Run all tests |
| make clean | Clean build artifacts |
| make clean-all | Nuclear clean (everything including node_modules) |
Platform Performance
| Platform | GPU Backend | Speed (est.) | Model Load Time |
|---|---|---|---|
| Mac M4 Pro 48GB | MLX (Metal) | Fast — real-time or faster | ~5s |
| Mac M4 Pro 48GB | PyTorch MPS | Good — near real-time | ~8s |
| RTX 5090 24GB | PyTorch CUDA | Fastest — well above real-time | ~3s |
| RTX 3060 12GB | PyTorch CUDA | Good — real-time | ~5s |
| CPU only (i7) | PyTorch CPU | Slow — below real-time | ~15s |
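The "real-time" labels in the table can be turned into a number with the real-time factor (RTF): generation time divided by the duration of the audio produced. An RTF below 1.0 means faster than real time. The sketch below is one way to measure it on your own machine; it assumes the exported file is integer PCM WAV (which Python's `wave` module can read), and the function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM .wav file, computed as frame count / sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / length of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds
```

For example, if a clip takes 4 seconds to generate and plays for 10 seconds, `real_time_factor(4.0, 10.0)` gives 0.4, which is well above real-time speed.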
Troubleshooting
Backend won't start
# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version
# Check if port is in use
lsof -i :17493
# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
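If `lsof` is not installed (common on a fresh WSL2 image), the same port check can be done from Python. This is a minimal sketch; `port_in_use` is an illustrative name, and it only probes the IPv4 loopback address, so a server bound solely to IPv6 would be reported as free.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True when something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (17493, 5173):  # backend and Vite dev server ports from this guide
        print(port, "in use" if port_in_use(port) else "free")
```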
Frontend won't start (ERR_MODULE_NOT_FOUND)
# Web dependencies need to be installed separately
cd web && bun install && cd ..
# Then start
cd web && bun run dev
Model download fails (corporate proxy)
# Use hf-mirror.com instead of huggingface.co
# See "First Use > Download a Model" section above
MPS not available (macOS)
# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"
# If False — you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch
CUDA not available (Windows/WSL2)
# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"
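# If False: install a CUDA build of PyTorch from the index that matches your GPU and driver
# (cu121 shown; RTX 50-series cards need a CUDA 12.8 build, e.g. the cu128 index)
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121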
transformers version conflict
mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
This is a pip dependency warning, not a blocking error, and everything still works. The mlx-audio package pins a pre-release version of transformers; the stable version is fine for Qwen3-TTS.
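To see which versions are actually installed when this warning appears, the package metadata can be queried directly. A small sketch, with `installed_version` as an illustrative helper name:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None when the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("transformers", "mlx-audio", "torch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it with `backend/venv/bin/python` so the versions reported are those of the Voicebox virtual environment.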
Database issues
# Reset the database
make db-reset
# Or manually (delete, then reinitialize):
rm -f backend/data/voicebox.db
make db-init
Kill everything
make kill-dev
# Or manually:
pkill -f "uvicorn" || true
pkill -f "vite" || true
File Structure
voicebox/
├── backend/ # FastAPI Python backend
│ ├── main.py # App entry point
│ ├── requirements.txt # Python deps
│ ├── requirements-mlx.txt # Apple Silicon MLX deps
│ ├── venv/ # Python virtual environment
│ └── data/voicebox.db # SQLite database
├── web/ # Vite + React web frontend
├── app/ # Shared app components
├── tauri/ # Tauri desktop app (Rust)
├── landing/ # Landing page
├── models/ # Downloaded TTS models
├── scripts/ # Build/setup scripts
├── Makefile # All commands
└── package.json # Bun workspace root
Relevance to LysnrAI
Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:
| Use Case | How |
|---|---|
| Voice profile testing | Clone voices locally before using in LysnrAI TTS pipeline |
| Audio content creation | Generate podcast/narration audio for LysnrAI content |
| TTS experimentation | Test Qwen3-TTS model quality and performance locally |
| API integration | Voicebox REST API (port 17493) could be called from LysnrAI services |
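Before calling the API from another service, it helps to see which routes the backend actually exposes rather than guessing endpoint names. The sketch below reads the OpenAPI schema that FastAPI serves at `/openapi.json` by default (an assumption here: Voicebox has not disabled or moved it) and lists the method and path pairs. `list_routes` and `fetch_schema` are illustrative names.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_routes(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    routes = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


def fetch_schema(base_url="http://localhost:17493", timeout=5.0):
    """Download the schema; /openapi.json is FastAPI's default location."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)
```

With the backend running, `for method, path in list_routes(fetch_schema()): print(method, path)` prints the available routes; the interactive docs at port 17493 show the same information with request and response schemas.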
Quick Start (TL;DR)
# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
# Install
bun install && cd web && bun install && cd ..
make setup-python
# Run (two terminals)
make dev-backend # Terminal 1: backend on :17493
cd web && bun run dev # Terminal 2: frontend on :5173
# Open (macOS; on WSL2/Linux, open the URL in your browser instead)
open http://localhost:5173