saravanakumardb1 9f6c216d0f docs(voicebox): add setup guide for local voice cloning studio

Covers: what it is, architecture diagram, prerequisites (Python 3.12,
Bun, Rust), step-by-step install for macOS and WSL2, running backend
+ web frontend, first use (model download, voice profiles, story editor),
Make commands reference, platform performance table, troubleshooting
(proxy workaround, MPS/CUDA, transformers conflict), file structure,
and relevance to LysnrAI

2026-02-22 15:45:32 -08:00


Voicebox — Local Voice Cloning Studio

Repo: github.com/jamiepine/voicebox · Version: 0.1.12
Stack: Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
Local clone: __LOCAL_LLMs/APPS/Voice/voicebox/ (gitignored)

What Is Voicebox?

Voicebox is an open-source, local-first voice cloning studio powered by Qwen3-TTS. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.

┌──────────────────────────────────────────────────────────────────────┐
│ Voicebox Architecture                                                │
│                                                                      │
│  ┌───────────────────────┐          ┌──────────────────────────┐   │
│  │ Web UI (Vite + React) │          │ Tauri Desktop App        │   │
│  │ http://localhost:5173  │    OR    │ (Rust + native window)   │   │
│  └──────────┬────────────┘          └──────────┬───────────────┘   │
│             │                                   │                    │
│             └──────────┐   ┌────────────────────┘                   │
│                        ▼   ▼                                        │
│               ┌─────────────────────┐                               │
│               │ FastAPI Backend      │                               │
│               │ http://localhost:17493│                               │
│               │                     │                               │
│               │ • Qwen3-TTS model   │                               │
│               │ • Voice profiles     │                               │
│               │ • Audio generation   │                               │
│               │ • Story editor       │                               │
│               │ • SQLite database    │                               │
│               │ • REST API           │                               │
│               └─────────────────────┘                               │
│                        │                                             │
│               ┌────────┴────────┐                                   │
│               │ GPU Acceleration│                                   │
│               │ MPS (Mac) or    │                                   │
│               │ CUDA (Windows)  │                                   │
│               └─────────────────┘                                   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Key Features

Feature	Description
Voice cloning	Record or upload a few seconds of audio → create a voice profile
Text-to-speech	Type text, pick a voice, generate speech with Qwen3-TTS
Story editor	Multi-voice timeline for podcasts, narratives, audiobooks
Multi-track audio	DAW-like editing with multiple voices/tracks
REST API	Full API for integration (port 17493)
100% local	No cloud, no data leaves your machine
Cross-platform	macOS (MLX Metal), Windows/Linux (PyTorch CUDA)

Prerequisites

Component	Required	Check Command
Python	3.12 or 3.13	`python3.12 --version`
Bun	≥ 1.0	`bun --version`
Rust	Latest stable (for Tauri desktop only)	`rustc --version`
Git	Any	`git --version`
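The checks in the table above can also be run in one pass. The following is a minimal sketch, not part of Voicebox: the function names are hypothetical, and it only tests that each tool is on PATH and that the interpreter is Python 3.12 or 3.13. It does not verify the Bun version.

```python
import shutil
import sys

# Tools from the prerequisites table. Rust is only needed for the Tauri desktop app.
REQUIRED_TOOLS = ["bun", "git"]
OPTIONAL_TOOLS = ["rustc"]


def find_missing(tools, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.12 or 3.13, as this guide requires."""
    return tuple(version_info[:2]) in {(3, 12), (3, 13)}


if __name__ == "__main__":
    print("Python 3.12/3.13:", python_ok())
    print("Missing required tools:", find_missing(REQUIRED_TOOLS) or "none")
    print("Missing optional tools:", find_missing(OPTIONAL_TOOLS) or "none")
```

Run it with the interpreter you intend to use for the venv (for example `python3.12`), since `python_ok()` reports on whichever interpreter executes the script.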

Platform-Specific

Platform	GPU Backend	Extra Requirements
macOS (Apple Silicon)	MLX (Metal)	Xcode Command Line Tools
macOS (Intel)	CPU only	—
Windows/WSL2	PyTorch CUDA	NVIDIA drivers + CUDA toolkit
Linux	PyTorch CUDA	NVIDIA drivers + CUDA toolkit

Installation (Step-by-Step)

Step 1: Install Prerequisites

macOS

# Install Bun
brew install oven-sh/bun/bun

# Install Python 3.12 (if not present)
brew install python@3.12

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Windows (WSL2)

# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc

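# Install Python 3.12 (if your distro release does not package it,
# you may need an extra repository such as the deadsnakes PPA on Ubuntu)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev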

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Step 2: Clone the Repository

# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

Current location on this Mac: /Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/

Step 3: Install JavaScript Dependencies

# Root workspace dependencies
bun install

# Web frontend dependencies (separate)
cd web && bun install && cd ..

Step 4: Install Python Dependencies

# Option A: Use the Makefile (recommended)
make setup-python

# Option B: Manual
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt

# Apple Silicon only — MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt

# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git

Step 5: Verify Installation

# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"

# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check MLX (Apple Silicon only)
backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
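The separate one-liners above can be combined into a single script that does not crash when a package is missing. This is a sketch under assumptions: the preference order (MLX, then CUDA, then MPS, then CPU) is chosen here for illustration and is not taken from Voicebox's own backend-selection code.

```python
import importlib.util


def module_available(name):
    """True if `name` could be imported, without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def pick_backend(has_mlx, has_mps, has_cuda):
    """Pick a backend label. The order is an assumption made for this sketch."""
    if has_mlx:
        return "mlx"
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def detect():
    """Probe the current environment and return a backend label."""
    has_mlx = module_available("mlx")
    has_mps = has_cuda = False
    if module_available("torch"):
        import torch  # imported lazily so the script still runs without PyTorch

        has_mps = torch.backends.mps.is_available()
        has_cuda = torch.cuda.is_available()
    return pick_backend(has_mlx, has_mps, has_cuda)


if __name__ == "__main__":
    print("Detected backend:", detect())
```

Run it with `backend/venv/bin/python` so that it inspects the venv the backend actually uses.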

Running Voicebox

Option A: Web Frontend + Backend (Recommended for Development)

Open two terminals:

Terminal 1 — Backend:

cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493

Expected output:

INFO:     Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO:     Application startup complete.
INFO:     GPU available: MPS (Apple Silicon)

Terminal 2 — Web Frontend:

cd /path/to/voicebox/web
bun run dev

Expected output:

  VITE v5.4.21  ready in 2536 ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose

Open in browser: http://localhost:5173/

Option B: Tauri Desktop App

make dev
# This starts both backend + Tauri desktop window

Requires the Rust toolchain to be installed.

Option C: Backend Only (API Mode)

make dev-backend
# API docs at: http://localhost:17493/docs
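When scripting against the backend, it helps to wait until the server answers before sending requests. The sketch below polls the API docs URL given above. The helper names, the retry count, and the delay are illustrative assumptions rather than anything Voicebox provides.

```python
import time
import urllib.request

DOCS_URL = "http://localhost:17493/docs"  # URL taken from this guide


def is_up(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # urllib.error.URLError and HTTPError are subclasses of OSError
        return False


def wait_until_up(url, attempts=30, delay=1.0, check=is_up, sleep=time.sleep):
    """Poll `url` until it responds or `attempts` checks have failed."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_up(DOCS_URL)` after `make dev-backend` returns `True` once the server responds, or `False` after roughly 30 seconds with the defaults shown.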

First Use

1. Download a Model

On first launch, Voicebox prompts you to download the Qwen3-TTS model (about 1.7 GB).

If the automatic download fails (corporate proxy, etc.):

# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k skips TLS certificate verification; drop it if your proxy's CA is trusted
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
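The same three downloads can be scripted in Python. This is a sketch only: it reuses the mirror host, repository name, and file names from the curl commands above, and the function names are hypothetical. Unlike `curl -k`, `urllib` verifies TLS certificates, so behind an intercepting proxy it may fail unless the proxy's CA is trusted on your machine.

```python
import urllib.request
from pathlib import Path

MIRROR = "https://hf-mirror.com"
REPO = "Qwen/Qwen3-TTS-0.6B"
FILES = ["config.json", "model.safetensors", "tokenizer.json"]


def resolve_url(filename, repo=REPO, mirror=MIRROR, revision="main"):
    """Build the download URL for one file, matching the curl commands above."""
    return f"{mirror}/{repo}/resolve/{revision}/{filename}"


def plan_downloads(target_dir, files=FILES):
    """Return (url, destination path) pairs without touching the network."""
    target = Path(target_dir)
    return [(resolve_url(name), target / name) for name in files]


def download_all(target_dir, fetch=urllib.request.urlretrieve):
    """Create `target_dir` if needed and download every planned file into it."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for url, destination in plan_downloads(target_dir):
        fetch(url, str(destination))
```

From the repository root, `download_all("models/Qwen3-TTS")` writes the files to the same location as the shell version.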

2. Create a Voice Profile

Click "Voice Profiles" in the sidebar
Click "New Profile"
Record a few seconds of your voice — or upload an audio file (.wav, .mp3)
Give it a name → Save

3. Generate Speech

Click "Generate" in the sidebar
Type your text in the input box
Select a voice profile
Click Generate
Listen to the output, then download it as .wav
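After downloading a generated file, you can check its length with Python's standard `wave` module. This sketch assumes the export is a PCM WAV file, which is what `wave` can read. The helper names and the 24000 Hz rate used for the silent test clip are illustrative, not Voicebox's actual output settings.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_silence(path, seconds=1.0, rate=24000):
    """Write a mono 16-bit silent WAV file, used here only as a stand-in clip."""
    frame_count = int(seconds * rate)
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * frame_count)


def demo():
    """Create a 1.5 second clip in a temp directory and measure it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.wav")
        write_silence(path, seconds=1.5)
        return wav_duration_seconds(path)
```

In practice you would call `wav_duration_seconds()` on the file you downloaded from the Generate page.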

4. Story Editor

Click "Stories" in the sidebar
Create a new story
Add segments with different voices
Generate the full story as a single audio file
Export for podcasts, audiobooks, etc.

Ports & URLs

Service	URL	Purpose
Backend API	http://localhost:17493	FastAPI server
API Docs	http://localhost:17493/docs	Swagger/OpenAPI docs
Web Frontend	http://localhost:5173	Vite dev server (web mode)
Tauri App	Native window	Desktop app (if using Tauri)

Make Commands Reference

Command	Description
`make setup`	Full setup (JS + Python + MLX if Apple Silicon)
`make setup-js`	Install JavaScript dependencies only
`make setup-python`	Install Python dependencies + venv
`make dev`	Start backend + Tauri desktop app
`make dev-backend`	Start FastAPI backend only (port 17493)
`make dev-web`	Start backend + web frontend
`make kill-dev`	Kill all dev processes
`make build`	Build server binary + Tauri app
`make build-web`	Build web frontend to `web/dist/`
`make db-init`	Initialize SQLite database
`make db-reset`	Reset database (delete + reinitialize)
`make generate-api`	Generate TypeScript API client from OpenAPI
`make lint`	Run Biome linter
`make format`	Format code with Biome
`make test`	Run all tests
`make clean`	Clean build artifacts
`make clean-all`	Nuclear clean (everything including node_modules)

Platform Performance

Platform	GPU Backend	Speed (est.)	Model Load Time
Mac M4 Pro 48GB	MLX (Metal)	Fast — real-time or faster	~5s
Mac M4 Pro 48GB	PyTorch MPS	Good — near real-time	~8s
RTX 5090 24GB	PyTorch CUDA	Fastest — well above real-time	~3s
RTX 3060 12GB	PyTorch CUDA	Good — real-time	~5s
CPU only (i7)	PyTorch CPU	Slow — below real-time	~15s
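The speed column above is an estimate. To measure your own machine, a common metric is the real-time factor (RTF): generation time divided by the duration of the audio produced. The sketch below is not something Voicebox reports itself; the function names and the two labels are chosen here to match the wording of the table.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / length of audio produced. Lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def describe(rtf):
    """Map an RTF value onto the two broad categories used in the table."""
    return "real-time or faster" if rtf <= 1.0 else "below real-time"


if __name__ == "__main__":
    # Example: 5 seconds of compute to produce a 10 second clip.
    rtf = real_time_factor(5.0, 10.0)
    print(f"RTF {rtf:.2f}: {describe(rtf)}")
```

An RTF of 0.5 means a 10 second clip took 5 seconds to generate. Time the first generation separately, because it also includes model load time.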

Troubleshooting

Backend won't start

# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version

# Check if port is in use
lsof -i :17493

# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
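On systems where `lsof` is not available, the port check can be done with a short Python snippet. This is a sketch: `port_in_use` is a hypothetical helper, and it only tells you that something is listening on the port, not which process.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (17493, 5173):  # backend and web frontend ports from this guide
        state = "in use" if port_in_use(port) else "free"
        print(f"Port {port}: {state}")
```

If port 17493 is reported as in use before you start the backend, another process (possibly a stale dev server) is holding it.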

Frontend won't start (ERR_MODULE_NOT_FOUND)

# Web dependencies need to be installed separately
cd web && bun install && cd ..

# Then start
cd web && bun run dev

Model download fails (corporate proxy)

# Use hf-mirror.com instead of huggingface.co
# See "First Use > Download a Model" section above

MPS not available (macOS)

# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"

# If False — you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch

CUDA not available (Windows/WSL2)

# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"

# If False, install a CUDA build of PyTorch
# (cu121 shown; use the index that matches your driver and CUDA version)
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121

transformers version conflict

mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3

This is a pip dependency warning, not a blocking error: installation completes and Voicebox still runs. The mlx-audio package pins a pre-release version of transformers, but the stable version works fine for Qwen3-TTS.
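To see which versions are actually installed in the venv, you can query package metadata directly. This is a small sketch using the standard library; the helper names are hypothetical, and the package list is taken from the warning message and this guide.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=("transformers", "mlx-audio", "torch")):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name}: {found or 'not installed'}")
```

Run it with `backend/venv/bin/python` so that it reports on the backend's environment rather than the system Python.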

Database issues

# Reset the database
make db-reset
# Or manually (delete, then reinitialize):
rm -f backend/data/voicebox.db
make db-init
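Before deleting the database, it can be worth looking at what it contains. The sketch below lists table names using Python's built-in `sqlite3` module. The helper names are hypothetical, and nothing here assumes a particular Voicebox schema.

```python
import sqlite3


def open_readonly(db_path):
    """Open an existing SQLite file read-only so inspection cannot modify it."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def list_tables(connection):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

For example, `list_tables(open_readonly("backend/data/voicebox.db"))` run from the repository root returns the table names. Opening in read-only mode raises an error if the file does not exist, instead of silently creating an empty database.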

Kill everything

make kill-dev
# Or manually:
pkill -f "uvicorn" || true
pkill -f "vite" || true

File Structure

voicebox/
├── backend/                  # FastAPI Python backend
│   ├── main.py               # App entry point
│   ├── requirements.txt      # Python deps
│   ├── requirements-mlx.txt  # Apple Silicon MLX deps
│   ├── venv/                 # Python virtual environment
│   └── data/voicebox.db      # SQLite database
├── web/                      # Vite + React web frontend
├── app/                      # Shared app components
├── tauri/                    # Tauri desktop app (Rust)
├── landing/                  # Landing page
├── models/                   # Downloaded TTS models
├── scripts/                  # Build/setup scripts
├── Makefile                  # All commands
└── package.json              # Bun workspace root

Relevance to LysnrAI

Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:

Use Case	How
Voice profile testing	Clone voices locally before using in LysnrAI TTS pipeline
Audio content creation	Generate podcast/narration audio for LysnrAI content
TTS experimentation	Test Qwen3-TTS model quality and performance locally
API integration	Voicebox REST API (port 17493) could be called from LysnrAI services
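For the API integration case, a first step is discovering which endpoints the backend exposes. The sketch below reads the OpenAPI document and lists its routes. It assumes the spec is served at `/openapi.json`, which is FastAPI's default path; this guide only confirms `/docs`, so verify the path on your install. No specific Voicebox endpoint names are assumed.

```python
import json
import urllib.request

# Assumption: FastAPI's default OpenAPI path on the backend port from this guide.
OPENAPI_URL = "http://localhost:17493/openapi.json"

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_routes(spec):
    """Return 'METHOD /path' strings for every operation in an OpenAPI spec dict."""
    routes = []
    for path, operations in sorted(spec.get("paths", {}).items()):
        for method in sorted(operations):
            if method in HTTP_METHODS:  # skip keys such as "parameters"
                routes.append(f"{method.upper()} {path}")
    return routes


def fetch_spec(url=OPENAPI_URL, timeout=5.0):
    """Download and parse the OpenAPI document from a running backend."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the backend running, `list_routes(fetch_spec())` gives a quick inventory that a LysnrAI service could use to decide which calls to make.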

Quick Start (TL;DR)

# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install
bun install && cd web && bun install && cd ..
make setup-python

# Run (two terminals)
make dev-backend                    # Terminal 1: backend on :17493
cd web && bun run dev               # Terminal 2: frontend on :5173

# Open
open http://localhost:5173
