docs(voicebox): add setup guide for local voice cloning studio

Covers: what it is, architecture diagram, prerequisites (Python 3.12,
Bun, Rust), step-by-step install for macOS and WSL2, running backend
+ web frontend, first use (model download, voice profiles, story editor),
Make commands reference, platform performance table, troubleshooting
(proxy workaround, MPS/CUDA, transformers conflict), file structure,
and relevance to LysnrAI
saravanakumardb1 2026-02-22 15:45:32 -08:00
parent c50f271e1c
commit 9f6c216d0f


@ -0,0 +1,451 @@
# Voicebox — Local Voice Cloning Studio
> **Repo:** [github.com/jamiepine/voicebox](https://github.com/jamiepine/voicebox) · **Version:** 0.1.12
> **Stack:** Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
> **Local clone:** `__LOCAL_LLMs/APPS/Voice/voicebox/` (gitignored)
---
## What Is Voicebox?
Voicebox is an open-source, local-first voice cloning studio powered by **Qwen3-TTS**. It provides a DAW-like (digital audio workstation) interface for voice synthesis and serves as a local alternative to cloud services such as ElevenLabs.
```
┌──────────────────────────────────────────────────────────────────────┐
│                        Voicebox Architecture                         │
│                                                                      │
│    ┌───────────────────────┐        ┌──────────────────────────┐     │
│    │ Web UI (Vite + React) │        │ Tauri Desktop App        │     │
│    │ http://localhost:5173 │   OR   │ (Rust + native window)   │     │
│    └───────────┬───────────┘        └────────────┬─────────────┘     │
│                │                                 │                   │
│                └─────────────┐       ┌───────────┘                   │
│                              ▼       ▼                               │
│                      ┌────────────────────────┐                      │
│                      │    FastAPI Backend     │                      │
│                      │ http://localhost:17493 │                      │
│                      │                        │                      │
│                      │ • Qwen3-TTS model      │                      │
│                      │ • Voice profiles       │                      │
│                      │ • Audio generation     │                      │
│                      │ • Story editor         │                      │
│                      │ • SQLite database      │                      │
│                      │ • REST API             │                      │
│                      └───────────┬────────────┘                      │
│                                  │                                   │
│                         ┌────────┴────────┐                          │
│                         │ GPU Acceleration│                          │
│                         │ MPS (Mac) or    │                          │
│                         │ CUDA (Windows)  │                          │
│                         └─────────────────┘                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### Key Features
| Feature | Description |
| --------------------- | ---------------------------------------------------------------- |
| **Voice cloning** | Record or upload a few seconds of audio → create a voice profile |
| **Text-to-speech** | Type text, pick a voice, generate speech with Qwen3-TTS |
| **Story editor** | Multi-voice timeline for podcasts, narratives, audiobooks |
| **Multi-track audio** | DAW-like editing with multiple voices/tracks |
| **REST API** | Full API for integration (port 17493) |
| **100% local** | No cloud, no data leaves your machine |
| **Cross-platform** | macOS (MLX Metal), Windows/Linux (PyTorch CUDA) |
---
## Prerequisites
| Component | Required | Check Command |
| ---------- | -------------------------------------- | ---------------------- |
| **Python** | 3.12 or 3.13 | `python3.12 --version` |
| **Bun** | ≥ 1.0 | `bun --version` |
| **Rust** | Latest stable (for Tauri desktop only) | `rustc --version` |
| **Git** | Any | `git --version` |
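
The checks in the table can also be scripted. The following is a minimal sketch, not part of Voicebox: the helper names `find_tools` and `python_ok` are made up for illustration, and the tool names come from the "Check Command" column above.

```python
import shutil
import sys

# Tool names taken from the "Check Command" column above.
REQUIRED_TOOLS = ["bun", "git"]
OPTIONAL_TOOLS = ["rustc"]  # only needed for the Tauri desktop app


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def python_ok(version_info=sys.version_info):
    """True when the interpreter is 3.12 or 3.13, the versions listed above."""
    return (version_info[0], version_info[1]) in {(3, 12), (3, 13)}


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS).items():
        print(f"{name}: {path or 'not found'}")
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"python {major}.{minor} supported: {python_ok()}")
```

Run it with the interpreter you intend to use for the backend venv, since `python_ok` inspects the interpreter that executes the script.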
### Platform-Specific
| Platform | GPU Backend | Extra Requirements |
| ------------------------- | ------------ | ----------------------------- |
| **macOS (Apple Silicon)** | MLX (Metal) | Xcode Command Line Tools |
| **macOS (Intel)** | CPU only | — |
| **Windows/WSL2** | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
| **Linux** | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
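
As a rough illustration of the platform table, the sketch below maps `platform.system()` and `platform.machine()` values to the expected GPU backend. The function name `expected_gpu_backend` is hypothetical, and the fallback to "CPU only" when CUDA is unavailable is an assumption, not something the table states.

```python
import platform


def expected_gpu_backend(system, machine, cuda_available=False):
    """Return the GPU backend the platform table suggests for this host.

    system/machine are the strings returned by platform.system() and
    platform.machine(). WSL2 reports "Linux" as its system.
    """
    if system == "Darwin":
        # Apple Silicon reports "arm64"; Intel Macs report "x86_64".
        return "MLX (Metal)" if machine == "arm64" else "CPU only"
    if system in ("Linux", "Windows"):
        # Assumption: without a working CUDA setup the backend runs on CPU.
        return "PyTorch CUDA" if cuda_available else "CPU only"
    return "CPU only"


if __name__ == "__main__":
    print(expected_gpu_backend(platform.system(), platform.machine()))
```

Note that a Python interpreter running under Rosetta on Apple Silicon reports `x86_64`, so this check reflects the interpreter's architecture rather than the hardware.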
---
## Installation (Step-by-Step)
### Step 1: Install Prerequisites
#### macOS
```bash
# Install Bun
brew install oven-sh/bun/bun
# Install Python 3.12 (if not present)
brew install python@3.12
# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
#### Windows (WSL2)
```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc
# Install Python 3.12 (refresh the package index first)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev
# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Step 2: Clone the Repository
```bash
# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
```
> **Current location on this Mac:** `/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/`
### Step 3: Install JavaScript Dependencies
```bash
# Root workspace dependencies
bun install
# Web frontend dependencies (separate)
cd web && bun install && cd ..
```
### Step 4: Install Python Dependencies
```bash
# Option A: Use the Makefile (recommended)
make setup-python
# Option B: Manual
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt
# Apple Silicon only — MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt
# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git
```
### Step 5: Verify Installation
```bash
# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"
# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Check MLX (Apple Silicon only)
backend/venv/bin/python -c "from importlib.metadata import version; print('MLX:', version('mlx'))"
```
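
The one-liners above can be folded into a single report. This is a sketch under assumptions: `probe` is a hypothetical helper, and it assumes the import name and the installed distribution name are the same (true for `torch`, `fastapi`, and `mlx`). It does not import the packages, so it will not catch a broken install that only fails at import time.

```python
from importlib import metadata, util


def probe(packages):
    """Map each top-level package name to its version, or None if absent.

    Uses find_spec so heavy packages such as torch are located without
    being imported.
    """
    report = {}
    for name in packages:
        if util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[name] = "installed (version unknown)"
    return report


if __name__ == "__main__":
    for name, version in probe(["torch", "fastapi", "mlx"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Save it anywhere and run it with `backend/venv/bin/python` so that it inspects the backend's virtual environment rather than the system Python.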
---
## Running Voicebox
### Option A: Web Frontend + Backend (Recommended for Development)
Open **two terminals:**
**Terminal 1 — Backend:**
```bash
cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493
```
Expected output:
```
INFO: Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO: Application startup complete.
INFO: GPU available: MPS (Apple Silicon)
```
**Terminal 2 — Web Frontend:**
```bash
cd /path/to/voicebox/web
bun run dev
```
Expected output:
```
VITE v5.4.21 ready in 2536 ms
➜ Local: http://localhost:5173/
➜ Network: use --host to expose
```
**Open in browser:** [http://localhost:5173/](http://localhost:5173/)
### Option B: Tauri Desktop App
```bash
make dev
# This starts both backend + Tauri desktop window
```
> Requires the Rust toolchain (see Step 1).
### Option C: Backend Only (API Mode)
```bash
make dev-backend
# API docs at: http://localhost:17493/docs
```
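
When scripting against the backend in API mode, it can help to wait until the port accepts connections before sending requests. A minimal sketch, assuming the backend listens on port 17493 as described above; `port_open` and `wait_for_port` are hypothetical helper names, and a successful TCP connection only shows that something is listening, not that it is Voicebox.

```python
import socket
import time

HOST = "127.0.0.1"
PORT = 17493  # backend port used throughout this guide


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=10, delay=0.5):
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print(f"backend reachable on {HOST}:{PORT}: {port_open(HOST, PORT)}")
```

The same check is useful in the opposite direction: if `port_open` returns `True` before you have started the backend, another process already holds the port.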
---
## First Use
### 1. Download a Model
On first launch, Voicebox will prompt you to download the Qwen3-TTS model. This is ~1.7 GB.
If the automatic download fails (corporate proxy, etc.):
```bash
# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k disables TLS certificate verification; drop it if the proxy's CA is trusted
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
```
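
After a manual download it is worth confirming that the files actually arrived, because a failed `curl` can leave an empty file behind. The sketch below is an assumption-laden check: `missing_model_files` is a hypothetical helper, and the file list simply mirrors the three `curl` commands above rather than any list published by Voicebox.

```python
from pathlib import Path

# File names mirror the three curl commands above.
REQUIRED_FILES = ("config.json", "model.safetensors", "tokenizer.json")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required files that are absent or empty in model_dir."""
    root = Path(model_dir)
    return [
        name
        for name in required
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_model_files("models/Qwen3-TTS")
    print("all files present" if not missing else f"missing or empty: {missing}")
```

A size check cannot detect a proxy block page saved under the model's file name, so inspect any unexpectedly small file by hand.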
### 2. Create a Voice Profile
1. Click **"Voice Profiles"** in the sidebar
2. Click **"New Profile"**
3. **Record** a few seconds of your voice — or **upload** an audio file (.wav, .mp3)
4. Give it a name → Save
### 3. Generate Speech
1. Click **"Generate"** in the sidebar
2. Type your text in the input box
3. Select a voice profile
4. Click **Generate**
5. Listen to the output, download as .wav
### 4. Story Editor
1. Click **"Stories"** in the sidebar
2. Create a new story
3. Add segments with different voices
4. Generate the full story as a single audio file
5. Export for podcasts, audiobooks, etc.
---
## Ports & URLs
| Service | URL | Purpose |
| ---------------- | ---------------------------------------------------------- | ---------------------------- |
| **Backend API** | [http://localhost:17493](http://localhost:17493) | FastAPI server |
| **API Docs** | [http://localhost:17493/docs](http://localhost:17493/docs) | Swagger/OpenAPI docs |
| **Web Frontend** | [http://localhost:5173](http://localhost:5173) | Vite dev server (web mode) |
| **Tauri App** | Native window | Desktop app (if using Tauri) |
---
## Make Commands Reference
| Command | Description |
| ------------------- | ------------------------------------------------- |
| `make setup` | Full setup (JS + Python + MLX if Apple Silicon) |
| `make setup-js` | Install JavaScript dependencies only |
| `make setup-python` | Install Python dependencies + venv |
| `make dev` | Start backend + Tauri desktop app |
| `make dev-backend` | Start FastAPI backend only (port 17493) |
| `make dev-web` | Start backend + web frontend |
| `make kill-dev` | Kill all dev processes |
| `make build` | Build server binary + Tauri app |
| `make build-web` | Build web frontend to `web/dist/` |
| `make db-init` | Initialize SQLite database |
| `make db-reset` | Reset database (delete + reinitialize) |
| `make generate-api` | Generate TypeScript API client from OpenAPI |
| `make lint` | Run Biome linter |
| `make format` | Format code with Biome |
| `make test` | Run all tests |
| `make clean` | Clean build artifacts |
| `make clean-all` | Nuclear clean (everything including node_modules) |
---
## Platform Performance
| Platform | GPU Backend | Speed (est.) | Model Load Time |
| ------------------- | ------------ | ------------------------------ | --------------- |
| **Mac M4 Pro 48GB** | MLX (Metal) | Fast — real-time or faster | ~5s |
| **Mac M4 Pro 48GB** | PyTorch MPS | Good — near real-time | ~8s |
| **RTX 5090 24GB** | PyTorch CUDA | Fastest — well above real-time | ~3s |
| **RTX 3060 12GB** | PyTorch CUDA | Good — real-time | ~5s |
| **CPU only (i7)** | PyTorch CPU | Slow — below real-time | ~15s |
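
The "real-time" wording in the table can be made concrete with a real-time factor (RTF): generation time divided by the duration of the audio produced. The sketch below shows the arithmetic only; the function names are hypothetical and the figures in the table remain estimates, not measurements produced by this code.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def describe(rtf):
    """Translate an RTF value into the wording used in the table."""
    if rtf < 1.0:
        return "faster than real-time"
    if rtf == 1.0:
        return "real-time"
    return "slower than real-time"


if __name__ == "__main__":
    # Example: 5 s of compute for a 10 s clip.
    rtf = real_time_factor(5.0, 10.0)
    print(f"RTF {rtf:.2f}: {describe(rtf)}")
```

To measure your own machine, time one generation request and divide by the length of the resulting .wav file.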
---
## Troubleshooting
### Backend won't start
```bash
# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version
# Check if port is in use
lsof -i :17493
# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
```
### Frontend won't start (ERR_MODULE_NOT_FOUND)
```bash
# Web dependencies need to be installed separately
cd web && bun install && cd ..
# Then start
cd web && bun run dev
```
### Model download fails (corporate proxy)
```bash
# Use hf-mirror.com instead of huggingface.co
# See "First Use > Download a Model" section above
```
### MPS not available (macOS)
```bash
# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"
# If False — you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch
```
### CUDA not available (Windows/WSL2)
```bash
# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"
# If False — install CUDA PyTorch
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### transformers version conflict
```
mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
```
This is a dependency-resolver warning, not a blocking error, and everything still works. The `mlx-audio` package pins a pre-release version of transformers; the stable version is fine for Qwen3-TTS.
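
To see which pin is involved, the requirement string from the warning can be compared with the installed version. This is a sketch with hypothetical helper names (`parse_pin`, `pin_mismatch`); it handles only exact `==` pins like the one in the message above.

```python
from importlib import metadata


def parse_pin(requirement):
    """Split an exact pin such as 'transformers==5.0.0rc3' into (name, version)."""
    name, sep, version = requirement.partition("==")
    if not sep:
        raise ValueError(f"not an exact pin: {requirement!r}")
    return name.strip(), version.strip()


def pin_mismatch(requirement, installed_version):
    """True when the installed version differs from the pinned one."""
    _, wanted = parse_pin(requirement)
    return wanted != installed_version


def installed_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    pin = "transformers==5.0.0rc3"  # taken from the warning text above
    name, _ = parse_pin(pin)
    have = installed_version(name)
    print(f"{pin}: installed={have}, mismatch={pin_mismatch(pin, have)}")
```

For a full report across every installed package, `backend/venv/bin/pip check` lists all unsatisfied requirements in the environment.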
### Database issues
```bash
# Reset the database
make db-reset
# Or manually (delete the file, then re-create it):
rm -f backend/data/voicebox.db
make db-init
```
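
Before resetting, it may be worth looking at what the database contains. The following sketch opens the SQLite file read-only and lists its tables; `list_tables` is a hypothetical helper, and the actual table names depend on Voicebox's schema, which this guide does not document.

```python
import sqlite3
from pathlib import Path

DB_PATH = "backend/data/voicebox.db"  # path from the File Structure section


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    if Path(DB_PATH).is_file():
        print(list_tables(DB_PATH))
    else:
        print(f"{DB_PATH} not found; run `make db-init` first")
```

Opening with `mode=ro` means the script cannot create or modify the file, so it is safe to run while deciding whether a reset is needed.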
### Kill everything
```bash
make kill-dev
# Or manually:
pkill -f "uvicorn" || true
pkill -f "vite" || true
```
---
## File Structure
```
voicebox/
├── backend/ # FastAPI Python backend
│ ├── main.py # App entry point
│ ├── requirements.txt # Python deps
│ ├── requirements-mlx.txt # Apple Silicon MLX deps
│ ├── venv/ # Python virtual environment
│ └── data/voicebox.db # SQLite database
├── web/ # Vite + React web frontend
├── app/ # Shared app components
├── tauri/ # Tauri desktop app (Rust)
├── landing/ # Landing page
├── models/ # Downloaded TTS models
├── scripts/ # Build/setup scripts
├── Makefile # All commands
└── package.json # Bun workspace root
```
---
## Relevance to LysnrAI
Voicebox is a standalone tool and is not integrated into LysnrAI. It is still useful for:
| Use Case | How |
| -------------------------- | -------------------------------------------------------------------- |
| **Voice profile testing** | Clone voices locally before using in LysnrAI TTS pipeline |
| **Audio content creation** | Generate podcast/narration audio for LysnrAI content |
| **TTS experimentation** | Test Qwen3-TTS model quality and performance locally |
| **API integration** | Voicebox REST API (port 17493) could be called from LysnrAI services |
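
For the API integration case, a first step is discovering which endpoints the backend exposes. FastAPI serves its schema at `/openapi.json` by default, so the sketch below assumes Voicebox has not changed that default. The helper names are hypothetical, and no specific Voicebox endpoint is assumed.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:17493"
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_openapi(base_url=BASE_URL, timeout=5.0):
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json)."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)


def list_operations(schema):
    """Return sorted 'METHOD /path' strings from an OpenAPI schema dict."""
    operations = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method in sorted(item):
            if method in HTTP_METHODS:  # skip keys such as "parameters"
                operations.append(f"{method.upper()} {path}")
    return operations
```

With the backend running, `list_operations(fetch_openapi())` prints the same operations that the Swagger page at `/docs` shows, in a form a LysnrAI service could log or validate against.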
---
## Quick Start (TL;DR)
```bash
# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
# Install
bun install && cd web && bun install && cd ..
make setup-python
# Run (two terminals)
make dev-backend # Terminal 1: backend on :17493
cd web && bun run dev # Terminal 2: frontend on :5173
# Open
open http://localhost:5173
```