docs(voicebox): add setup guide for local voice cloning studio

Covers: what it is, architecture diagram, prerequisites (Python 3.12, Bun, Rust), step-by-step install for macOS and WSL2, running backend + web frontend, first use (model download, voice profiles, story editor), Make commands reference, platform performance table, troubleshooting (proxy workaround, MPS/CUDA, transformers conflict), file structure, and relevance to LysnrAI
2026-02-22 15:45:32 -08:00 · 9f6c216d0f
commit 9f6c216d0f
parent c50f271e1c
1 changed files with 451 additions and 0 deletions
--- a/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md
+++ b/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md
@@ -0,0 +1,451 @@
 # Voicebox — Local Voice Cloning Studio
 > **Repo:** [github.com/jamiepine/voicebox](https://github.com/jamiepine/voicebox) · **Version:** 0.1.12
 > **Stack:** Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
 > **Local clone:** `__LOCAL_LLMs/APPS/Voice/voicebox/` (gitignored)
 ---
 ## What Is Voicebox?
 Voicebox is an open-source, local-first voice cloning studio powered by **Qwen3-TTS**. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.
 ```
 ┌────────────────────────────────────────────────────────────────────┐
 │ Voicebox Architecture                                              │
 │                                                                    │
 │  ┌───────────────────────┐          ┌──────────────────────────┐   │
 │  │ Web UI (Vite + React) │          │ Tauri Desktop App        │   │
 │  │ http://localhost:5173 │    OR    │ (Rust + native window)   │   │
 │  └──────────┬────────────┘          └──────────┬───────────────┘   │
 │             │                                  │                   │
 │             └──────────┐   ┌───────────────────┘                   │
 │                        ▼   ▼                                       │
 │              ┌────────────────────────┐                            │
 │              │ FastAPI Backend        │                            │
 │              │ http://localhost:17493 │                            │
 │              │                        │                            │
 │              │ • Qwen3-TTS model      │                            │
 │              │ • Voice profiles       │                            │
 │              │ • Audio generation     │                            │
 │              │ • Story editor         │                            │
 │              │ • SQLite database      │                            │
 │              │ • REST API             │                            │
 │              └───────────┬────────────┘                            │
 │                          │                                         │
 │                 ┌────────┴────────┐                                │
 │                 │ GPU Acceleration│                                │
 │                 │ MLX/MPS (Mac) or│                                │
 │                 │ CUDA (Win/Linux)│                                │
 │                 └─────────────────┘                                │
 │                                                                    │
 └────────────────────────────────────────────────────────────────────┘
 ```
 ### Key Features
 | Feature               | Description                                                      |
 | --------------------- | ---------------------------------------------------------------- |
 | **Voice cloning**     | Record or upload a few seconds of audio → create a voice profile |
 | **Text-to-speech**    | Type text, pick a voice, generate speech with Qwen3-TTS          |
 | **Story editor**      | Multi-voice timeline for podcasts, narratives, audiobooks        |
 | **Multi-track audio** | DAW-like editing with multiple voices/tracks                     |
 | **REST API**          | Full API for integration (port 17493)                            |
 | **100% local**        | No cloud, no data leaves your machine                            |
 | **Cross-platform**    | macOS (MLX Metal), Windows/Linux (PyTorch CUDA)                  |
 ---
 ## Prerequisites
 | Component  | Required                               | Check Command          |
 | ---------- | -------------------------------------- | ---------------------- |
 | **Python** | 3.12 or 3.13                           | `python3.12 --version` |
 | **Bun**    | ≥ 1.0                                  | `bun --version`        |
 | **Rust**   | Latest stable (for Tauri desktop only) | `rustc --version`      |
 | **Git**    | Any                                    | `git --version`        |
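
The check commands in the table can also be run in one pass. The following is a minimal sketch, assuming the tools are expected on `PATH` under the names used above; the helper names `check_tools` and `python_ok` are illustrative and not part of Voicebox.

```python
import shutil
import sys

# Tool names taken from the prerequisites table; Rust is only needed for Tauri.
REQUIRED_TOOLS = ("bun", "git")
OPTIONAL_TOOLS = ("rustc",)


def check_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def python_ok(version=sys.version_info):
    """True when the interpreter is Python 3.12 or 3.13, as the guide requires."""
    return tuple(version[:2]) in {(3, 12), (3, 13)}


for tool, path in check_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS).items():
    print(f"{tool}: {path or 'not found'}")
print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {python_ok()}")
```

Run it with the interpreter you intend to use for the backend, since `python_ok` reports on whichever Python executes the script.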
 ### Platform-Specific
 | Platform                  | GPU Backend  | Extra Requirements            |
 | ------------------------- | ------------ | ----------------------------- |
 | **macOS (Apple Silicon)** | MLX (Metal)  | Xcode Command Line Tools      |
 | **macOS (Intel)**         | CPU only     | —                             |
 | **Windows/WSL2**          | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
 | **Linux**                 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
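
The platform table can be read as a small lookup. The sketch below only mirrors the table rows; it does not detect whether a GPU or driver is actually present, and the function name `expected_backend` is hypothetical. Note that a Python build running under Rosetta on Apple Silicon reports `x86_64` and would be classified as an Intel Mac here.

```python
import platform


def expected_backend(system, machine):
    """Return the GPU backend the platform table lists for this OS/CPU pair."""
    if system == "Darwin":
        # Apple Silicon reports "arm64"; Intel Macs report "x86_64".
        return "MLX (Metal)" if machine == "arm64" else "CPU only"
    if system in ("Linux", "Windows"):
        # WSL2 reports "Linux"; CUDA still needs NVIDIA drivers to be usable.
        return "PyTorch CUDA"
    return "unknown"


print(expected_backend(platform.system(), platform.machine()))
```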
 ---
 ## Installation (Step-by-Step)
 ### Step 1: Install Prerequisites
 #### macOS
 ```bash
 # Install Bun
 brew install oven-sh/bun/bun
 # Install Python 3.12 (if not present)
 brew install python@3.12
 # Rust (only needed for Tauri desktop app)
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ```
 #### Windows (WSL2)
 ```bash
 # Install Bun
 curl -fsSL https://bun.sh/install | bash
 source ~/.bashrc
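 # Install Python 3.12 (in the default repos on Ubuntu 24.04; older releases need an extra source such as the deadsnakes PPA)
 sudo apt update
 sudo apt install -y python3.12 python3.12-venv python3.12-dev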
 # Rust (only needed for Tauri desktop app)
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ```
 ### Step 2: Clone the Repository
 ```bash
 # Clone into the APPS/Voice directory
 cd /path/to/__LOCAL_LLMs/APPS/Voice
 git clone https://github.com/jamiepine/voicebox.git
 cd voicebox
 ```
 > **Current location on this Mac:** `/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/`
 ### Step 3: Install JavaScript Dependencies
 ```bash
 # Root workspace dependencies
 bun install
 # Web frontend dependencies (separate)
 cd web && bun install && cd ..
 ```
 ### Step 4: Install Python Dependencies
 ```bash
 # Option A: Use the Makefile (recommended)
 make setup-python
 # Option B: Manual
 python3.12 -m venv backend/venv
 source backend/venv/bin/activate
 pip install --upgrade pip
 pip install -r backend/requirements.txt
 # Apple Silicon only — MLX for native Metal acceleration
 pip install -r backend/requirements-mlx.txt
 # Install Qwen3-TTS
 pip install git+https://github.com/QwenLM/Qwen3-TTS.git
 ```
 ### Step 5: Verify Installation
 ```bash
 # Check Python venv
 backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
 backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"
 # Check GPU backend
 # macOS:
 backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
 # Windows/Linux:
 backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
 # Check MLX (Apple Silicon only); the version attribute lives on mlx.core
 backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
 ```
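
The one-liners above stop at the first failing import. As an alternative, here is a small sketch that reports every missing module at once without importing them. The module list is taken from the verification commands; the function name is illustrative. Run it with `backend/venv/bin/python` so it inspects the backend's virtual environment.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# torch and fastapi come from the verification commands; mlx is Apple Silicon only.
print("Missing:", missing_modules(["torch", "fastapi", "mlx"]) or "none")
```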
 ---
 ## Running Voicebox
 ### Option A: Web Frontend + Backend (Recommended for Development)
 Open **two terminals:**
 **Terminal 1 — Backend:**
 ```bash
 cd /path/to/voicebox
 make dev-backend
 # Or manually:
 # backend/venv/bin/uvicorn backend.main:app --reload --port 17493
 ```
 Expected output:
 ```
 INFO:     Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
 INFO:     Application startup complete.
 INFO:     GPU available: MPS (Apple Silicon)
 ```
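
When scripting the startup, it can help to wait until the backend answers before launching the frontend. The sketch below polls a URL with the standard library only. It assumes the backend listens on port 17493 as described in this guide, and it bypasses any configured HTTP proxy for the local check, which matters on networks with a corporate proxy. The function name `wait_for_http` is hypothetical. A typical call would be `wait_for_http("http://localhost:17493/docs")`.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=30.0, interval=0.5):
    """Poll url until it answers with any HTTP status, or give up after timeout seconds."""
    # An empty ProxyHandler disables proxy auto-detection for this local check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status: it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: nothing is listening yet.
            time.sleep(interval)
    return False
```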
 **Terminal 2 — Web Frontend:**
 ```bash
 cd /path/to/voicebox/web
 bun run dev
 ```
 Expected output:
 ```
  VITE v5.4.21  ready in 2536 ms
  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose
 ```
 **Open in browser:** [http://localhost:5173/](http://localhost:5173/)
 ### Option B: Tauri Desktop App
 ```bash
 make dev
 # This starts both backend + Tauri desktop window
 ```
 > Requires the Rust toolchain (see Step 1).
 ### Option C: Backend Only (API Mode)
 ```bash
 make dev-backend
 # API docs at: http://localhost:17493/docs
 ```
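
In API mode, the available endpoints can be listed programmatically instead of browsing `/docs`. This sketch assumes the backend keeps FastAPI's default schema location, `/openapi.json`; that is not confirmed for Voicebox, and the helper names are illustrative. Usage would be `for method, path in list_routes(fetch_spec()): print(method, path)` while the backend is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:17493"  # backend port from this guide

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}


def list_routes(spec):
    """Return (METHOD, path) pairs from an OpenAPI document, ordered by path."""
    routes = [
        (method.upper(), path)
        for path, operations in spec.get("paths", {}).items()
        for method in operations
        if method in HTTP_METHODS  # skip non-operation keys such as "parameters"
    ]
    return sorted(routes, key=lambda route: (route[1], route[0]))


def fetch_spec(base_url=BASE_URL):
    """Download the schema; /openapi.json is FastAPI's default and assumed unchanged."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```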
 ---
 ## First Use
 ### 1. Download a Model
 On first launch, Voicebox prompts you to download the Qwen3-TTS model (~1.7 GB).
 If the automatic download fails (corporate proxy, etc.):
 ```bash
 # Manual download via hf-mirror.com (bypasses Forcepoint proxy)
 # -k skips TLS certificate verification; omit it if your network does not intercept TLS
 mkdir -p models/Qwen3-TTS
 cd models/Qwen3-TTS
 curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
 curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
 curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
 cd ../..
 ```
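
The same download can be sketched in Python. The mirror host, repository id, and file names below are copied from the curl commands above and are not independently verified; the function names are hypothetical. Unlike `curl -k`, `urlretrieve` keeps TLS verification enabled, so behind a TLS-intercepting proxy it will fail unless the proxy's CA certificate is trusted by Python.

```python
from pathlib import Path
import urllib.request

MIRROR = "https://hf-mirror.com"  # mirror host used in the curl commands
REPO = "Qwen/Qwen3-TTS-0.6B"      # repository id copied from those commands
FILES = ("config.json", "model.safetensors", "tokenizer.json")


def mirror_url(filename, repo=REPO, base=MIRROR, revision="main"):
    """Build the resolve URL for one file, matching the curl commands' pattern."""
    return f"{base}/{repo}/resolve/{revision}/{filename}"


def download_all(target_dir="models/Qwen3-TTS"):
    """Fetch each file into target_dir, skipping files that are already complete."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        destination = target / name
        if destination.exists():
            continue
        # Download to a .part file first so an interrupted transfer is not
        # mistaken for a finished one on the next run.
        partial = destination.with_name(destination.name + ".part")
        urllib.request.urlretrieve(mirror_url(name), partial)
        partial.replace(destination)
```

Call `download_all()` from the repository root to populate `models/Qwen3-TTS/`.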
 ### 2. Create a Voice Profile
 1. Click **"Voice Profiles"** in the sidebar
 2. Click **"New Profile"**
 3. **Record** a few seconds of your voice — or **upload** an audio file (.wav, .mp3)
 4. Give it a name → Save
 ### 3. Generate Speech
 1. Click **"Generate"** in the sidebar
 2. Type your text in the input box
 3. Select a voice profile
 4. Click **Generate**
 5. Listen to the output, download as .wav
 ### 4. Story Editor
 1. Click **"Stories"** in the sidebar
 2. Create a new story
 3. Add segments with different voices
 4. Generate the full story as a single audio file
 5. Export for podcasts, audiobooks, etc.
 ---
 ## Ports & URLs
 | Service          | URL                                                        | Purpose                      |
 | ---------------- | ---------------------------------------------------------- | ---------------------------- |
 | **Backend API**  | [http://localhost:17493](http://localhost:17493)           | FastAPI server               |
 | **API Docs**     | [http://localhost:17493/docs](http://localhost:17493/docs) | Swagger/OpenAPI docs         |
 | **Web Frontend** | [http://localhost:5173](http://localhost:5173)             | Vite dev server (web mode)   |
 | **Tauri App**    | Native window                                              | Desktop app (if using Tauri) |
 ---
 ## Make Commands Reference
 | Command             | Description                                       |
 | ------------------- | ------------------------------------------------- |
 | `make setup`        | Full setup (JS + Python + MLX if Apple Silicon)   |
 | `make setup-js`     | Install JavaScript dependencies only              |
 | `make setup-python` | Install Python dependencies + venv                |
 | `make dev`          | Start backend + Tauri desktop app                 |
 | `make dev-backend`  | Start FastAPI backend only (port 17493)           |
 | `make dev-web`      | Start backend + web frontend                      |
 | `make kill-dev`     | Kill all dev processes                            |
 | `make build`        | Build server binary + Tauri app                   |
 | `make build-web`    | Build web frontend to `web/dist/`                 |
 | `make db-init`      | Initialize SQLite database                        |
 | `make db-reset`     | Reset database (delete + reinitialize)            |
 | `make generate-api` | Generate TypeScript API client from OpenAPI       |
 | `make lint`         | Run Biome linter                                  |
 | `make format`       | Format code with Biome                            |
 | `make test`         | Run all tests                                     |
 | `make clean`        | Clean build artifacts                             |
 | `make clean-all`    | Nuclear clean (everything including node_modules) |
 ---
 ## Platform Performance
 | Platform            | GPU Backend  | Speed (est.)                   | Model Load Time |
 | ------------------- | ------------ | ------------------------------ | --------------- |
 | **Mac M4 Pro 48GB** | MLX (Metal)  | Fast — real-time or faster     | ~5s             |
 | **Mac M4 Pro 48GB** | PyTorch MPS  | Good — near real-time          | ~8s             |
 | **RTX 5090 24GB**   | PyTorch CUDA | Fastest — well above real-time | ~3s             |
 | **RTX 3060 12GB**   | PyTorch CUDA | Good — real-time               | ~5s             |
 | **CPU only (i7)**   | PyTorch CPU  | Slow — below real-time         | ~15s            |
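
The "real-time" wording in the table can be made concrete as a real-time factor (RTF): generation time divided by the duration of the generated audio. An RTF below 1.0 means speech is produced faster than it plays back. The sketch below shows the arithmetic; the timing values are hypothetical placeholders, not measurements of Voicebox.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute per second of generated audio; below 1.0 beats real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def describe(rtf):
    """Translate an RTF value into the wording used in the table."""
    if rtf < 1.0:
        return "faster than real-time"
    if rtf == 1.0:
        return "real-time"
    return "below real-time"


# Hypothetical timing: 4 s of compute for a 10 s clip.
rtf = real_time_factor(4.0, 10.0)
print(f"RTF {rtf:.2f}: {describe(rtf)}")
```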
 ---
 ## Troubleshooting
 ### Backend won't start
 ```bash
 # Check Python version (needs 3.12 or 3.13)
 backend/venv/bin/python --version
 # Check if port is in use
 lsof -i :17493
 # Try starting manually with verbose output
 backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
 ```
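
Where `lsof` is not available, the port check can be approximated with a plain TCP connection attempt. This is a sketch with a hypothetical helper name; it probes IPv4 loopback only, so a server bound solely to IPv6 `::1` would show as free.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


for port in (17493, 5173):  # backend and Vite ports from this guide
    print(f"port {port}: {'in use' if port_in_use(port) else 'free'}")
```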
 ### Frontend won't start (ERR_MODULE_NOT_FOUND)
 ```bash
 # Web dependencies need to be installed separately
 cd web && bun install && cd ..
 # Then start
 cd web && bun run dev
 ```
 ### Model download fails (corporate proxy)
 ```bash
 # Use hf-mirror.com instead of huggingface.co
 # See "First Use > Download a Model" section above
 ```
 ### MPS not available (macOS)
 ```bash
 # Check PyTorch MPS support
 backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"
 # If False — you may need to update PyTorch
 backend/venv/bin/pip install --upgrade torch
 ```
 ### CUDA not available (Windows/WSL2)
 ```bash
 # Check CUDA
 backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"
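 # If False, install a CUDA build of PyTorch. cu121 is shown; choose the index matching your setup
 # (RTX 50-series GPUs need a CUDA 12.8 build: https://download.pytorch.org/whl/cu128)
 backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121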
 ```
 ### transformers version conflict
 ```
 mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
 ```
 This is a dependency warning from pip, not a blocking error, and everything still works. The `mlx-audio` package pins a pre-release version of transformers, while the installed stable version is fine for Qwen3-TTS.
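
To see which versions are actually involved, the installed package metadata can be queried directly. A minimal sketch using `importlib.metadata`; the helper names are illustrative, and the prefix match on requirement strings is deliberately simple.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pinned_requirements(package, dependency):
    """Return the requirement strings of package that start with dependency."""
    requirements = metadata.requires(package) or []
    return [req for req in requirements if req.lower().startswith(dependency.lower())]


print("transformers:", installed_version("transformers"))
if installed_version("mlx-audio") is not None:
    print("mlx-audio wants:", pinned_requirements("mlx-audio", "transformers"))
```

Run it with `backend/venv/bin/python` so the versions reported are those of the backend environment.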
 ### Database issues
 ```bash
 # Reset the database
 make db-reset
 # Or manually (delete, then reinitialize):
 rm -f backend/data/voicebox.db
 make db-init
 ```
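
Before resetting, it may be worth checking what the database currently contains. The sketch below lists table names using a read-only connection, so inspecting never creates or modifies the file. The path comes from this guide; the schema itself is unknown here, and the function name is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, or an empty list if it does not exist."""
    path = Path(db_path)
    if not path.exists():
        return []
    # mode=ro opens the file read-only and refuses to create a new database.
    connection = sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


print(list_tables("backend/data/voicebox.db"))
```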
 ### Kill everything
 ```bash
 make kill-dev
 # Or manually:
 pkill -f "uvicorn" || true
 pkill -f "vite" || true
 ```
 ---
 ## File Structure
 ```
 voicebox/
 ├── backend/                  # FastAPI Python backend
 │   ├── main.py               # App entry point
 │   ├── requirements.txt      # Python deps
 │   ├── requirements-mlx.txt  # Apple Silicon MLX deps
 │   ├── venv/                 # Python virtual environment
 │   └── data/voicebox.db      # SQLite database
 ├── web/                      # Vite + React web frontend
 ├── app/                      # Shared app components
 ├── tauri/                    # Tauri desktop app (Rust)
 ├── landing/                  # Landing page
 ├── models/                   # Downloaded TTS models
 ├── scripts/                  # Build/setup scripts
 ├── Makefile                  # All commands
 └── package.json              # Bun workspace root
 ```
 ---
 ## Relevance to LysnrAI
 Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:
 | Use Case                   | How                                                                  |
 | -------------------------- | -------------------------------------------------------------------- |
 | **Voice profile testing**  | Clone voices locally before using in LysnrAI TTS pipeline            |
 | **Audio content creation** | Generate podcast/narration audio for LysnrAI content                 |
 | **TTS experimentation**    | Test Qwen3-TTS model quality and performance locally                 |
 | **API integration**        | Voicebox REST API (port 17493) could be called from LysnrAI services |
 ---
 ## Quick Start (TL;DR)
 ```bash
 # Clone
 cd __LOCAL_LLMs/APPS/Voice
 git clone https://github.com/jamiepine/voicebox.git
 cd voicebox
 # Install
 bun install && cd web && bun install && cd ..
 make setup-python
 # Run (two terminals)
 make dev-backend                    # Terminal 1: backend on :17493
 cd web && bun run dev               # Terminal 2: frontend on :5173
 # Open (macOS; on other platforms, open the URL in a browser)
 open http://localhost:5173
 ```