docs(voicebox): add setup guide for local voice cloning studio
Covers: what it is, architecture diagram, prerequisites (Python 3.12, Bun, Rust), step-by-step install for macOS and WSL2, running backend + web frontend, first use (model download, voice profiles, story editor), Make commands reference, platform performance table, troubleshooting (proxy workaround, MPS/CUDA, transformers conflict), file structure, and relevance to LysnrAI
parent c50f271e1c
commit 9f6c216d0f
__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md (normal file, 451 lines added)
@@ -0,0 +1,451 @@
# Voicebox: Local Voice Cloning Studio

> **Repo:** [github.com/jamiepine/voicebox](https://github.com/jamiepine/voicebox) · **Version:** 0.1.12
> **Stack:** Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
> **Local clone:** `__LOCAL_LLMs/APPS/Voice/voicebox/` (gitignored)

---

## What Is Voicebox?

Voicebox is an open-source, local-first voice cloning studio powered by **Qwen3-TTS**. It provides a DAW-like interface for professional voice synthesis, as a local alternative to cloud services like ElevenLabs.

```
┌──────────────────────────────────────────────────────────────────────┐
│                        Voicebox Architecture                         │
│                                                                      │
│   ┌───────────────────────┐        ┌──────────────────────────┐      │
│   │ Web UI (Vite + React) │        │ Tauri Desktop App        │      │
│   │ http://localhost:5173 │   OR   │ (Rust + native window)   │      │
│   └──────────┬────────────┘        └──────────┬───────────────┘      │
│              │                                │                      │
│              └───────────────┐   ┌────────────┘                      │
│                              ▼   ▼                                   │
│                   ┌────────────────────────┐                         │
│                   │ FastAPI Backend        │                         │
│                   │ http://localhost:17493 │                         │
│                   │                        │                         │
│                   │ • Qwen3-TTS model      │                         │
│                   │ • Voice profiles       │                         │
│                   │ • Audio generation     │                         │
│                   │ • Story editor         │                         │
│                   │ • SQLite database      │                         │
│                   │ • REST API             │                         │
│                   └────────────┬───────────┘                         │
│                                │                                     │
│                       ┌────────┴────────┐                            │
│                       │ GPU Acceleration│                            │
│                       │ MPS (Mac) or    │                            │
│                       │ CUDA (Windows)  │                            │
│                       └─────────────────┘                            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Key Features

| Feature               | Description                                                      |
| --------------------- | ---------------------------------------------------------------- |
| **Voice cloning**     | Record or upload a few seconds of audio → create a voice profile |
| **Text-to-speech**    | Type text, pick a voice, generate speech with Qwen3-TTS          |
| **Story editor**      | Multi-voice timeline for podcasts, narratives, audiobooks        |
| **Multi-track audio** | DAW-like editing with multiple voices/tracks                     |
| **REST API**          | Full API for integration (port 17493)                            |
| **100% local**        | No cloud, no data leaves your machine                            |
| **Cross-platform**    | macOS (MLX Metal), Windows/Linux (PyTorch CUDA)                  |

---

## Prerequisites

| Component  | Required                               | Check Command          |
| ---------- | -------------------------------------- | ---------------------- |
| **Python** | 3.12 or 3.13                           | `python3.12 --version` |
| **Bun**    | ≥ 1.0                                  | `bun --version`        |
| **Rust**   | Latest stable (for Tauri desktop only) | `rustc --version`      |
| **Git**    | Any                                    | `git --version`        |
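
The checks in the table can be run in one pass. The following is a minimal sketch, not part of Voicebox: `have_cmd` and `report_prereqs` are hypothetical helper names, and the script only tests whether each tool is on `PATH` (it does not compare version numbers).

```bash
#!/usr/bin/env bash
# Sketch: report which prerequisite tools are on PATH.
# The tool list mirrors the table above; the function names are illustrative.

have_cmd() {
  # command -v exits 0 when the name resolves to an executable, builtin, or function.
  command -v "$1" >/dev/null 2>&1
}

report_prereqs() {
  local missing=0 tool
  for tool in "$@"; do
    if have_cmd "$tool"; then
      printf 'ok       %s\n' "$tool"
    else
      printf 'MISSING  %s\n' "$tool"
      missing=$((missing + 1))
    fi
  done
  # Return non-zero when at least one tool is missing.
  [ "$missing" -eq 0 ]
}

# rustc is only needed for the Tauri desktop app.
report_prereqs python3.12 bun git rustc || echo "Install the missing tools before continuing."
```

If only the web frontend is needed, `rustc` can be dropped from the final call.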

### Platform-Specific

| Platform                  | GPU Backend  | Extra Requirements            |
| ------------------------- | ------------ | ----------------------------- |
| **macOS (Apple Silicon)** | MLX (Metal)  | Xcode Command Line Tools      |
| **macOS (Intel)**         | CPU only     | None                          |
| **Windows/WSL2**          | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
| **Linux**                 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
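
To see which row of the table applies to the current machine, the mapping can be sketched as a small shell function. This is an illustration under stated assumptions: `gpu_backend_for` is a hypothetical name, the Linux branch assumes an NVIDIA GPU is present, and a terminal running under Rosetta on Apple Silicon reports `x86_64` and would be classified as CPU only.

```bash
#!/usr/bin/env bash
# Sketch: map `uname` output to the GPU backend listed in the table above.

gpu_backend_for() {
  local os="$1" arch="$2"
  case "$os" in
    Darwin)
      # Apple Silicon reports arm64; Intel Macs report x86_64.
      if [ "$arch" = "arm64" ]; then
        echo "MLX (Metal)"
      else
        echo "CPU only"
      fi
      ;;
    Linux)
      # Covers native Linux and WSL2 (uname -s prints Linux inside WSL2).
      # Assumes an NVIDIA GPU with working drivers.
      echo "PyTorch CUDA"
      ;;
    *)
      echo "unknown"
      ;;
  esac
}

gpu_backend_for "$(uname -s)" "$(uname -m)"
```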

---

## Installation (Step-by-Step)

### Step 1: Install Prerequisites

#### macOS

```bash
# Install Bun
brew install oven-sh/bun/bun

# Install Python 3.12 (if not present)
brew install python@3.12

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

#### Windows (WSL2)

```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc

# Install Python 3.12
# (on Ubuntu releases that do not package 3.12, add a source that does,
#  such as the deadsnakes PPA, before this step)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

### Step 2: Clone the Repository

```bash
# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
```

> **Current location on this Mac:** `/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/`

### Step 3: Install JavaScript Dependencies

```bash
# Root workspace dependencies
bun install

# Web frontend dependencies (separate)
cd web && bun install && cd ..
```

### Step 4: Install Python Dependencies

```bash
# Option A: Use the Makefile (recommended)
make setup-python

# Option B: Manual
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt

# Apple Silicon only: MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt

# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git
```

### Step 5: Verify Installation

```bash
# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"

# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check MLX (Apple Silicon only)
# Reads the version from package metadata, since the top-level mlx module may not expose __version__.
backend/venv/bin/python -c "from importlib.metadata import version; print('MLX:', version('mlx'))"
```

---

## Running Voicebox

### Option A: Web Frontend + Backend (Recommended for Development)

Open **two terminals:**

**Terminal 1 (Backend):**

```bash
cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493
```

Expected output:

```
INFO:     Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO:     Application startup complete.
INFO:     GPU available: MPS (Apple Silicon)
```
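
Before starting the frontend in the second terminal, it can help to confirm that something is listening on the backend port. The following is a minimal sketch, assuming Bash (it relies on Bash's `/dev/tcp` redirection, which other shells do not provide). `wait_for_backend` is a hypothetical helper, not part of Voicebox.

```bash
#!/usr/bin/env bash
# Sketch: poll a TCP port until it accepts connections or the attempts run out.

wait_for_backend() {
  local host="$1" port="$2" attempts="$3" i=0
  while [ "$i" -lt "$attempts" ]; do
    # Opening /dev/tcp/HOST/PORT succeeds only if a listener accepts the connection.
    # The subshell keeps the temporary file descriptor from leaking into this shell.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (backend port taken from this guide):
#   wait_for_backend 127.0.0.1 17493 30 && echo "backend is up"
```

This only shows that the port is open. It does not check that the model has finished loading.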

**Terminal 2 (Web Frontend):**

```bash
cd /path/to/voicebox/web
bun run dev
```

Expected output:

```
  VITE v5.4.21  ready in 2536 ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose
```

**Open in browser:** [http://localhost:5173/](http://localhost:5173/)

### Option B: Tauri Desktop App

```bash
make dev
# This starts both backend + Tauri desktop window
```

> Requires Rust toolchain installed.

### Option C: Backend Only (API Mode)

```bash
make dev-backend
# API docs at: http://localhost:17493/docs
```
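
In API mode, the available routes can also be listed from the command line instead of the `/docs` page. The following is a sketch under two assumptions: the backend keeps FastAPI's default schema URL (`/openapi.json`), and `python3` is on `PATH`. `list_openapi_paths` is a hypothetical helper name. No specific Voicebox route is assumed.

```bash
#!/usr/bin/env bash
# Sketch: print the route paths found in an OpenAPI document read from stdin.

list_openapi_paths() {
  # json.load parses stdin; the "paths" object maps each route to its operations.
  python3 -c 'import json, sys; print("\n".join(sorted(json.load(sys.stdin).get("paths", {}))))'
}

# Usage (with the backend running):
#   curl -fsS http://localhost:17493/openapi.json | list_openapi_paths
```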

---

## First Use

### 1. Download a Model

On first launch, Voicebox prompts you to download the Qwen3-TTS model (~1.7 GB).

If the automatic download fails (corporate proxy, etc.):

```bash
# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k disables TLS certificate verification; drop it unless the proxy intercepts TLS.
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
```
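
After a manual download, it is worth checking that all three files arrived and are not empty. The following is a minimal sketch: `check_model_dir` is a hypothetical helper, and the file names are the ones used in the `curl` commands above. It checks only that each file exists and is non-empty, so it will not catch a proxy error page saved under the expected name.

```bash
#!/usr/bin/env bash
# Sketch: verify that the manually downloaded model files exist and are non-empty.

check_model_dir() {
  local dir="$1" f status=0
  for f in config.json model.safetensors tokenizer.json; do
    # -s is true only when the file exists and has a size greater than zero.
    if [ -s "$dir/$f" ]; then
      printf 'ok       %s\n' "$f"
    else
      printf 'MISSING  %s (absent or empty)\n' "$f"
      status=1
    fi
  done
  return "$status"
}

# Usage from the repository root:
#   check_model_dir models/Qwen3-TTS
```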

### 2. Create a Voice Profile

1. Click **"Voice Profiles"** in the sidebar
2. Click **"New Profile"**
3. **Record** a few seconds of your voice, or **upload** an audio file (.wav, .mp3)
4. Give it a name → Save

### 3. Generate Speech

1. Click **"Generate"** in the sidebar
2. Type your text in the input box
3. Select a voice profile
4. Click **Generate**
5. Listen to the output, download as .wav

### 4. Story Editor

1. Click **"Stories"** in the sidebar
2. Create a new story
3. Add segments with different voices
4. Generate the full story as a single audio file
5. Export for podcasts, audiobooks, etc.

---

## Ports & URLs

| Service          | URL                                                        | Purpose                      |
| ---------------- | ---------------------------------------------------------- | ---------------------------- |
| **Backend API**  | [http://localhost:17493](http://localhost:17493)           | FastAPI server               |
| **API Docs**     | [http://localhost:17493/docs](http://localhost:17493/docs) | Swagger/OpenAPI docs         |
| **Web Frontend** | [http://localhost:5173](http://localhost:5173)             | Vite dev server (web mode)   |
| **Tauri App**    | Native window                                              | Desktop app (if using Tauri) |

---

## Make Commands Reference

| Command             | Description                                       |
| ------------------- | ------------------------------------------------- |
| `make setup`        | Full setup (JS + Python + MLX if Apple Silicon)   |
| `make setup-js`     | Install JavaScript dependencies only              |
| `make setup-python` | Install Python dependencies + venv                |
| `make dev`          | Start backend + Tauri desktop app                 |
| `make dev-backend`  | Start FastAPI backend only (port 17493)           |
| `make dev-web`      | Start backend + web frontend                      |
| `make kill-dev`     | Kill all dev processes                            |
| `make build`        | Build server binary + Tauri app                   |
| `make build-web`    | Build web frontend to `web/dist/`                 |
| `make db-init`      | Initialize SQLite database                        |
| `make db-reset`     | Reset database (delete + reinitialize)            |
| `make generate-api` | Generate TypeScript API client from OpenAPI       |
| `make lint`         | Run Biome linter                                  |
| `make format`       | Format code with Biome                            |
| `make test`         | Run all tests                                     |
| `make clean`        | Clean build artifacts                             |
| `make clean-all`    | Nuclear clean (everything including node_modules) |

---

## Platform Performance

| Platform            | GPU Backend  | Speed (est.)                   | Model Load Time |
| ------------------- | ------------ | ------------------------------ | --------------- |
| **Mac M4 Pro 48GB** | MLX (Metal)  | Fast: real-time or faster      | ~5s             |
| **Mac M4 Pro 48GB** | PyTorch MPS  | Good: near real-time           | ~8s             |
| **RTX 5090 24GB**   | PyTorch CUDA | Fastest: well above real-time  | ~3s             |
| **RTX 3060 12GB**   | PyTorch CUDA | Good: real-time                | ~5s             |
| **CPU only (i7)**   | PyTorch CPU  | Slow: below real-time          | ~15s            |

---

## Troubleshooting

### Backend won't start

```bash
# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version

# Check if port is in use
lsof -i :17493

# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
```

### Frontend won't start (ERR_MODULE_NOT_FOUND)

```bash
# Web dependencies need to be installed separately
cd web && bun install && cd ..

# Then start
cd web && bun run dev
```

### Model download fails (corporate proxy)

```bash
# Use hf-mirror.com instead of huggingface.co
# See "First Use > Download a Model" section above
```

### MPS not available (macOS)

```bash
# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"

# If False, you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch
```

### CUDA not available (Windows/WSL2)

```bash
# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"

# If False, install a CUDA build of PyTorch.
# cu121 is one example index; pick the one that matches your driver and GPU
# (RTX 50-series cards need a newer build such as cu128).
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### transformers version conflict

```
mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
```

pip reports this as a dependency conflict, but it is a warning rather than a blocking error, and everything still works. The `mlx-audio` package pins a pre-release version of `transformers`, and the stable version is fine for Qwen3-TTS.

### Database issues

```bash
# Reset the database
make db-reset
# Or manually: delete the file, then reinitialize
rm -f backend/data/voicebox.db
make db-init
```

### Kill everything

```bash
make kill-dev
# Or manually (this matches every uvicorn and vite process on the machine, not only Voicebox):
pkill -f "uvicorn" || true
pkill -f "vite" || true
```

---

## File Structure

```
voicebox/
├── backend/                  # FastAPI Python backend
│   ├── main.py               # App entry point
│   ├── requirements.txt      # Python deps
│   ├── requirements-mlx.txt  # Apple Silicon MLX deps
│   ├── venv/                 # Python virtual environment
│   └── data/voicebox.db      # SQLite database
├── web/                      # Vite + React web frontend
├── app/                      # Shared app components
├── tauri/                    # Tauri desktop app (Rust)
├── landing/                  # Landing page
├── models/                   # Downloaded TTS models
├── scripts/                  # Build/setup scripts
├── Makefile                  # All commands
└── package.json              # Bun workspace root
```

---

## Relevance to LysnrAI

Voicebox is a standalone tool and is not integrated into LysnrAI. It is still useful for:

| Use Case                   | How                                                                  |
| -------------------------- | -------------------------------------------------------------------- |
| **Voice profile testing**  | Clone voices locally before using in LysnrAI TTS pipeline            |
| **Audio content creation** | Generate podcast/narration audio for LysnrAI content                 |
| **TTS experimentation**    | Test Qwen3-TTS model quality and performance locally                 |
| **API integration**        | Voicebox REST API (port 17493) could be called from LysnrAI services |

---

## Quick Start (TL;DR)

```bash
# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install
bun install && cd web && bun install && cd ..
make setup-python

# Run (two terminals)
make dev-backend          # Terminal 1: backend on :17493
cd web && bun run dev     # Terminal 2: frontend on :5173

# Open (macOS; on Linux/WSL2 use xdg-open or paste the URL into a browser)
open http://localhost:5173
```