diff --git a/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md b/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md
new file mode 100644
index 00000000..ada39ec6
--- /dev/null
+++ b/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md
@@ -0,0 +1,451 @@
+# Voicebox — Local Voice Cloning Studio
+
+> **Repo:** [github.com/jamiepine/voicebox](https://github.com/jamiepine/voicebox) · **Version:** 0.1.12
+> **Stack:** Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
+> **Local clone:** `__LOCAL_LLMs/APPS/Voice/voicebox/` (gitignored)
+
+---
+
+## What Is Voicebox?
+
+Voicebox is an open-source, local-first voice cloning studio powered by **Qwen3-TTS**. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                        Voicebox Architecture                         │
+│                                                                      │
+│  ┌───────────────────────┐         ┌──────────────────────────┐      │
+│  │ Web UI (Vite + React) │         │ Tauri Desktop App        │      │
+│  │ http://localhost:5173 │    OR   │ (Rust + native window)   │      │
+│  └──────────┬────────────┘         └──────────┬───────────────┘      │
+│             │                                 │                      │
+│             └──────────┐ ┌────────────────────┘                      │
+│                        ▼ ▼                                           │
+│            ┌────────────────────────┐                                │
+│            │ FastAPI Backend        │                                │
+│            │ http://localhost:17493 │                                │
+│            │                        │                                │
+│            │ • Qwen3-TTS model      │                                │
+│            │ • Voice profiles       │                                │
+│            │ • Audio generation     │                                │
+│            │ • Story editor         │                                │
+│            │ • SQLite database      │                                │
+│            │ • REST API             │                                │
+│            └────────────────────────┘                                │
+│                         │                                            │
+│                ┌────────┴────────┐                                   │
+│                │ GPU Acceleration│                                   │
+│                │ MPS (Mac) or    │                                   │
+│                │ CUDA (Windows)  │                                   │
+│                └─────────────────┘                                   │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Key Features
+
+| Feature               | Description                                                      |
+| --------------------- | ---------------------------------------------------------------- |
+| **Voice cloning**     | Record or upload a few seconds of audio → create a voice profile |
+| **Text-to-speech**    | Type text, pick a voice, generate speech with Qwen3-TTS          |
+| **Story editor**      | Multi-voice timeline for podcasts, narratives, audiobooks        |
+| **Multi-track audio** | DAW-like editing with multiple voices/tracks                     |
+| **REST API**          | Full API for integration (port 17493)                            |
+| **100% local**        | No cloud, no data leaves your machine                            |
+| **Cross-platform**    | macOS (MLX Metal), Windows/Linux (PyTorch CUDA)                  |
+
+---
+
+## Prerequisites
+
+| Component  | Required                               | Check Command          |
+| ---------- | -------------------------------------- | ---------------------- |
+| **Python** | 3.12 or 3.13                           | `python3.12 --version` |
+| **Bun**    | ≥ 1.0                                  | `bun --version`        |
+| **Rust**   | Latest stable (for Tauri desktop only) | `rustc --version`      |
+| **Git**    | Any                                    | `git --version`        |
+
+### Platform-Specific
+
+| Platform                  | GPU Backend  | Extra Requirements            |
+| ------------------------- | ------------ | ----------------------------- |
+| **macOS (Apple Silicon)** | MLX (Metal)  | Xcode Command Line Tools      |
+| **macOS (Intel)**         | CPU only     | —                             |
+| **Windows/WSL2**          | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
+| **Linux**                 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
+
+---
+
+## Installation (Step-by-Step)
+
+### Step 1: Install Prerequisites
+
+#### macOS
+
+```bash
+# Install Bun
+brew install oven-sh/bun/bun
+
+# Install Python 3.12 (if not present)
+brew install python@3.12
+
+# Rust (only needed for Tauri desktop app)
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+#### Windows (WSL2)
+
+```bash
+# Install Bun
+curl -fsSL https://bun.sh/install | bash
+source ~/.bashrc
+
+# Install Python 3.12
+sudo apt install -y python3.12 python3.12-venv python3.12-dev
+
+# Rust (only needed for Tauri desktop app)
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+### Step 2: Clone the Repository
+
+```bash
+# Clone into the APPS/Voice directory
+cd /path/to/__LOCAL_LLMs/APPS/Voice
+git clone https://github.com/jamiepine/voicebox.git
+cd voicebox
+```
+
+> **Current location on this Mac:** `/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/`
+
+### Step 3: Install JavaScript Dependencies
+
+```bash
+# Root workspace dependencies
+bun install
+
+# Web frontend dependencies (separate)
+cd web && bun install && cd ..
+```
+
+### Step 4: Install Python Dependencies
+
+```bash
+# Option A: Use the Makefile (recommended)
+make setup-python
+
+# Option B: Manual
+python3.12 -m venv backend/venv
+source backend/venv/bin/activate
+pip install --upgrade pip
+pip install -r backend/requirements.txt
+
+# Apple Silicon only — MLX for native Metal acceleration
+pip install -r backend/requirements-mlx.txt
+
+# Install Qwen3-TTS
+pip install git+https://github.com/QwenLM/Qwen3-TTS.git
+```
+
+### Step 5: Verify Installation
+
+```bash
+# Check Python venv
+backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
+backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"
+
+# Check GPU backend
+# macOS:
+backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
+# Windows/Linux:
+backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+
+# Check MLX (Apple Silicon only); the version attribute lives on mlx.core
+backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
+```
+
+---
+
+## Running Voicebox
+
+### Option A: Web Frontend + Backend (Recommended for Development)
+
+Open **two terminals:**
+
+**Terminal 1 — Backend:**
+
+```bash
+cd /path/to/voicebox
+make dev-backend
+# Or manually:
+# backend/venv/bin/uvicorn backend.main:app --reload --port 17493
+```
+
+Expected output:
+
+```
+INFO: Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
+INFO: Application startup complete.
+INFO: GPU available: MPS (Apple Silicon)
+```
+
+**Terminal 2 — Web Frontend:**
+
+```bash
+cd /path/to/voicebox/web
+bun run dev
+```
+
+Expected output:
+
+```
+  VITE v5.4.21  ready in 2536 ms
+
+  ➜  Local:   http://localhost:5173/
+  ➜  Network: use --host to expose
+```
+
+**Open in browser:** [http://localhost:5173/](http://localhost:5173/)
+
+### Option B: Tauri Desktop App
+
+```bash
+make dev
+# This starts both backend + Tauri desktop window
+```
+
+> Requires the Rust toolchain to be installed.
+
+### Option C: Backend Only (API Mode)
+
+```bash
+make dev-backend
+# API docs at: http://localhost:17493/docs
+```
+
+---
+
+## First Use
+
+### 1. Download a Model
+
+On first launch, Voicebox prompts you to download the Qwen3-TTS model (~1.7 GB).
+
+If the automatic download fails (for example, behind a corporate proxy):
+
+```bash
+# Manual download via hf-mirror.com (bypasses Forcepoint proxy); -k skips TLS certificate verification
+mkdir -p models/Qwen3-TTS
+cd models/Qwen3-TTS
+curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
+curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
+curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
+cd ../..
+```
+
+### 2. Create a Voice Profile
+
+1. Click **"Voice Profiles"** in the sidebar
+2. Click **"New Profile"**
+3. **Record** a few seconds of your voice — or **upload** an audio file (.wav, .mp3)
+4. Give it a name → Save
+
+### 3. Generate Speech
+
+1. Click **"Generate"** in the sidebar
+2. Type your text in the input box
+3. Select a voice profile
+4. Click **Generate**
+5. Listen to the output, then download it as .wav
+
+### 4. Story Editor
+
+1. Click **"Stories"** in the sidebar
+2. Create a new story
+3. Add segments with different voices
+4. Generate the full story as a single audio file
+5. Export for podcasts, audiobooks, etc.
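The backend's REST API (Option C above) can also be explored from a script rather than the browser. The sketch below assumes the backend keeps FastAPI's default schema route, `/openapi.json`, enabled alongside `/docs`; the helper names `fetch_spec` and `list_routes` are illustrative and not part of Voicebox.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, List

BASE_URL = "http://localhost:17493"  # backend port used throughout this guide


def fetch_spec(base_url: str = BASE_URL, timeout: float = 5.0) -> Dict[str, Any]:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json route)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)


def list_routes(spec: Dict[str, Any]) -> List[str]:
    """Return sorted 'METHOD /path' strings for every operation in the schema."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    routes = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items may also carry non-method keys such as "parameters".
            if method.lower() in http_methods:
                routes.append(f"{method.upper()} {path}")
    return sorted(routes)


if __name__ == "__main__":
    try:
        for route in list_routes(fetch_spec()):
            print(route)
    except (urllib.error.URLError, OSError) as exc:
        print(f"Backend not reachable at {BASE_URL}: {exc}")
```

Run it with `backend/venv/bin/python` while `make dev-backend` is active; the printed routes are whatever the running backend reports, so nothing here depends on guessed endpoint names.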
+
+---
+
+## Ports & URLs
+
+| Service          | URL                                                        | Purpose                      |
+| ---------------- | ---------------------------------------------------------- | ---------------------------- |
+| **Backend API**  | [http://localhost:17493](http://localhost:17493)           | FastAPI server               |
+| **API Docs**     | [http://localhost:17493/docs](http://localhost:17493/docs) | Swagger/OpenAPI docs         |
+| **Web Frontend** | [http://localhost:5173](http://localhost:5173)             | Vite dev server (web mode)   |
+| **Tauri App**    | Native window                                              | Desktop app (if using Tauri) |
+
+---
+
+## Make Commands Reference
+
+| Command             | Description                                       |
+| ------------------- | ------------------------------------------------- |
+| `make setup`        | Full setup (JS + Python + MLX if Apple Silicon)   |
+| `make setup-js`     | Install JavaScript dependencies only              |
+| `make setup-python` | Install Python dependencies + venv                |
+| `make dev`          | Start backend + Tauri desktop app                 |
+| `make dev-backend`  | Start FastAPI backend only (port 17493)           |
+| `make dev-web`      | Start backend + web frontend                      |
+| `make kill-dev`     | Kill all dev processes                            |
+| `make build`        | Build server binary + Tauri app                   |
+| `make build-web`    | Build web frontend to `web/dist/`                 |
+| `make db-init`      | Initialize SQLite database                        |
+| `make db-reset`     | Reset database (delete + reinitialize)            |
+| `make generate-api` | Generate TypeScript API client from OpenAPI       |
+| `make lint`         | Run Biome linter                                  |
+| `make format`       | Format code with Biome                            |
+| `make test`         | Run all tests                                     |
+| `make clean`        | Clean build artifacts                             |
+| `make clean-all`    | Nuclear clean (everything including node_modules) |
+
+---
+
+## Platform Performance
+
+| Platform            | GPU Backend  | Speed (est.)                   | Model Load Time |
+| ------------------- | ------------ | ------------------------------ | --------------- |
+| **Mac M4 Pro 48GB** | MLX (Metal)  | Fast — real-time or faster     | ~5s             |
+| **Mac M4 Pro 48GB** | PyTorch MPS  | Good — near real-time          | ~8s             |
+| **RTX 5090 24GB**   | PyTorch CUDA | Fastest — well above real-time | ~3s             |
+| **RTX 3060 12GB**   | PyTorch CUDA | Good — real-time               | ~5s             |
+| **CPU only (i7)**   | PyTorch CPU  | Slow — below real-time         | ~15s            |
+
+---
+
+## Troubleshooting
+
+### Backend won't start
+
+```bash
+# Check Python version (needs 3.12 or 3.13)
+backend/venv/bin/python --version
+
+# Check if port is in use
+lsof -i :17493
+
+# Try starting manually with verbose output
+backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
+```
+
+### Frontend won't start (ERR_MODULE_NOT_FOUND)
+
+```bash
+# Web dependencies need to be installed separately
+cd web && bun install && cd ..
+
+# Then start
+cd web && bun run dev
+```
+
+### Model download fails (corporate proxy)
+
+```bash
+# Use hf-mirror.com instead of huggingface.co
+# See "First Use > Download a Model" section above
+```
+
+### MPS not available (macOS)
+
+```bash
+# Check PyTorch MPS support
+backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"
+
+# If False — you may need to update PyTorch
+backend/venv/bin/pip install --upgrade torch
+```
+
+### CUDA not available (Windows/WSL2)
+
+```bash
+# Check CUDA
+backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"
+
+# If False — install CUDA PyTorch
+backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121
+```
+
+### transformers version conflict
+
+```
+mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
+```
+
+This is a warning, not a blocking error. Everything still works. The MLX-audio package pins a pre-release version of transformers — the stable version is fine for Qwen3-TTS.
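To see which versions are actually installed when this warning appears, the standard library's `importlib.metadata` can report them. This is a minimal sketch meant to be run with `backend/venv/bin/python`; the helper name `installed_versions` is illustrative, and the package names are taken from the warning above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["transformers", "mlx-audio", "torch"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Because it only reads package metadata, the check does not import the libraries themselves and so does not load any model code.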
+
+### Database issues
+
+```bash
+# Reset the database
+make db-reset
+# Or manually (delete, then reinitialize):
+rm -f backend/data/voicebox.db && make db-init
+```
+
+### Kill everything
+
+```bash
+make kill-dev
+# Or manually:
+pkill -f "uvicorn" || true
+pkill -f "vite" || true
+```
+
+---
+
+## File Structure
+
+```
+voicebox/
+├── backend/                  # FastAPI Python backend
+│   ├── main.py               # App entry point
+│   ├── requirements.txt      # Python deps
+│   ├── requirements-mlx.txt  # Apple Silicon MLX deps
+│   ├── venv/                 # Python virtual environment
+│   └── data/voicebox.db      # SQLite database
+├── web/                      # Vite + React web frontend
+├── app/                      # Shared app components
+├── tauri/                    # Tauri desktop app (Rust)
+├── landing/                  # Landing page
+├── models/                   # Downloaded TTS models
+├── scripts/                  # Build/setup scripts
+├── Makefile                  # All commands
+└── package.json              # Bun workspace root
+```
+
+---
+
+## Relevance to LysnrAI
+
+Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:
+
+| Use Case                   | How                                                                  |
+| -------------------------- | -------------------------------------------------------------------- |
+| **Voice profile testing**  | Clone voices locally before using in LysnrAI TTS pipeline            |
+| **Audio content creation** | Generate podcast/narration audio for LysnrAI content                 |
+| **TTS experimentation**    | Test Qwen3-TTS model quality and performance locally                 |
+| **API integration**        | Voicebox REST API (port 17493) could be called from LysnrAI services |
+
+---
+
+## Quick Start (TL;DR)
+
+```bash
+# Clone
+cd __LOCAL_LLMs/APPS/Voice
+git clone https://github.com/jamiepine/voicebox.git
+cd voicebox
+
+# Install
+bun install && cd web && bun install && cd ..
+make setup-python
+
+# Run (two terminals)
+make dev-backend          # Terminal 1: backend on :17493
+cd web && bun run dev     # Terminal 2: frontend on :5173
+
+# Open
+open http://localhost:5173
+```
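After the Quick Start, one way to confirm that both services came up is to probe the two ports from this guide. The sketch below uses only the standard library; `port_open` and the `SERVICES` mapping are illustrative names, and a successful TCP connection only shows that something is listening on the port, not that it is Voicebox.

```python
import socket

# Ports taken from the "Ports & URLs" table in this guide.
SERVICES = {"Backend API": 17493, "Web frontend": 5173}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        # "localhost" lets create_connection try both IPv4 and IPv6 addresses.
        state = "listening" if port_open("localhost", port) else "not reachable"
        print(f"{name} (:{port}): {state}")
```

Probing `localhost` rather than a fixed IPv4 address is a deliberate choice here, since a dev server may bind to only one of the loopback address families.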