learning_ai_common_plat/__LOCAL_LLMs/VOICEBOX/VOICEBOX_SETUP.md
saravanakumardb1 9f6c216d0f docs(voicebox): add setup guide for local voice cloning studio
Covers: what it is, architecture diagram, prerequisites (Python 3.12,
Bun, Rust), step-by-step install for macOS and WSL2, running backend
+ web frontend, first use (model download, voice profiles, story editor),
Make commands reference, platform performance table, troubleshooting
(proxy workaround, MPS/CUDA, transformers conflict), file structure,
and relevance to LysnrAI
2026-02-22 15:45:32 -08:00


Voicebox — Local Voice Cloning Studio

Repo: github.com/jamiepine/voicebox
Version: 0.1.12
Stack: Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
Local clone: __LOCAL_LLMs/APPS/Voice/voicebox/ (gitignored)


What Is Voicebox?

Voicebox is an open-source, local-first voice cloning studio powered by Qwen3-TTS. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.

┌──────────────────────────────────────────────────────────────────────┐
│ Voicebox Architecture                                                │
│                                                                      │
│  ┌───────────────────────┐          ┌──────────────────────────┐   │
│  │ Web UI (Vite + React) │          │ Tauri Desktop App        │   │
│  │ http://localhost:5173  │    OR    │ (Rust + native window)   │   │
│  └──────────┬────────────┘          └──────────┬───────────────┘   │
│             │                                   │                    │
│             └──────────┐   ┌────────────────────┘                   │
│                        ▼   ▼                                        │
│               ┌─────────────────────┐                               │
│               │ FastAPI Backend      │                               │
│               │ http://localhost:17493│                               │
│               │                     │                               │
│               │ • Qwen3-TTS model   │                               │
│               │ • Voice profiles     │                               │
│               │ • Audio generation   │                               │
│               │ • Story editor       │                               │
│               │ • SQLite database    │                               │
│               │ • REST API           │                               │
│               └─────────────────────┘                               │
│                        │                                             │
│               ┌────────┴────────┐                                   │
│               │ GPU Acceleration│                                   │
│               │ MPS (Mac) or    │                                   │
│               │ CUDA (Win/Linux)│                                   │
│               └─────────────────┘                                   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Key Features

| Feature | Description |
|---|---|
| Voice cloning | Record or upload a few seconds of audio → create a voice profile |
| Text-to-speech | Type text, pick a voice, generate speech with Qwen3-TTS |
| Story editor | Multi-voice timeline for podcasts, narratives, audiobooks |
| Multi-track audio | DAW-like editing with multiple voices/tracks |
| REST API | Full API for integration (port 17493) |
| 100% local | No cloud, no data leaves your machine |
| Cross-platform | macOS (MLX Metal), Windows/Linux (PyTorch CUDA) |

Prerequisites

| Component | Required | Check Command |
|---|---|---|
| Python | 3.12 or 3.13 | python3.12 --version |
| Bun | ≥ 1.0 | bun --version |
| Rust | Latest stable (for Tauri desktop only) | rustc --version |
| Git | Any | git --version |

Platform-Specific

| Platform | GPU Backend | Extra Requirements |
|---|---|---|
| macOS (Apple Silicon) | MLX (Metal) | Xcode Command Line Tools |
| macOS (Intel) | CPU only | |
| Windows/WSL2 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
| Linux | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
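
The prerequisite checks can also be run in one pass. The script below is a minimal sketch, not part of Voicebox: the names `python_supported` and `find_tools` are made up for illustration. It checks only the interpreter that runs it, so start it with the `python3.12` (or `python3.13`) you plan to use for the backend.

```python
import shutil
import sys

# Tools from the prerequisites table; rustc is only needed for the Tauri desktop app.
REQUIRED_TOOLS = ["bun", "git"]
OPTIONAL_TOOLS = ["rustc"]
SUPPORTED_PYTHON = {(3, 12), (3, 13)}


def python_supported(version=None):
    """Return True if the (major, minor) version is 3.12 or 3.13."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def find_tools(names):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    status = "ok" if python_supported() else "unsupported (need 3.12 or 3.13)"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
    for name, path in find_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS).items():
        label = "optional" if name in OPTIONAL_TOOLS else "required"
        print(f"{name} ({label}): {path or 'not found'}")
```

A `not found` for `rustc` only matters if you intend to run the Tauri desktop app.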

Installation (Step-by-Step)

Step 1: Install Prerequisites

macOS

# Install Bun
brew install oven-sh/bun/bun

# Install Python 3.12 (if not present)
brew install python@3.12

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Windows (WSL2)

# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc

# Install Python 3.12
sudo apt install -y python3.12 python3.12-venv python3.12-dev

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Step 2: Clone the Repository

# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

Current location on this Mac: /Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/

Step 3: Install JavaScript Dependencies

# Root workspace dependencies
bun install

# Web frontend dependencies (separate)
cd web && bun install && cd ..

Step 4: Install Python Dependencies

# Option A: Use the Makefile (recommended)
make setup-python

# Option B: Manual
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt

# Apple Silicon only — MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt

# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git

Step 5: Verify Installation

# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"

# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check MLX (Apple Silicon only)
backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
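
The individual checks above can be folded into one script. This is a sketch under stated assumptions: `detect_backend` is a hypothetical helper, and the preference order (MLX, then CUDA, then MPS, then CPU) is an illustration, not Voicebox's own selection logic. Run it with `backend/venv/bin/python` so it sees the packages installed in the venv.

```python
import importlib.util


def detect_backend():
    """Return 'mlx', 'cuda', 'mps', or 'cpu' based on what is installed and available."""
    # MLX is only installed on Apple Silicon (backend/requirements-mlx.txt).
    if importlib.util.find_spec("mlx") is not None:
        return "mlx"
    # Without PyTorch there is no GPU backend to probe; report CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(f"Detected backend: {detect_backend()}")
```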

Running Voicebox

Option A: Web UI (backend + web frontend)

Open two terminals:

Terminal 1 — Backend:

cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493

Expected output:

INFO:     Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO:     Application startup complete.
INFO:     GPU available: MPS (Apple Silicon)

Terminal 2 — Web Frontend:

cd /path/to/voicebox/web
bun run dev

Expected output:

  VITE v5.4.21  ready in 2536 ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose

Open in browser: http://localhost:5173/

Option B: Tauri Desktop App

make dev
# This starts both backend + Tauri desktop window

Requires Rust toolchain installed.

Option C: Backend Only (API Mode)

make dev-backend
# API docs at: http://localhost:17493/docs
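
When scripting against the backend, it helps to wait until the server is actually answering before sending requests. The following is a minimal sketch, assuming the `/docs` page listed above is a reasonable readiness signal; `wait_for_backend` is a hypothetical helper, not a Voicebox function. It bypasses any configured HTTP proxy so that requests to localhost are not routed through a corporate proxy.

```python
import time
import urllib.request


def wait_for_backend(url="http://localhost:17493/docs", timeout=30.0, interval=1.0):
    """Poll url until it answers with HTTP 200 or timeout seconds pass."""
    # An empty ProxyHandler disables proxy auto-detection for this opener.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: the server is not ready yet.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_backend()` right after `make dev-backend` returns `True` once the API docs page loads, or `False` if the server has not answered within 30 seconds.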

First Use

1. Download a Model

On first launch, Voicebox will prompt you to download the Qwen3-TTS model. This is ~1.7 GB.

If the automatic download fails (corporate proxy, etc.):

# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k disables TLS certificate verification; drop it if your proxy does not intercept TLS
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
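
A download that fails behind a proxy can leave a missing or empty file without an obvious error. The sketch below checks that the three files named in the commands above exist and are non-empty. `missing_model_files` is a hypothetical helper, and the check does not validate the file contents, so a saved proxy error page would still pass.

```python
from pathlib import Path

# File names taken from the manual download commands above.
EXPECTED_FILES = ["config.json", "model.safetensors", "tokenizer.json"]


def missing_model_files(model_dir="models/Qwen3-TTS", expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in model_dir."""
    root = Path(model_dir)
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_model_files()
    print("All model files present" if not missing else f"Missing or empty: {missing}")
```

Run it from the voicebox repository root so the relative `models/Qwen3-TTS` path resolves.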

2. Create a Voice Profile

  1. Click "Voice Profiles" in the sidebar
  2. Click "New Profile"
  3. Record a few seconds of your voice — or upload an audio file (.wav, .mp3)
  4. Give it a name → Save

3. Generate Speech

  1. Click "Generate" in the sidebar
  2. Type your text in the input box
  3. Select a voice profile
  4. Click Generate
  5. Listen to the output, download as .wav

4. Story Editor

  1. Click "Stories" in the sidebar
  2. Create a new story
  3. Add segments with different voices
  4. Generate the full story as a single audio file
  5. Export for podcasts, audiobooks, etc.

Ports & URLs

| Service | URL | Purpose |
|---|---|---|
| Backend API | http://localhost:17493 | FastAPI server |
| API Docs | http://localhost:17493/docs | Swagger/OpenAPI docs |
| Web Frontend | http://localhost:5173 | Vite dev server (web mode) |
| Tauri App | Native window | Desktop app (if using Tauri) |

Make Commands Reference

| Command | Description |
|---|---|
| make setup | Full setup (JS + Python + MLX if Apple Silicon) |
| make setup-js | Install JavaScript dependencies only |
| make setup-python | Install Python dependencies + venv |
| make dev | Start backend + Tauri desktop app |
| make dev-backend | Start FastAPI backend only (port 17493) |
| make dev-web | Start backend + web frontend |
| make kill-dev | Kill all dev processes |
| make build | Build server binary + Tauri app |
| make build-web | Build web frontend to web/dist/ |
| make db-init | Initialize SQLite database |
| make db-reset | Reset database (delete + reinitialize) |
| make generate-api | Generate TypeScript API client from OpenAPI |
| make lint | Run Biome linter |
| make format | Format code with Biome |
| make test | Run all tests |
| make clean | Clean build artifacts |
| make clean-all | Nuclear clean (everything including node_modules) |

Platform Performance

| Platform | GPU Backend | Speed (est.) | Model Load Time |
|---|---|---|---|
| Mac M4 Pro 48GB | MLX (Metal) | Fast (real-time or faster) | ~5s |
| Mac M4 Pro 48GB | PyTorch MPS | Good (near real-time) | ~8s |
| RTX 5090 24GB | PyTorch CUDA | Fastest (well above real-time) | ~3s |
| RTX 3060 12GB | PyTorch CUDA | Good (real-time) | ~5s |
| CPU only (i7) | PyTorch CPU | Slow (below real-time) | ~15s |
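
One common way to put a number on "real-time" is the real-time factor: generation time divided by the duration of the audio produced. The sketch below uses hypothetical timings, not measurements from any platform listed above, and `real_time_factor` is an illustrative helper rather than something Voicebox reports.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical timings for a 4-second clip.
print(real_time_factor(2.0, 4.0))  # 0.5 -> faster than real time
print(real_time_factor(6.0, 4.0))  # 1.5 -> slower than real time
```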

Troubleshooting

Backend won't start

# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version

# Check if port is in use
lsof -i :17493

# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
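
On systems without `lsof` (some minimal WSL2 images, for example), the port check can be done from Python instead. This is a small sketch with a hypothetical helper, `port_in_use`; it only reports whether something accepts TCP connections on the port, not which process owns it.

```python
import socket


def port_in_use(port=17493, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "in use" if port_in_use() else "free"
    print(f"Port 17493 is {state}")
```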

Frontend won't start (ERR_MODULE_NOT_FOUND)

# Web dependencies need to be installed separately
cd web && bun install && cd ..

# Then start
cd web && bun run dev

Model download fails (corporate proxy)

# Use hf-mirror.com instead of huggingface.co
# See "First Use > Download a Model" section above

MPS not available (macOS)

# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"

# If False — you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch

CUDA not available (Windows/WSL2)

# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"

# If False — install CUDA PyTorch
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121

transformers version conflict

mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3

This is a pip dependency warning, not a blocking error, and Voicebox still runs. The mlx-audio package pins a pre-release version of transformers, while the installed stable version works for Qwen3-TTS.
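
To see which versions are actually installed in the venv, the standard library can query package metadata directly. A minimal sketch follows; `installed_version` is a hypothetical helper, and the package names are the ones mentioned in the warning above plus `torch`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("transformers", "mlx-audio", "torch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it with `backend/venv/bin/python` so it inspects the backend's environment rather than the system Python.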

Database issues

# Reset the database
make db-reset
# Or manually (delete, then reinitialize):
rm -f backend/data/voicebox.db
make db-init

Kill everything

make kill-dev
# Or manually:
pkill -f "uvicorn" || true
pkill -f "vite" || true

File Structure

voicebox/
├── backend/                  # FastAPI Python backend
│   ├── main.py               # App entry point
│   ├── requirements.txt      # Python deps
│   ├── requirements-mlx.txt  # Apple Silicon MLX deps
│   ├── venv/                 # Python virtual environment
│   └── data/voicebox.db      # SQLite database
├── web/                      # Vite + React web frontend
├── app/                      # Shared app components
├── tauri/                    # Tauri desktop app (Rust)
├── landing/                  # Landing page
├── models/                   # Downloaded TTS models
├── scripts/                  # Build/setup scripts
├── Makefile                  # All commands
└── package.json              # Bun workspace root

Relevance to LysnrAI

Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:

| Use Case | How |
|---|---|
| Voice profile testing | Clone voices locally before using in LysnrAI TTS pipeline |
| Audio content creation | Generate podcast/narration audio for LysnrAI content |
| TTS experimentation | Test Qwen3-TTS model quality and performance locally |
| API integration | Voicebox REST API (port 17493) could be called from LysnrAI services |
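
Before wiring the REST API into another service, it is worth listing the endpoints the running backend actually exposes rather than guessing their names. The sketch below assumes Voicebox keeps FastAPI's default schema location, `/openapi.json`; `list_endpoints` and `fetch_spec` are hypothetical helpers, not part of Voicebox or LysnrAI.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(spec):
    """Return sorted (METHOD, path) pairs from an OpenAPI document."""
    pairs = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-method keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_spec(base_url="http://localhost:17493"):
    """Download the OpenAPI document (assumes FastAPI's default /openapi.json)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the backend running, `list_endpoints(fetch_spec())` gives the routes a LysnrAI service could call. The same schema backs the Swagger page at `/docs`.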

Quick Start (TL;DR)

# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install
bun install && cd web && bun install && cd ..
make setup-python

# Run (two terminals)
make dev-backend                    # Terminal 1: backend on :17493
cd web && bun run dev               # Terminal 2: frontend on :5173

# Open
open http://localhost:5173