# Voicebox — Local Voice Cloning Studio

> **Repo:** [github.com/jamiepine/voicebox](https://github.com/jamiepine/voicebox) · **Version:** 0.1.12
> **Stack:** Tauri (Rust) + FastAPI (Python) + Qwen3-TTS + MLX (Apple Silicon) / PyTorch (CUDA)
> **Local clone:** `__LOCAL_LLMs/APPS/Voice/voicebox/` (gitignored)

---

## What Is Voicebox?

Voicebox is an open-source, local-first voice cloning studio powered by **Qwen3-TTS**. It provides a DAW-like interface for professional voice synthesis — a local alternative to cloud services like ElevenLabs.

```
┌──────────────────────────────────────────────────────────────────────┐
│ Voicebox Architecture                                                │
│                                                                      │
│  ┌───────────────────────┐          ┌──────────────────────────┐   │
│  │ Web UI (Vite + React) │          │ Tauri Desktop App        │   │
│  │ http://localhost:5173  │    OR    │ (Rust + native window)   │   │
│  └──────────┬────────────┘          └──────────┬───────────────┘   │
│             │                                   │                    │
│             └──────────┐   ┌────────────────────┘                   │
│                        ▼   ▼                                        │
│               ┌─────────────────────┐                               │
│               │ FastAPI Backend      │                               │
│               │ http://localhost:17493│                               │
│               │                     │                               │
│               │ • Qwen3-TTS model   │                               │
│               │ • Voice profiles     │                               │
│               │ • Audio generation   │                               │
│               │ • Story editor       │                               │
│               │ • SQLite database    │                               │
│               │ • REST API           │                               │
│               └─────────────────────┘                               │
│                        │                                             │
│               ┌────────┴────────┐                                   │
│               │ GPU Acceleration│                                   │
│               │ MLX/MPS (Mac)   │                                   │
│               │ CUDA (Win/Linux)│                                   │
│               └─────────────────┘                                   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Key Features

| Feature               | Description                                                      |
| --------------------- | ---------------------------------------------------------------- |
| **Voice cloning**     | Record or upload a few seconds of audio → create a voice profile |
| **Text-to-speech**    | Type text, pick a voice, generate speech with Qwen3-TTS          |
| **Story editor**      | Multi-voice timeline for podcasts, narratives, audiobooks        |
| **Multi-track audio** | DAW-like editing with multiple voices/tracks                     |
| **REST API**          | Full API for integration (port 17493)                            |
| **100% local**        | No cloud, no data leaves your machine                            |
| **Cross-platform**    | macOS (MLX Metal), Windows/Linux (PyTorch CUDA)                  |

---

## Prerequisites

| Component  | Required                               | Check Command          |
| ---------- | -------------------------------------- | ---------------------- |
| **Python** | 3.12 or 3.13                           | `python3.12 --version` |
| **Bun**    | ≥ 1.0                                  | `bun --version`        |
| **Rust**   | Latest stable (for Tauri desktop only) | `rustc --version`      |
| **Git**    | Any                                    | `git --version`        |

### Platform-Specific

| Platform                  | GPU Backend  | Extra Requirements            |
| ------------------------- | ------------ | ----------------------------- |
| **macOS (Apple Silicon)** | MLX (Metal)  | Xcode Command Line Tools      |
| **macOS (Intel)**         | CPU only     | —                             |
| **Windows/WSL2**          | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |
| **Linux**                 | PyTorch CUDA | NVIDIA drivers + CUDA toolkit |

---

## Installation (Step-by-Step)

### Step 1: Install Prerequisites

#### macOS

```bash
# Install Bun
brew install oven-sh/bun/bun

# Install Python 3.12 (if not present)
brew install python@3.12

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

#### Windows (WSL2)

```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash
source ~/.bashrc

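# Install Python 3.12 (if apt cannot find the package, your Ubuntu release may need an extra package source)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev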

# Rust (only needed for Tauri desktop app)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
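
Before moving on, it can help to confirm that the interpreter you plan to use matches the Prerequisites table. The sketch below is a convenience check, not part of Voicebox; the function name `is_supported_python` is made up for this example, and the supported versions are copied from the table above.

```python
import sys

# Supported (major, minor) pairs, taken from the Prerequisites table.
SUPPORTED = {(3, 12), (3, 13)}


def is_supported_python(version=None):
    """Return True if the (major, minor) pair is listed as supported."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED


if __name__ == "__main__":
    found = "{}.{}".format(*sys.version_info[:2])
    status = "OK" if is_supported_python() else "unsupported"
    print(f"Python {found}: {status}")
```

Run it with the same interpreter you will use for the venv, for example `python3.12 check_python.py` (the file name is arbitrary).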

### Step 2: Clone the Repository

```bash
# Clone into the APPS/Voice directory
cd /path/to/__LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
```

> **Current location on this Mac:** `/Users/sd9235/code/mygh/learning_ai_common_plat/__LOCAL_LLMs/APPS/Voice/voicebox/`

### Step 3: Install JavaScript Dependencies

```bash
# Root workspace dependencies
bun install

# Web frontend dependencies (separate)
cd web && bun install && cd ..
```

### Step 4: Install Python Dependencies

```bash
# Option A: Use the Makefile (recommended)
make setup-python

# Option B: Manual (use instead of Option A)
python3.12 -m venv backend/venv
source backend/venv/bin/activate
pip install --upgrade pip
pip install -r backend/requirements.txt

# Apple Silicon only — MLX for native Metal acceleration
pip install -r backend/requirements-mlx.txt

# Install Qwen3-TTS
pip install git+https://github.com/QwenLM/Qwen3-TTS.git
```

### Step 5: Verify Installation

```bash
# Check Python venv
backend/venv/bin/python -c "import torch; print(f'PyTorch: {torch.__version__}')"
backend/venv/bin/python -c "import fastapi; print(f'FastAPI: {fastapi.__version__}')"

# Check GPU backend
# macOS:
backend/venv/bin/python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
# Windows/Linux:
backend/venv/bin/python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check MLX (Apple Silicon only); the version string lives on mlx.core
backend/venv/bin/python -c "import mlx.core as mx; print(f'MLX: {mx.__version__}')"
```
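
The individual checks above can be folded into one script that reports which backend the machine is likely to use. This is a rough sketch that mirrors the Platform-Specific table; it is an assumption about the selection order, not a copy of Voicebox's own logic, and `pick_backend`, `has_module` and `detect` are names invented here.

```python
import importlib.util
import platform


def has_module(name):
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(system, machine, mlx_ok, cuda_ok, mps_ok):
    """Prefer MLX, then MPS, on Apple Silicon; CUDA elsewhere; CPU as fallback."""
    if system == "Darwin" and machine == "arm64":
        if mlx_ok:
            return "mlx"
        if mps_ok:
            return "mps"
    if cuda_ok:
        return "cuda"
    return "cpu"


def detect():
    """Probe the current environment and return the backend name."""
    cuda_ok = mps_ok = False
    if has_module("torch"):
        import torch  # imported lazily so the script still runs without PyTorch
        cuda_ok = torch.cuda.is_available()
        mps_ok = torch.backends.mps.is_available()
    return pick_backend(
        platform.system(), platform.machine(), has_module("mlx"), cuda_ok, mps_ok
    )


if __name__ == "__main__":
    print("Likely backend:", detect())
```

Run it with `backend/venv/bin/python` so that it sees the packages installed in Step 4.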

---

## Running Voicebox

### Option A: Web Frontend + Backend (Recommended for Development)

Open **two terminals:**

**Terminal 1 — Backend:**

```bash
cd /path/to/voicebox
make dev-backend
# Or manually:
# backend/venv/bin/uvicorn backend.main:app --reload --port 17493
```

Expected output:

```
INFO:     Uvicorn running on http://0.0.0.0:17493 (Press CTRL+C to quit)
INFO:     Application startup complete.
INFO:     GPU available: MPS (Apple Silicon)
```

**Terminal 2 — Web Frontend:**

```bash
cd /path/to/voicebox/web
bun run dev
```

Expected output:

```
  VITE v5.4.21  ready in 2536 ms

  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose
```

**Open in browser:** [http://localhost:5173/](http://localhost:5173/)

### Option B: Tauri Desktop App

```bash
make dev
# This starts both backend + Tauri desktop window
```

> Requires the Rust toolchain (see Prerequisites).

### Option C: Backend Only (API Mode)

```bash
make dev-backend
# API docs at: http://localhost:17493/docs
```
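
When running in API mode, a script can confirm the backend is answering before it sends any work. The sketch below uses only the standard library and the `/docs` URL mentioned above; `api_url` and `is_reachable` are hypothetical helper names, not part of Voicebox.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:17493"  # backend address used throughout this guide


def api_url(path, base=BASE_URL):
    """Join the base URL and a path without doubling slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


if __name__ == "__main__":
    docs = api_url("/docs")
    print(docs, "reachable" if is_reachable(docs) else "not reachable")
```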

---

## First Use

### 1. Download a Model

On first launch, Voicebox prompts you to download the Qwen3-TTS model (about 1.7 GB).

If the automatic download fails, for example behind a corporate proxy, download the files manually:

```bash
# Manual download via hf-mirror.com (bypasses Forcepoint proxy)
# Note: -k disables TLS certificate verification; drop it if your network does not require it
mkdir -p models/Qwen3-TTS
cd models/Qwen3-TTS
curl -k -L -o config.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/config.json"
curl -k -L -o model.safetensors "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/model.safetensors"
curl -k -L -o tokenizer.json "https://hf-mirror.com/Qwen/Qwen3-TTS-0.6B/resolve/main/tokenizer.json"
cd ../..
```
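
The same download can be scripted in Python, which makes it easier to skip files that are already present. This is a sketch under the same assumptions as the curl commands: the repository id, file names and target directory are copied from them and have not been verified separately. Unlike `curl -k`, `urllib` verifies TLS certificates by default, so it may fail behind a proxy that intercepts TLS.

```python
import urllib.request
from pathlib import Path

MIRROR = "https://hf-mirror.com"
REPO_ID = "Qwen/Qwen3-TTS-0.6B"  # repo id copied from the curl commands above
FILES = ["config.json", "model.safetensors", "tokenizer.json"]


def file_url(filename, repo_id=REPO_ID, mirror=MIRROR, revision="main"):
    """Build the resolve URL for one file, in the same form the curl commands use."""
    return f"{mirror}/{repo_id}/resolve/{revision}/{filename}"


def download_all(target_dir="models/Qwen3-TTS"):
    """Download every file in FILES, skipping ones that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        dest = target / name
        if dest.exists():
            continue
        # Write to a temporary name first so an interrupted download
        # is not mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        urllib.request.urlretrieve(file_url(name), str(tmp))
        tmp.replace(dest)


if __name__ == "__main__":
    download_all()
```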

### 2. Create a Voice Profile

1. Click **"Voice Profiles"** in the sidebar
2. Click **"New Profile"**
3. **Record** a few seconds of your voice — or **upload** an audio file (.wav, .mp3)
4. Give it a name → Save

### 3. Generate Speech

1. Click **"Generate"** in the sidebar
2. Type your text in the input box
3. Select a voice profile
4. Click **Generate**
5. Listen to the output and download it as .wav

### 4. Story Editor

1. Click **"Stories"** in the sidebar
2. Create a new story
3. Add segments with different voices
4. Generate the full story as a single audio file
5. Export for podcasts, audiobooks, etc.

---

## Ports & URLs

| Service          | URL                                                        | Purpose                      |
| ---------------- | ---------------------------------------------------------- | ---------------------------- |
| **Backend API**  | [http://localhost:17493](http://localhost:17493)           | FastAPI server               |
| **API Docs**     | [http://localhost:17493/docs](http://localhost:17493/docs) | Swagger/OpenAPI docs         |
| **Web Frontend** | [http://localhost:5173](http://localhost:5173)             | Vite dev server (web mode)   |
| **Tauri App**    | Native window                                              | Desktop app (if using Tauri) |

---

## Make Commands Reference

| Command             | Description                                       |
| ------------------- | ------------------------------------------------- |
| `make setup`        | Full setup (JS + Python + MLX if Apple Silicon)   |
| `make setup-js`     | Install JavaScript dependencies only              |
| `make setup-python` | Install Python dependencies + venv                |
| `make dev`          | Start backend + Tauri desktop app                 |
| `make dev-backend`  | Start FastAPI backend only (port 17493)           |
| `make dev-web`      | Start backend + web frontend                      |
| `make kill-dev`     | Kill all dev processes                            |
| `make build`        | Build server binary + Tauri app                   |
| `make build-web`    | Build web frontend to `web/dist/`                 |
| `make db-init`      | Initialize SQLite database                        |
| `make db-reset`     | Reset database (delete + reinitialize)            |
| `make generate-api` | Generate TypeScript API client from OpenAPI       |
| `make lint`         | Run Biome linter                                  |
| `make format`       | Format code with Biome                            |
| `make test`         | Run all tests                                     |
| `make clean`        | Clean build artifacts                             |
| `make clean-all`    | Nuclear clean (everything including node_modules) |

---

## Platform Performance

| Platform            | GPU Backend  | Speed (est.)                   | Model Load Time |
| ------------------- | ------------ | ------------------------------ | --------------- |
| **Mac M4 Pro 48GB** | MLX (Metal)  | Fast — real-time or faster     | ~5s             |
| **Mac M4 Pro 48GB** | PyTorch MPS  | Good — near real-time          | ~8s             |
| **RTX 5090 32GB**   | PyTorch CUDA | Fastest — well above real-time | ~3s             |
| **RTX 3060 12GB**   | PyTorch CUDA | Good — real-time               | ~5s             |
| **CPU only (i7)**   | PyTorch CPU  | Slow — below real-time         | ~15s            |

---
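The table uses "real-time" informally. One common way to make it measurable is the real-time factor (RTF): generation time divided by the duration of the audio produced, where values below 1 mean faster than real-time. The sketch below assumes that definition; the table itself does not state which measure its estimates are based on.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def describe(rtf):
    """Map an RTF value to wording similar to the table's (illustrative only)."""
    if rtf < 1.0:
        return "faster than real-time"
    if rtf == 1.0:
        return "real-time"
    return "below real-time"


if __name__ == "__main__":
    # Example: 5 s of compute for a 10 s clip
    rtf = real_time_factor(5.0, 10.0)
    print(f"RTF {rtf:.2f}: {describe(rtf)}")
```
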

## Troubleshooting

### Backend won't start

```bash
# Check Python version (needs 3.12 or 3.13)
backend/venv/bin/python --version

# Check if port is in use
lsof -i :17493

# Try starting manually with verbose output
backend/venv/bin/uvicorn backend.main:app --reload --port 17493 --log-level debug
```
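
If `lsof` is not available (for example on a minimal WSL2 install), the port check can be done from Python instead. This is a small standard-library sketch; `port_in_use` is a helper name chosen for this example, and it only tells you that something is listening, not which process.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = 17493  # backend port used in this guide
    print(f"port {port}:", "in use" if port_in_use(port) else "free")
```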

### Frontend won't start (ERR_MODULE_NOT_FOUND)

```bash
# Web dependencies need to be installed separately
cd web && bun install && cd ..

# Then start
cd web && bun run dev
```

### Model download fails (corporate proxy)

Use hf-mirror.com instead of huggingface.co. The manual download commands are under "First Use", step 1 ("Download a Model").

### MPS not available (macOS)

```bash
# Check PyTorch MPS support
backend/venv/bin/python -c "import torch; print(torch.backends.mps.is_available())"

# If False — you may need to update PyTorch
backend/venv/bin/pip install --upgrade torch
```

### CUDA not available (Windows/WSL2)

```bash
# Check CUDA
backend/venv/bin/python -c "import torch; print(torch.cuda.is_available())"

# If False: install a CUDA build of PyTorch
# (cu121 shown; pick the index that matches your driver and GPU generation)
backend/venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### transformers version conflict

```
mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3
```

This is a pip dependency warning, not a blocking error. mlx-audio pins a pre-release version of transformers, but Qwen3-TTS works with the stable release, so the warning can be ignored.
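
To see what is actually installed when a message like this appears, the sketch below parses the warning line and looks up a package's installed version. It assumes the message format shown above; `parse_conflict` and `installed_version` are hypothetical helpers, and the regular expression is not taken from pip.

```python
import re
from importlib import metadata

# Matches: "<pkg> <ver> requires <dep><spec>, but you have <dep> <installed>"
PATTERN = re.compile(
    r"^(?P<pkg>\S+) (?P<pkg_version>\S+) requires "
    r"(?P<dep>[A-Za-z0-9_.\-]+)(?P<spec>[^,]+), "
    r"but you have (?P=dep) (?P<installed>\S+)"
)


def parse_conflict(line):
    """Return the parts of a conflict message as a dict, or None if it does not match."""
    match = PATTERN.match(line.strip())
    return match.groupdict() if match else None


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    msg = "mlx-audio 0.3.1 requires transformers==5.0.0rc3, but you have transformers 4.57.3"
    info = parse_conflict(msg)
    print(info)
    print("installed now:", installed_version(info["dep"]))
```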

### Database issues

```bash
# Reset the database
make db-reset
# Or manually: delete the file, then reinitialize
rm -f backend/data/voicebox.db
make db-init
```
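
Before deleting the database, it may be worth checking what it contains. The sketch below opens the SQLite file read-only and lists each table with its row count. The path comes from the File Structure section; the function name `list_tables` is made up here, and nothing is assumed about Voicebox's actual schema.

```python
import sqlite3
from pathlib import Path

DB_PATH = Path("backend/data/voicebox.db")  # path from the File Structure section


def list_tables(db_path=DB_PATH):
    """Return {table_name: row_count} for every user table in the SQLite file."""
    db_path = Path(db_path)
    if not db_path.exists():
        return {}
    # mode=ro opens the file read-only, so inspecting never modifies it.
    conn = sqlite3.connect(f"file:{db_path.as_posix()}?mode=ro", uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for table, rows in list_tables().items():
        print(f"{table}: {rows} rows")
```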

### Kill everything

```bash
make kill-dev
# Or manually:
pkill -f "uvicorn" || true
pkill -f "vite" || true
```

---

## File Structure

```
voicebox/
├── backend/                  # FastAPI Python backend
│   ├── main.py               # App entry point
│   ├── requirements.txt      # Python deps
│   ├── requirements-mlx.txt  # Apple Silicon MLX deps
│   ├── venv/                 # Python virtual environment
│   └── data/voicebox.db      # SQLite database
├── web/                      # Vite + React web frontend
├── app/                      # Shared app components
├── tauri/                    # Tauri desktop app (Rust)
├── landing/                  # Landing page
├── models/                   # Downloaded TTS models
├── scripts/                  # Build/setup scripts
├── Makefile                  # All commands
└── package.json              # Bun workspace root
```

---

## Relevance to LysnrAI

Voicebox is a standalone tool — not integrated into LysnrAI. However, it's useful for:

| Use Case                   | How                                                                  |
| -------------------------- | -------------------------------------------------------------------- |
| **Voice profile testing**  | Clone voices locally before using in LysnrAI TTS pipeline            |
| **Audio content creation** | Generate podcast/narration audio for LysnrAI content                 |
| **TTS experimentation**    | Test Qwen3-TTS model quality and performance locally                 |
| **API integration**        | Voicebox REST API (port 17493) could be called from LysnrAI services |

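For the API integration case, a reasonable first step is to discover which endpoints the backend exposes rather than guessing their names. FastAPI serves its schema at `/openapi.json` by default; the sketch below assumes Voicebox has not changed or disabled that path. `list_operations` and `fetch_spec` are illustrative names, and no specific Voicebox endpoint is assumed.

```python
import json
import urllib.request

# FastAPI's default schema location (assumed not overridden by Voicebox)
OPENAPI_URL = "http://localhost:17493/openapi.json"
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_operations(spec):
    """Return (METHOD, path) pairs from an OpenAPI document, sorted by path."""
    ops = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            # Skip non-method keys such as "parameters" or "summary"
            if method.lower() in HTTP_METHODS:
                ops.append((method.upper(), path))
    return sorted(ops, key=lambda op: (op[1], op[0]))


def fetch_spec(url=OPENAPI_URL, timeout=5.0):
    """Download and parse the OpenAPI document from a running backend."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for method, path in list_operations(fetch_spec()):
        print(f"{method:6} {path}")
```
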
---

## Quick Start (TL;DR)

```bash
# Clone
cd __LOCAL_LLMs/APPS/Voice
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install
bun install && cd web && bun install && cd ..
make setup-python

# Run (two terminals)
make dev-backend                    # Terminal 1: backend on :17493
cd web && bun run dev               # Terminal 2: frontend on :5173

# Open (macOS; on Linux use xdg-open instead)
open http://localhost:5173
```