# 03 — Whisper.cpp Setup

> Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.

---

## Installation

```bash
brew install whisper-cpp
```
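
The formula is named `whisper-cpp`, but the binaries it installs all start with `whisper-`, so a quick PATH check after installing can save some confusion. The sketch below is one way to do that; `check_bin` is a helper name made up here, not part of whisper.cpp.

```bash
# Report whether a command is available on PATH.
# `check_bin` is a hypothetical helper name used only in this sketch.
check_bin() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the binaries used most often in this guide.
for bin in whisper-cli whisper-server whisper-stream; do
  check_bin "$bin" || true
done
```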

- **Version installed:** 1.8.3
- **Dependency installed:** sdl2 2.32.10 (audio I/O)
- **Binary location:** `/opt/homebrew/bin/whisper-*`

### Installed Binaries

| Binary                        | Purpose                                          |
| ----------------------------- | ------------------------------------------------ |
| `whisper-cli`                 | **Main CLI** — transcribe audio files            |
| `whisper-server`              | HTTP server — POST audio, get JSON transcription |
| `whisper-stream`              | **Real-time** microphone streaming transcription |
| `whisper-talk-llama`          | Voice → Whisper → LLaMA → TTS pipeline           |
| `whisper-bench`               | Benchmark a model on your hardware               |
| `whisper-command`             | Voice command detection                          |
| `whisper-lsp`                 | Language Server Protocol integration             |
| `whisper-quantize`            | Quantize models to smaller formats               |
| `whisper-vad-speech-segments` | Voice Activity Detection — split audio by speech |

> **Note:** The binary is `whisper-cli`, NOT `whisper-cpp`. The brew formula name differs from the binary name.

---

## GGML Model Files

Whisper.cpp needs a GGML model file, which the Homebrew formula does not install. Download one separately.

### Model Size Guide

| Model              | File                          | Disk Size  | Speed     | Accuracy             |
| ------------------ | ----------------------------- | ---------- | --------- | -------------------- |
| Tiny (English)     | `ggml-tiny.en.bin`            | 75 MB      | Blazing   | Basic                |
| Base (English)     | `ggml-base.en.bin`            | 142 MB     | Very fast | Good                 |
| Medium (English)   | `ggml-medium.en.bin`          | 1.5 GB     | Fast      | Great                |
| Large v3           | `ggml-large-v3.bin`           | 3.1 GB     | Moderate  | Best                 |
| **Large v3 Turbo** | **`ggml-large-v3-turbo.bin`** | **1.6 GB** | **Fast**  | **Near large-v3**    |

**Recommended for M4 Pro:** `ggml-large-v3-turbo.bin`. Turbo is large-v3 with a much smaller decoder, so accuracy stays close to large-v3 at about half the file size, and it runs Metal-accelerated on Apple Silicon.

### Download Models

Models are stored in `~/whisper-models/`.

```bash
mkdir -p ~/whisper-models

# Recommended: Large v3 Turbo (~1.6 GB)
# --fail makes curl exit non-zero on an HTTP error instead of saving the error page
curl -L --fail -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Alternative mirror (only needed if the Hugging Face download fails)
curl -L --fail -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
```
```
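
An intercepting proxy can hand back a small HTML page in place of the model, and `curl` will happily save it under the `.bin` name. A minimal sanity check is sketched below; the function name `looks_like_model` and the 1 MiB threshold are assumptions chosen for illustration, not anything whisper.cpp defines.

```bash
# Return 0 if the file is plausibly a model binary, 1 otherwise.
# `looks_like_model` and the 1 MiB threshold are assumptions for this sketch.
looks_like_model() {
  local file="$1" size first
  [ -f "$file" ] || { echo "no such file: $file"; return 1; }

  # A blocked download is typically a few KB of HTML; even the tiny model is ~75 MB.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt 1048576 ]; then
    echo "too small to be a model: $size bytes"
    return 1
  fi

  # Error pages start with markup such as <!DOCTYPE or <html.
  first=$(head -c 1 "$file")
  if [ "$first" = "<" ]; then
    echo "file starts with markup, probably an error page"
    return 1
  fi
  echo "ok: $size bytes"
}
```

Typical use after a download: `looks_like_model ~/whisper-models/ggml-large-v3-turbo.bin`. This only catches obviously wrong files; it does not verify a checksum.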

**Download sources:**

- https://huggingface.co/ggerganov/whisper.cpp/tree/main
- https://ggml.ggerganov.com/

### Current Status (2026-02-19)

❌ **Model download blocked by corporate proxy** (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See [08-troubleshooting.md](08-troubleshooting.md).

---

## Audio Format Requirements

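Whisper models work on 16 kHz mono audio, so the most reliable input for whisper.cpp is a **WAV** file (16-bit PCM, 16 kHz, mono). Recent `whisper-cli` builds list FLAC, MP3, and OGG as supported inputs too, but other containers such as `.m4a` still need converting first.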

### ffmpeg Installation

```bash
brew install ffmpeg
```

Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)

### Converting Audio Files

```bash
# m4a → wav (16 kHz mono, 16-bit PCM, optimal for Whisper)
ffmpeg -i input.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav

# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

# Any other format ffmpeg can decode → wav (ogg shown here)
ffmpeg -i input.ogg -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
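
For more than a handful of files, the same conversion can be wrapped in a small helper. This is a sketch only: `wav_name_for` and `convert_for_whisper` are names invented here, and the output is assumed to go next to the input file.

```bash
# Derive the .wav path for an input file by swapping its extension.
# `wav_name_for` and `convert_for_whisper` are hypothetical helper names.
wav_name_for() {
  printf '%s.wav\n' "${1%.*}"
}

# Convert one file to 16 kHz mono 16-bit PCM WAV next to the original.
# -nostdin keeps ffmpeg from consuming the terminal inside a loop;
# -y overwrites an existing output file without asking.
convert_for_whisper() {
  ffmpeg -nostdin -loglevel error -y -i "$1" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name_for "$1")"
}
```

For example, `for f in ~/Downloads/*.m4a; do convert_for_whisper "$f"; done`. Do not pass a file that already ends in `.wav`, since the input and output paths would then be the same.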

### Tested Conversion (2026-02-19)

```bash
ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
       -ar 16000 -ac 1 \
       '/Users/sd9235/Downloads/recording.wav'
# Result: 181 KB, 5.80 seconds, 16kHz mono
```
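
The reported size is what uncompressed PCM should produce: duration × sample rate × channels × 2 bytes per sample, plus a small header. The sketch below reproduces that arithmetic; `expected_wav_bytes` is a name made up here, and 44 bytes is the minimal WAV header (ffmpeg may write a slightly larger one).

```bash
# Expected size in bytes of an uncompressed 16-bit PCM WAV file.
# Arguments: duration in milliseconds, sample rate in Hz, channel count.
# `expected_wav_bytes` is a hypothetical helper; 44 bytes is the minimal
# WAV header, so real files can be a little larger.
expected_wav_bytes() {
  local ms="$1" rate="$2" channels="$3"
  echo $(( ms * rate * channels * 2 / 1000 + 44 ))
}

expected_wav_bytes 5800 16000 1   # prints 185644, about 181 KiB
```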

---

## Usage

### File Transcription

```bash
whisper-cli \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en \
  --file /path/to/audio.wav
```
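
To keep the transcript on disk rather than only on the terminal, `whisper-cli` can write text and subtitle files. The wrapper below is a sketch under a few assumptions: `transcribe`, `WHISPER_CLI`, and `WHISPER_MODEL` are names invented here, and the model path defaults to the location used in this guide.

```bash
# Transcribe one WAV file and write a .txt and .srt next to it.
# `transcribe`, WHISPER_CLI, and WHISPER_MODEL are hypothetical names for
# this sketch; the command-line flags are whisper-cli's own.
WHISPER_CLI="${WHISPER_CLI:-whisper-cli}"
WHISPER_MODEL="${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}"

transcribe() {
  local wav="$1"
  [ -f "$WHISPER_MODEL" ] || { echo "model not found: $WHISPER_MODEL" >&2; return 1; }
  [ -f "$wav" ] || { echo "audio not found: $wav" >&2; return 1; }

  # --output-file takes the path without an extension; whisper-cli adds
  # .txt and .srt itself.
  "$WHISPER_CLI" \
    --model "$WHISPER_MODEL" \
    --language en \
    --file "$wav" \
    --output-txt --output-srt \
    --output-file "${wav%.*}"
}
```

Calling `transcribe ~/Downloads/recording.wav` should leave `recording.txt` and `recording.srt` in the same directory.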

### Real-Time Microphone Streaming

```bash
whisper-stream \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en
```

This is particularly relevant for **LysnrAI**: it provides real-time microphone transcription locally, so development and testing do not need the Azure Speech SDK.

### HTTP Server Mode

```bash
# Start server on port 8080
whisper-server \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --port 8080

# POST audio to get transcription
curl -X POST http://localhost:8080/inference \
  -F "file=@audio.wav" \
  -F "response_format=json"
```

### Voice Activity Detection

```bash
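# Needs a Silero VAD model in GGML format, not a Whisper model.
# The file name below is an example; download the VAD model separately.
whisper-vad-speech-segments \
  --vad-model ~/whisper-models/ggml-silero-v5.1.2.bin \
  --file recording.wav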
```

### Benchmarking

```bash
whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
```

---

## Integration with LysnrAI

The local Whisper stack can serve as an **offline fallback** or **dev replacement** for the Azure Speech SDK:

| Component          | Azure (production)      | Whisper.cpp (local dev)         |
| ------------------ | ----------------------- | ------------------------------- |
| Real-time STT      | Azure Speech SDK        | `whisper-stream`                |
| File transcription | Azure Batch             | `whisper-cli`                   |
| HTTP API           | Azure REST API          | `whisper-server`                |
| Cost               | Pay-per-use             | $0.00 (local)                   |
| Latency            | ~200 ms (network, est.) | ~50 ms (local Metal, est.)      |
| Languages          | 100+                    | ~100 (multilingual models only) |

### Potential Integration Points

1. **Desktop app** (`src/audio/azure_stt.py`): Add local Whisper backend option
2. **iOS keyboard** (`LysnrKeyboard/`): Use on-device Whisper for offline dictation
3. **Extraction service evals**: Transcribe test audio fixtures locally
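
On a development machine, choosing between the two stacks could be as simple as checking whether the local pieces are present. The sketch below illustrates that idea only; `pick_stt_backend`, `STT_BACKEND`, and `WHISPER_MODEL` are names invented here, and LysnrAI's real configuration may look quite different.

```bash
# Choose an STT backend: local Whisper when usable, otherwise Azure.
# `pick_stt_backend`, STT_BACKEND, and WHISPER_MODEL are hypothetical names
# for this sketch, not existing LysnrAI settings.
pick_stt_backend() {
  local model="${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}"

  # An explicit override always wins.
  if [ -n "${STT_BACKEND:-}" ]; then
    echo "$STT_BACKEND"
    return
  fi

  # Local Whisper needs both the binary and a model file.
  if command -v whisper-cli >/dev/null 2>&1 && [ -f "$model" ]; then
    echo "whisper"
  else
    echo "azure"
  fi
}
```

With this shape, `STT_BACKEND=azure` forces the cloud path even when a local model is installed, which is useful when comparing the two backends on the same audio.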