03 — Whisper.cpp Setup
Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.
Installation
brew install whisper-cpp
- Version installed: 1.8.3
- Dependency installed: sdl2 2.32.10 (audio I/O)
- Binary location: /opt/homebrew/bin/whisper-*
Installed Binaries
| Binary | Purpose |
|---|---|
| whisper-cli | Main CLI: transcribe audio files |
| whisper-server | HTTP server: POST audio, get JSON transcription |
| whisper-stream | Real-time microphone streaming transcription |
| whisper-talk-llama | Voice → Whisper → LLaMA → TTS pipeline |
| whisper-bench | Benchmark a model on your hardware |
| whisper-command | Voice command detection |
| whisper-lsp | Language Server Protocol integration |
| whisper-quantize | Quantize models to smaller formats |
| whisper-vad-speech-segments | Voice Activity Detection: split audio by speech |
Note: The binary is whisper-cli, NOT whisper-cpp. The brew formula name differs from the binary name.
GGML Model Files
Whisper.cpp requires separate GGML model files (not included with brew install).
Model Size Guide
| Model | File | Disk Size | Speed | Accuracy |
|---|---|---|---|---|
| Tiny (English) | ggml-tiny.en.bin | 75 MB | Blazing | Basic |
| Base (English) | ggml-base.en.bin | 142 MB | Very fast | Good |
| Medium (English) | ggml-medium.en.bin | 1.5 GB | Fast | Great |
| Large v3 | ggml-large-v3.bin | 3.1 GB | Moderate | Best |
| Large v3 Turbo | ggml-large-v3-turbo.bin | 1.6 GB | Fast | Best (distilled) |
Recommended for M4 Pro: ggml-large-v3-turbo. It gives the best accuracy-to-size trade-off (accuracy close to large-v3 at about half the size) and is Metal-accelerated.
Download Models
Models are stored in ~/whisper-models/.
mkdir -p ~/whisper-models
# Recommended: Large v3 Turbo (~1.6 GB)
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
# Alternative mirror
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
Download sources:
Current Status (2026-02-19)
❌ Model download blocked by corporate proxy (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See 08-troubleshooting.md.
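When a proxy intercepts the request, curl can still save a file that is not a model at all (for example an HTML block page). A rough sanity check is to look at the first four bytes of the download. The sketch below assumes the model uses the legacy GGML container, whose magic number 0x67676d6c appears on disk as the bytes "lmgg"; the function name looks_like_ggml is made up for this example.

```shell
# Return 0 if the file starts with the 4-byte GGML magic ("lmgg" on disk,
# i.e. 0x67676d6c stored little-endian). Assumption: the model file uses
# the legacy GGML container format.
looks_like_ggml() {
    [ -f "$1" ] || return 1
    magic=$(head -c 4 "$1" 2>/dev/null)
    [ "$magic" = "lmgg" ]
}

model="$HOME/whisper-models/ggml-large-v3-turbo.bin"
if looks_like_ggml "$model"; then
    echo "model looks valid: $model"
else
    echo "missing or not a GGML file: $model"
fi
```

This only catches obviously wrong downloads. It does not verify that the file is complete, so comparing the size on disk with the expected ~1.6 GB is still worthwhile.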
Audio Format Requirements
Whisper.cpp requires WAV format input (16-bit PCM, ideally 16 kHz mono).
ffmpeg Installation
brew install ffmpeg
Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)
Converting Audio Files
# m4a → wav (16kHz mono, optimal for Whisper)
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
# Any format → wav
ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
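For more than a handful of files, the same conversion can be wrapped in a small helper. This is a minimal sketch, assuming every input file has an extension; the function names wav_name_for and convert_to_wav are illustrative and not part of ffmpeg or whisper.cpp.

```shell
# Map an input path to its .wav sibling: "dir/clip.m4a" -> "dir/clip.wav".
wav_name_for() {
    printf '%s.wav\n' "${1%.*}"
}

# Convert one file to 16 kHz mono WAV, skipping outputs that already exist
# (which also skips inputs that are already .wav files).
convert_to_wav() {
    out=$(wav_name_for "$1")
    if [ -e "$out" ]; then
        echo "skip: $out already exists"
        return 0
    fi
    ffmpeg -loglevel error -i "$1" -ar 16000 -ac 1 "$out"
}

# Usage: convert every .m4a in the current directory.
# for f in ./*.m4a; do convert_to_wav "$f"; done
```

Skipping existing outputs keeps reruns cheap and avoids ffmpeg's interactive overwrite prompt.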
Tested Conversion (2026-02-19)
ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
-ar 16000 -ac 1 \
'/Users/sd9235/Downloads/recording.wav'
# Result: 181 KB, 5.80 seconds, 16kHz mono
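The reported size can be cross-checked with arithmetic: uncompressed PCM takes sample rate × channels × bytes per sample for every second of audio. A small sketch (the function name pcm_bytes is illustrative, and the WAV header, typically a few dozen bytes, is ignored):

```shell
# Expected size of uncompressed PCM audio, in bytes.
# Args: duration in milliseconds, then optionally sample rate (Hz),
# channel count, and bytes per sample.
# Defaults match the conversion above: 16 kHz, mono, 16-bit (2 bytes).
pcm_bytes() {
    duration_ms=$1
    rate=${2:-16000}
    channels=${3:-1}
    sample_bytes=${4:-2}
    echo $(( duration_ms * rate * channels * sample_bytes / 1000 ))
}

bytes=$(pcm_bytes 5800)          # the 5.80 s recording
echo "$bytes bytes"              # 185600 bytes
echo "$(( bytes / 1024 )) KiB"   # 181 KiB
```

The result matches the 181 KB reported for the test conversion, which suggests the output really is 16-bit, 16 kHz mono.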
Usage
File Transcription
whisper-cli \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--language en \
--file /path/to/audio.wav
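To transcribe several files and keep the text, the command can be wrapped in a loop that adds --output-txt, which writes a .txt file next to each input. The sketch below is one possible wrapper, not a whisper.cpp feature: WHISPER_MODEL, DRY_RUN, and transcribe_all are names made up for this example.

```shell
# Model path; override by setting WHISPER_MODEL (a name made up for this sketch).
WHISPER_MODEL=${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}

# Transcribe each WAV file given as an argument and write a .txt transcript
# next to it via --output-txt. With DRY_RUN=1 the commands are only printed.
transcribe_all() {
    for wav in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "whisper-cli --model $WHISPER_MODEL --language en --output-txt --file $wav"
        else
            whisper-cli --model "$WHISPER_MODEL" --language en --output-txt --file "$wav"
        fi
    done
}

# Usage:
# DRY_RUN=1 transcribe_all ~/Downloads/*.wav   # preview the commands
# transcribe_all ~/Downloads/*.wav             # run them
```

The dry-run mode makes it easy to confirm the model path and file list before starting a long batch.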
Real-Time Microphone Streaming
whisper-stream \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--language en
This is particularly relevant for LysnrAI: it provides real-time microphone transcription locally, so dev and testing do not need the Azure Speech SDK.
HTTP Server Mode
# Start server on port 8080
whisper-server \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--port 8080
# POST audio to get transcription
curl -X POST http://localhost:8080/inference \
-F "file=@audio.wav" \
-F "response_format=json"
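For scripted use, the POST above can be wrapped with basic error handling so that a missing file or an HTTP error produces a non-zero exit status. A minimal sketch, assuming the server is already running with the settings shown; WHISPER_URL and transcribe_via_server are names made up for this example.

```shell
# Endpoint of a running whisper-server (variable name made up for this sketch).
WHISPER_URL=${WHISPER_URL:-http://localhost:8080/inference}

# POST one audio file to the server and print the raw JSON reply.
transcribe_via_server() {
    if [ ! -r "$1" ]; then
        echo "cannot read audio file: $1" >&2
        return 1
    fi
    # --fail makes curl exit non-zero on HTTP errors; --max-time bounds the wait.
    curl --silent --show-error --fail --max-time 120 \
        -F "file=@$1" \
        -F "response_format=json" \
        "$WHISPER_URL"
}

# Usage:
# transcribe_via_server audio.wav
```

Checking the file before calling curl gives a clearer error message than curl's own failure to open the upload.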
Voice Activity Detection
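# Requires a VAD model (for example Silero), not a Whisper transcription model.
# The file name below is an assumption; point it at your downloaded VAD model.
whisper-vad-speech-segments \
--vad-model ~/whisper-models/ggml-silero-v5.1.2.bin \
--file recording.wav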
Benchmarking
whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
Integration with LysnrAI
The local Whisper stack can serve as an offline fallback or dev replacement for Azure Speech SDK:
| Component | Azure (production) | Whisper.cpp (local dev) |
|---|---|---|
| Real-time STT | Azure Speech SDK | whisper-stream |
| File transcription | Azure Batch | whisper-cli |
| HTTP API | Azure REST API | whisper-server |
| Cost | Pay-per-use | $0.00 (local) |
| Latency | ~200ms (network) | ~50ms (local Metal) |
| Languages | 100+ | 100+ (same Whisper model) |
Potential Integration Points
- Desktop app (src/audio/azure_stt.py): Add local Whisper backend option
- iOS keyboard (LysnrKeyboard/): Use on-device Whisper for offline dictation
- Extraction service evals: Transcribe test audio fixtures locally
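The "local backend option" idea comes down to a simple availability check: use Whisper when both the CLI and a model file are present, otherwise fall back to Azure. The sketch below shows only that decision in shell. How LysnrAI would actually wire it in is not described here, and STT_MODEL and pick_stt_backend are names made up for this example.

```shell
# Print "whisper" when both the CLI and a model file are available, else "azure".
# STT_MODEL is a variable name made up for this sketch.
pick_stt_backend() {
    model=${STT_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}
    if command -v whisper-cli >/dev/null 2>&1 && [ -f "$model" ]; then
        echo whisper
    else
        echo azure
    fi
}

echo "STT backend: $(pick_stt_backend)"
```

With a check like this, a machine where the model download is still blocked would keep using Azure.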