learning_ai_common_plat/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
saravanakumardb1 464ffb92ec docs(local-llm): add docs index, hardware specs, and whisper-cpp setup
- docs/README.md: documentation index with quick start, file structure, status table
- docs/01-hardware-and-prerequisites.md: M4 Pro 48GB specs, toolchain inventory,
  disk budget, network environment (Forcepoint proxy details)
- docs/03-whisper-cpp-setup.md: whisper-cpp installation, GGML model guide,
  ffmpeg audio conversion, CLI usage, real-time streaming, LysnrAI integration
2026-02-19 13:00:48 -08:00

03 — Whisper.cpp Setup

Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.


Installation

brew install whisper-cpp
  • Version installed: 1.8.3
  • Dependency installed: sdl2 2.32.10 (audio I/O)
  • Binary location: /opt/homebrew/bin/whisper-*

Installed Binaries

| Binary | Purpose |
| --- | --- |
| whisper-cli | Main CLI: transcribe audio files |
| whisper-server | HTTP server: POST audio, get JSON transcription |
| whisper-stream | Real-time microphone streaming transcription |
| whisper-talk-llama | Voice → Whisper → LLaMA → TTS pipeline |
| whisper-bench | Benchmark a model on your hardware |
| whisper-command | Voice command detection |
| whisper-lsp | Language Server Protocol integration |
| whisper-quantize | Quantize models to smaller formats |
| whisper-vad-speech-segments | Voice Activity Detection: split audio by speech |

Note: The binary is whisper-cli, NOT whisper-cpp. The brew formula name differs from the binary name.
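
To confirm the install, one option is to check that the binaries listed above resolve on PATH. The sketch below uses only the shell built-in `command -v`; the helper name `missing_bins` is hypothetical and not part of whisper.cpp.

```shell
# Print every name from the arguments that does not resolve on PATH.
# missing_bins is a hypothetical helper, not part of whisper.cpp.
missing_bins() {
  for bin in "$@"; do
    command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
  done
}

# Binary names taken from the table above; no output means all were found.
missing_bins whisper-cli whisper-server whisper-stream whisper-bench
```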


GGML Model Files

Whisper.cpp requires separate GGML model files (not included with brew install).

Model Size Guide

| Model | File | Disk Size | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| Tiny (English) | ggml-tiny.en.bin | 75 MB | Blazing | Basic |
| Base (English) | ggml-base.en.bin | 142 MB | Very fast | Good |
| Medium (English) | ggml-medium.en.bin | 1.5 GB | Fast | Great |
| Large v3 | ggml-large-v3.bin | 3.1 GB | Moderate | Best |
| Large v3 Turbo | ggml-large-v3-turbo.bin | 1.6 GB | Fast | Near large-v3 (distilled) |

Recommended for M4 Pro: ggml-large-v3-turbo. It comes close to large-v3 accuracy at roughly half the size and runs Metal-accelerated.

Download Models

Models are stored in ~/whisper-models/.

mkdir -p ~/whisper-models

# Recommended: Large v3 Turbo (~1.6 GB)
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Alternative mirror
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin

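Download sources: Hugging Face (https://huggingface.co/ggerganov/whisper.cpp) and the ggml mirror (https://ggml.ggerganov.com/whisper/), as used in the commands above.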

Current Status (2026-02-19)

The model download is currently blocked by the corporate proxy (Forcepoint CertChecker intercepts Hugging Face HTTPS traffic). Download the model from a personal or home network instead. See 08-troubleshooting.md.
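
When a proxy intercepts the request, curl may still exit successfully and leave an HTML block page saved under the model filename. The sketch below is one way to catch that before pointing whisper-cli at the file. The helper name `looks_like_block_page` is hypothetical, and the 1 MiB size floor is an arbitrary assumption (the smallest model listed in this document is about 75 MB).

```shell
# Return 0 (true) if the file looks like an HTML page rather than a model.
# looks_like_block_page is a hypothetical helper; the 1 MiB floor is an
# arbitrary assumption, far below the size of any model listed here.
looks_like_block_page() {
  file=$1
  [ -f "$file" ] || return 0                  # missing file: treat as bad
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -lt 1048576 ] && return 0         # far too small for a model
  # An intercepted download usually starts with an HTML doctype or <html> tag.
  head -c 64 "$file" | LC_ALL=C grep -qiE '<!doctype|<html' && return 0
  return 1
}

model=~/whisper-models/ggml-large-v3-turbo.bin
if looks_like_block_page "$model"; then
  echo "download looks wrong, retry from another network: $model" >&2
fi
```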


Audio Format Requirements

Whisper.cpp requires WAV format input (16-bit PCM, ideally 16 kHz mono).

ffmpeg Installation

brew install ffmpeg

Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)

Converting Audio Files

# m4a → wav (16kHz mono, optimal for Whisper)
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav

# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Any format → wav
ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav

Tested Conversion (2026-02-19)

ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
       -ar 16000 -ac 1 \
       '/Users/sd9235/Downloads/recording.wav'
# Result: 181 KB, 5.80 seconds, 16kHz mono
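
The reported size can be sanity-checked by hand: uncompressed 16-bit PCM takes sample rate × channels × 2 bytes per second of audio, plus a header. The sketch below assumes the canonical 44-byte WAV header (ffmpeg may add a few bytes of metadata on top); `wav_pcm16_bytes` is a hypothetical helper name.

```shell
# Expected size in bytes of a 16-bit PCM WAV file.
# Arguments: sample rate (Hz), channel count, duration in milliseconds.
# Assumes the canonical 44-byte RIFF/WAVE header; wav_pcm16_bytes is a
# hypothetical helper name.
wav_pcm16_bytes() {
  echo $(( $1 * $2 * 2 * $3 / 1000 + 44 ))
}

bytes=$(wav_pcm16_bytes 16000 1 5800)   # 5.80 s at 16 kHz mono
echo "$bytes bytes, about $(( bytes / 1024 )) KiB"
```

For the tested recording this gives 185644 bytes, about 181 KiB, which agrees with the 181 KB result above.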

Usage

File Transcription

whisper-cli \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en \
  --file /path/to/audio.wav
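
Conversion and transcription can be chained in a small wrapper. The following is a minimal sketch, not a tested tool: `transcribe_any`, `run`, and the `DRY_RUN` and `MODEL` variables are hypothetical names, while the ffmpeg and whisper-cli flags are the ones already used in this document (plus ffmpeg's `-y` to overwrite an existing output file). It treats any input ending in .wav as ready to use, so a WAV at a different sample rate would still need converting by hand.

```shell
# Transcribe any audio file: convert to 16 kHz mono WAV first unless the
# input already ends in .wav. transcribe_any, run, DRY_RUN and MODEL are
# hypothetical names introduced for this sketch.
MODEL=${MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}

run() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

transcribe_any() {
  input=$1
  case $input in
    *.wav) wav=$input ;;
    *)     wav=${input%.*}.wav
           run ffmpeg -y -i "$input" -ar 16000 -ac 1 "$wav" || return 1 ;;
  esac
  run whisper-cli --model "$MODEL" --language en --file "$wav"
}
```

Running `DRY_RUN=1; transcribe_any talk.m4a` prints the two commands without executing them, which is a cheap way to check paths before a long transcription.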

Real-Time Microphone Streaming

whisper-stream \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en

This is particularly relevant for LysnrAI — real-time mic transcription locally, no Azure Speech SDK needed for dev/testing.

HTTP Server Mode

# Start server on port 8080
whisper-server \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --port 8080

# POST audio to get transcription
curl -X POST http://localhost:8080/inference \
  -F "file=@audio.wav" \
  -F "response_format=json"

Voice Activity Detection

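# Speech segmentation needs a dedicated VAD model (for example Silero),
# not a Whisper transcription model. The model path below is an assumption;
# download the VAD model separately.
whisper-vad-speech-segments \
  --vad-model ~/whisper-models/ggml-silero-v5.1.2.bin \
  --file recording.wav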

Benchmarking

whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin

Integration with LysnrAI

The local Whisper stack can serve as an offline fallback or dev replacement for Azure Speech SDK:

| Component | Azure (production) | Whisper.cpp (local dev) |
| --- | --- | --- |
| Real-time STT | Azure Speech SDK | whisper-stream |
| File transcription | Azure Batch | whisper-cli |
| HTTP API | Azure REST API | whisper-server |
| Cost | Pay-per-use | $0.00 (local) |
| Latency | ~200ms (network) | ~50ms (local Metal) |
| Languages | 100+ | 100+ (same Whisper model) |
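
For dev scripts, the mapping in the table can be captured in one lookup so callers ask for a task rather than a binary. This is only an illustrative sketch: `local_stt_tool` and the task names `realtime`, `file`, and `http` are hypothetical, and the Azure side is not modelled.

```shell
# Map a task to the local whisper.cpp binary from the table above.
# local_stt_tool and its task names are hypothetical.
local_stt_tool() {
  case $1 in
    realtime) echo whisper-stream ;;
    file)     echo whisper-cli ;;
    http)     echo whisper-server ;;
    *)        echo "unknown task: $1" >&2; return 1 ;;
  esac
}
```

A caller could then write `"$(local_stt_tool file)" --model "$MODEL" --file audio.wav`, keeping the binary names in one place.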

Potential Integration Points

  1. Desktop app (src/audio/azure_stt.py): Add local Whisper backend option
  2. iOS keyboard (LysnrKeyboard/): Use on-device Whisper for offline dictation
  3. Extraction service evals: Transcribe test audio fixtures locally