docs(local-llm): add docs index, hardware specs, and whisper-cpp setup

- docs/README.md: documentation index with quick start, file structure, status table
- docs/01-hardware-and-prerequisites.md: M4 Pro 48GB specs, toolchain inventory,
  disk budget, network environment (Forcepoint proxy details)
- docs/03-whisper-cpp-setup.md: whisper-cpp installation, GGML model guide,
  ffmpeg audio conversion, CLI usage, real-time streaming, LysnrAI integration
saravanakumardb1 2026-02-19 13:00:48 -08:00
parent 798a85e88b
commit 464ffb92ec
3 changed files with 378 additions and 0 deletions

@ -0,0 +1,110 @@
# 01 — Hardware & Prerequisites
> Machine specs, installed toolchain, and resource budgets for local LLM inference.
---
## Hardware Specs
| Component | Value |
| ----------------------- | ---------------------------------------- |
| **Model** | MacBook Pro (Mac16,7) |
| **Model Number** | Z1FU0002HLL/A |
| **Chip** | Apple M4 Pro |
| **CPU Cores** | 14 (10 Performance + 4 Efficiency) |
| **GPU** | Apple Silicon integrated (Metal backend) |
| **Neural Engine** | 16-core |
| **Memory** | 48 GB LPDDR5 (unified, shared CPU/GPU) |
| **Memory Manufacturer** | Micron |
| **OS** | macOS Tahoe (arm64) |
| **Serial** | KX6VMGJWM6 |
### Why This Hardware Matters for LLMs
Apple Silicon's **unified memory architecture** means the GPU and CPU share the same 48 GB pool. This is ideal for LLM inference because:
1. No PCIe bottleneck copying weights between CPU RAM and VRAM
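2. Models up to ~45 GB can in principle run entirely "on GPU" via Metal, although macOS by default caps GPU-addressable memory at a fraction of total RAM, so the largest models may need that limit raised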
3. Ollama uses `llama.cpp` under the hood, which has excellent Metal backend support
4. Inference runs on the GPU through Metal; the Neural Engine is not used by `llama.cpp` or Ollama, so it does not factor into model sizing
### What You Can Run
| RAM Budget | Model Size | Examples |
| ---------- | --------------- | -------------------------------------------------- |
| 5-8 GB     | 7-8B models     | qwen2.5-coder:7b, llama3.1:8b, deepseek-coder:6.7b |
| 10-14 GB   | 14-22B models   | deepseek-coder-v2:16b, codestral:22b, phi4:14b     |
| 20-24 GB | 32B models | qwen2.5-coder:32b, deepseek-r1:32b |
| 40-45 GB | 70B models (Q4) | llama3.1:70b — tight, leaves little headroom |
**Rule of thumb:** Keep at least 6-8 GB free for macOS + dev tools (Xcode, VS Code, Docker, etc.).
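The budgets in the table can be approximated from parameter count and quantization width. The sketch below is a rough estimate only: it assumes about 4.8 bits per weight for 4-bit quantized builds (an assumption, not a figure published by Ollama) and ignores KV cache and runtime overhead. `estimate_model_gb` and `fits_in_ram` are illustrative helper names.

```bash
#!/bin/sh
# Rough weight size in GB: parameters (billions) x bits per weight / 8.
# The 4.8 bits/weight default is an assumed average for 4-bit quantization.
estimate_model_gb() {
  awk -v p="$1" -v b="${2:-4.8}" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds when the model fits after reserving headroom for macOS and dev tools.
# Arguments: model size in GB, total RAM in GB (default 48), reserve in GB (default 8).
fits_in_ram() {
  awk -v m="$1" -v t="${2:-48}" -v r="${3:-8}" 'BEGIN { exit !(m <= t - r) }'
}

for size in 8 32 70; do
  gb=$(estimate_model_gb "$size")
  if fits_in_ram "$gb"; then verdict="fits"; else verdict="too tight"; fi
  echo "${size}B -> ~${gb} GB (${verdict})"
done
```

With an 8 GB reserve, the 70B estimate lands above the remaining 40 GB, which matches the "tight" note in the table.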
---
## Installed Toolchain
Verified on 2026-02-19.
### Brew Packages
| Package | Version | Purpose |
| ------------- | ------- | ------------------------------------------ |
| `ollama` | 0.16.2 | LLM inference server (llama.cpp + Metal) |
| `whisper-cpp` | 1.8.3 | Local speech-to-text (Whisper GGML) |
| `ffmpeg` | 8.0.1 | Audio/video format conversion |
| `sdl2` | 2.32.10 | Audio I/O library (whisper-cpp dependency) |
### Key Binaries
```
/opt/homebrew/bin/ollama
/opt/homebrew/bin/whisper-cli
/opt/homebrew/bin/whisper-server
/opt/homebrew/bin/whisper-stream
/opt/homebrew/bin/whisper-talk-llama
/opt/homebrew/bin/whisper-bench
/opt/homebrew/bin/whisper-command
/opt/homebrew/bin/whisper-lsp
/opt/homebrew/bin/whisper-quantize
/opt/homebrew/bin/whisper-vad-speech-segments
/opt/homebrew/bin/ffmpeg
```
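To re-verify the inventory later, a check along these lines could be used. It is a minimal sketch that relies only on `command -v`; `missing_tools` is an illustrative helper name, and the tool list is a subset of the binaries above.

```bash
#!/bin/sh
# Print every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools ollama whisper-cli whisper-server whisper-stream ffmpeg)
if [ -z "$missing" ]; then
  echo "all tools found"
else
  echo "missing: $(echo "$missing" | tr '\n' ' ')"
fi
```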
### Storage Locations
| Path | Content |
| ----------------------- | --------------------------------------------------------- |
| `~/.ollama/models/` | Downloaded Ollama models (~24 GB currently) |
| `~/whisper-models/` | Whisper GGML model files (empty — proxy blocked download) |
| `/opt/homebrew/Cellar/` | Brew package binaries |
---
## Network Environment
This machine is on a **corporate network** with a Forcepoint proxy:
- **Proxy:** `http://cso.proxy.att.com:8080/`
- **SSL Inspection:** Forcepoint CertChecker intercepts HTTPS connections
- **Impact:**
  - Ollama model pulls work (Ollama handles proxy natively)
  - Hugging Face downloads FAIL (curl, Python requests, huggingface_hub all blocked)
  - Brew installs work (brew handles proxy)
**Workaround:** Download Hugging Face models (e.g., Whisper GGML files) from a personal/home network. See [08-troubleshooting.md](08-troubleshooting.md).
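When a download fails behind the proxy, curl's exit code gives a first hint at the cause. The helper below is a sketch that maps a few exit codes documented in the curl manual to likely causes. Which code the Forcepoint proxy actually produces on this machine is not recorded here, so treat the mapping as a starting point; `explain_curl_exit` is an illustrative name.

```bash
#!/bin/sh
# Map common curl exit codes to a likely cause on a proxied network.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    5)  echo "could not resolve proxy host" ;;
    6)  echo "could not resolve target host" ;;
    7)  echo "connection refused or blocked" ;;
    35) echo "TLS handshake failed" ;;
    60) echo "certificate not trusted (typical of SSL inspection)" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Usage (run manually; needs network access):
#   curl -sS -o /dev/null -I https://huggingface.co; explain_curl_exit $?
explain_curl_exit 60
```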
---
## Disk Space Budget
Approximate allocation for local AI tooling:
| Component | Disk Usage |
| ------------------------------------------- | ---------- |
| Ollama models (2 installed) | ~24 GB |
| Whisper models (planned) | ~1.6 GB |
| Brew packages (ollama, whisper-cpp, ffmpeg) | ~70 MB |
| Dashboard app (node_modules) | ~300 MB |
| **Total** | **~26 GB** |
With 10 Ollama models (see [07-model-recommendations.md](07-model-recommendations.md)), expect **~115 GB** total disk usage for models.
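The total in the table can be reproduced by summing its rows. This is a small sketch using `awk` for the decimal arithmetic; `sum_gb` is an illustrative helper name.

```bash
#!/bin/sh
# Sum a list of sizes in GB and print the total with one decimal place.
sum_gb() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Rows from the table: Ollama models, Whisper models, brew packages, dashboard.
sum_gb 24 1.6 0.07 0.3   # prints 26.0

# Actual on-disk usage can be compared with du (paths from "Storage Locations"):
#   du -sh ~/.ollama/models ~/whisper-models
```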

@ -0,0 +1,182 @@
# 03 — Whisper.cpp Setup
> Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.
---
## Installation
```bash
brew install whisper-cpp
```
- **Version installed:** 1.8.3
- **Dependency installed:** sdl2 2.32.10 (audio I/O)
- **Binary location:** `/opt/homebrew/bin/whisper-*`
### Installed Binaries
| Binary | Purpose |
| ----------------------------- | ------------------------------------------------ |
| `whisper-cli` | **Main CLI** — transcribe audio files |
| `whisper-server` | HTTP server — POST audio, get JSON transcription |
| `whisper-stream` | **Real-time** microphone streaming transcription |
| `whisper-talk-llama` | Voice → Whisper → LLaMA → TTS pipeline |
| `whisper-bench` | Benchmark a model on your hardware |
| `whisper-command` | Voice command detection |
| `whisper-lsp` | Language Server Protocol integration |
| `whisper-quantize` | Quantize models to smaller formats |
| `whisper-vad-speech-segments` | Voice Activity Detection — split audio by speech |
> **Note:** The binary is `whisper-cli`, NOT `whisper-cpp`. The brew formula name differs from the binary name.
---
## GGML Model Files
Whisper.cpp requires separate GGML model files (not included with brew install).
### Model Size Guide
| Model | File | Disk Size | Speed | Accuracy |
| ------------------ | ----------------------------- | ---------- | --------- | -------------------- |
| Tiny (English) | `ggml-tiny.en.bin` | 75 MB | Blazing | Basic |
| Base (English) | `ggml-base.en.bin` | 142 MB | Very fast | Good |
| Medium (English) | `ggml-medium.en.bin` | 1.5 GB | Fast | Great |
| Large v3 | `ggml-large-v3.bin` | 3.1 GB | Moderate | Best |
| **Large v3 Turbo** | **`ggml-large-v3-turbo.bin`** | **1.6 GB** | **Fast** | **Near large-v3 (pruned decoder)** |
**Recommended for M4 Pro:** `ggml-large-v3-turbo`, which gives accuracy close to large-v3 at roughly half the size and is Metal-accelerated.
### Download Models
Models are stored in `~/whisper-models/`.
```bash
mkdir -p ~/whisper-models
# Recommended: Large v3 Turbo (~1.6 GB)
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
# Alternative mirror
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
```
**Download sources:**
- https://huggingface.co/ggerganov/whisper.cpp/tree/main
- https://ggml.ggerganov.com/
### Current Status (2026-02-19)
**Model download blocked by corporate proxy** (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See [08-troubleshooting.md](08-troubleshooting.md).
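A filtering proxy can return an HTML block page while curl still exits successfully, so it is worth sanity-checking a downloaded file before pointing `whisper-cli` at it. The sketch below checks only the size and whether the file starts like an HTML page. It does not validate the GGML format, and the 1 GB threshold is an assumption based on the ~1.6 GB size listed above. `looks_like_model` is an illustrative name.

```bash
#!/bin/sh
# Succeeds when FILE is at least MIN_BYTES long and does not start like an HTML page.
looks_like_model() {
  file=$1
  min_bytes=${2:-1000000}
  [ -f "$file" ] || return 1
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -ge "$min_bytes" ] || return 1
  if head -c 512 "$file" | LC_ALL=C grep -Eqi '<html|<!doctype'; then
    return 1
  fi
  return 0
}

model="$HOME/whisper-models/ggml-large-v3-turbo.bin"
if looks_like_model "$model" 1000000000; then
  echo "model file looks plausible"
else
  echo "model file missing, too small, or an HTML page"
fi
```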
---
## Audio Format Requirements
Whisper.cpp requires **WAV** format input (16-bit PCM, ideally 16 kHz mono).
### ffmpeg Installation
```bash
brew install ffmpeg
```
Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)
### Converting Audio Files
```bash
# m4a → wav (16 kHz mono, 16-bit PCM; optimal for Whisper)
ffmpeg -i input.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav
# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
# Any other format ffmpeg can decode (e.g. ogg) → wav
ffmpeg -i input.ogg -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
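For a folder of recordings, the same conversion can be looped. This is a minimal sketch that assumes the inputs are `.m4a` files in the current directory; `wav_name` is an illustrative helper, and `-c:a pcm_s16le` simply makes ffmpeg's default 16-bit PCM WAV encoding explicit.

```bash
#!/bin/sh
# Derive the output name: strip the last extension and append .wav.
wav_name() {
  printf '%s\n' "${1%.*}.wav"
}

# Convert every .m4a in the current directory; skip quietly if none exist
# and stop with a message if ffmpeg is not installed.
for src in ./*.m4a; do
  [ -e "$src" ] || continue
  command -v ffmpeg >/dev/null 2>&1 || { echo "ffmpeg not found"; break; }
  ffmpeg -nostdin -i "$src" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name "$src")"
done
```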
### Tested Conversion (2026-02-19)
```bash
ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
-ar 16000 -ac 1 \
'/Users/sd9235/Downloads/recording.wav'
# Result: 181 KB, 5.80 seconds, 16kHz mono
```
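The reported size is consistent with uncompressed 16-bit PCM: duration × sample rate × channels × 2 bytes. The sketch below reproduces that arithmetic, assuming a minimal 44-byte WAV header (ffmpeg may add a little metadata on top); `wav_bytes` is an illustrative name.

```bash
#!/bin/sh
# Expected PCM WAV size in bytes: seconds x rate x channels x bits / 8, plus header.
# Arguments: seconds, sample rate (default 16000), channels (default 1), bits (default 16).
wav_bytes() {
  awk -v s="$1" -v r="${2:-16000}" -v c="${3:-1}" -v b="${4:-16}" \
    'BEGIN { printf "%.0f\n", s * r * c * b / 8 + 44 }'
}

bytes=$(wav_bytes 5.80)
echo "$bytes bytes = $((bytes / 1024)) KiB"   # prints "185644 bytes = 181 KiB"
```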
---
## Usage
### File Transcription
```bash
whisper-cli \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--language en \
--file /path/to/audio.wav
```
### Real-Time Microphone Streaming
```bash
whisper-stream \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--language en
```
This is particularly relevant for **LysnrAI** — real-time mic transcription locally, no Azure Speech SDK needed for dev/testing.
### HTTP Server Mode
```bash
# Start server on port 8080
whisper-server \
--model ~/whisper-models/ggml-large-v3-turbo.bin \
--port 8080
# POST audio to get transcription
curl -X POST http://localhost:8080/inference \
-F "file=@audio.wav" \
-F "response_format=json"
```
### Voice Activity Detection
```bash
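# This tool expects a VAD model (Silero, converted to GGML), not a Whisper model.
# The file name below is an assumption; use whichever VAD model you downloaded.
whisper-vad-speech-segments \
  --vad-model ~/whisper-models/ggml-silero-v5.1.2.bin \
  --file recording.wav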
```
### Benchmarking
```bash
whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
```
---
## Integration with LysnrAI
The local Whisper stack can serve as an **offline fallback** or **dev replacement** for Azure Speech SDK:
| Component | Azure (production) | Whisper.cpp (local dev) |
| ------------------ | ------------------ | ------------------------- |
| Real-time STT | Azure Speech SDK | `whisper-stream` |
| File transcription | Azure Batch | `whisper-cli` |
| HTTP API | Azure REST API | `whisper-server` |
| Cost | Pay-per-use | $0.00 (local) |
| Latency            | ~200 ms (network, estimate) | ~50 ms (local Metal, estimate; not yet benchmarked) |
| Languages          | 100+               | ~99 (multilingual Whisper models; `.en` files are English-only) |
### Potential Integration Points
1. **Desktop app** (`src/audio/azure_stt.py`): Add local Whisper backend option
2. **iOS keyboard** (`LysnrKeyboard/`): Use on-device Whisper for offline dictation
3. **Extraction service evals**: Transcribe test audio fixtures locally
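One way to wire in a local backend option is a small selector that prefers local Whisper when the model file is present. The sketch below is hypothetical: the `STT_BACKEND` variable and the `choose_stt_backend` function are not part of LysnrAI or of this stack's documented environment variables.

```bash
#!/bin/sh
# Hypothetical selector: honor STT_BACKEND if set; otherwise use local Whisper
# when the model file exists and fall back to Azure when it does not.
choose_stt_backend() {
  model=${1:-"$HOME/whisper-models/ggml-large-v3-turbo.bin"}
  if [ -n "${STT_BACKEND:-}" ]; then
    echo "$STT_BACKEND"
  elif [ -f "$model" ]; then
    echo "whisper"
  else
    echo "azure"
  fi
}

echo "STT backend: $(choose_stt_backend)"
```

Keeping Azure as the fallback means a machine without the GGML file, such as this one while the download is blocked, behaves as before.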

@ -0,0 +1,86 @@
# Local LLM Stack — Documentation Index
> Complete guide for the local AI inference stack on the ByteLyst development machine.
> Hardware: **Apple M4 Pro · 48 GB LPDDR5 · macOS Tahoe**
> Last updated: 2026-02-19
---
## Quick Start
```bash
# 1. Start Ollama
ollama serve # or: brew services start ollama
# 2. Load a model
ollama run qwen2.5-coder:32b # best coding model for this hardware
# 3. Launch Mission Control dashboard
cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
# Open http://localhost:3100
```
---
## Documentation
| # | Document | Description |
| --- | ------------------------------------------------------------ | -------------------------------------------------------------------- |
| 01 | [Hardware & Prerequisites](01-hardware-and-prerequisites.md) | Machine specs, installed toolchain, disk/RAM budget |
| 02 | [Ollama Setup & Models](02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior |
| 03 | [Whisper.cpp Setup](03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI usage, real-time streaming |
| 04 | [Multimodal Local Stack](04-multimodal-local-stack.md) | Vision models, audio pipeline, video understanding status |
| 05 | [Mission Control Dashboard](05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features, running |
| 06 | [Extraction Service Evals](06-extraction-service-evals.md) | promptfoo eval suite, Ollama vs Gemini comparison, Python sidecar |
| 07 | [Model Recommendations](07-model-recommendations.md) | Tiered model guide by use case, size, and quality for M4 Pro 48GB |
| 08 | [Troubleshooting & Corporate Proxy](08-troubleshooting.md) | Common issues, Forcepoint proxy workarounds, MLX warnings |
| 09 | [Environment Variables](09-environment-variables.md) | All config vars for Ollama, Whisper, dashboard, evals |
---
## Directory Structure
```
__LOCAL_LLMs/
├── README.md ← top-level README (moved from LOCAL_LLMs_setup_mac_m4_48gb.md)
├── docs/
│ ├── README.md ← this index (you are here)
│ ├── 01-hardware-and-prerequisites.md
│ ├── 02-ollama-setup-and-models.md
│ ├── 03-whisper-cpp-setup.md
│ ├── 04-multimodal-local-stack.md
│ ├── 05-mission-control-dashboard.md
│ ├── 06-extraction-service-evals.md
│ ├── 07-model-recommendations.md
│ ├── 08-troubleshooting.md
│ └── 09-environment-variables.md
├── dashboard/ ← Next.js Mission Control app (port 3100)
│ ├── src/app/page.tsx ← main dashboard UI
│ ├── src/app/api/ollama/route.ts ← Ollama API proxy (list, load, unload, generate)
│ ├── src/app/api/whisper/route.ts ← Whisper binary/model discovery
│ └── src/app/api/system/route.ts ← System info (chip, RAM, disk, brew)
└── LOCAL_LLMs_setup_mac_m4_48gb.md ← original doc (preserved, see docs/ for latest)
```
---
## Current Installation Status (2026-02-19)
| Component | Version | Status | Disk Usage |
| ----------------------------------- | ---------- | ----------------------------- | ---------- |
| Ollama | 0.16.2 | ✅ Installed via brew | — |
| qwen2.5-coder:32b | — | ✅ Downloaded | 19 GB |
| llama3.1:8b | — | ✅ Downloaded | 4.9 GB |
| whisper-cpp | 1.8.3 | ✅ Installed via brew | 9.6 MB |
| whisper model (ggml-large-v3-turbo) | — | ❌ Blocked by corporate proxy | — |
| ffmpeg | 8.0.1 | ✅ Installed via brew | 53.3 MB |
| Mission Control Dashboard | Next.js 16 | ✅ Built, runs on :3100 | — |
---
## Related Resources
- **Extraction service evals:** `services/extraction-service/evals/`
- **Ollama REST API docs:** https://github.com/ollama/ollama/blob/main/docs/api.md
- **Whisper.cpp:** https://github.com/ggerganov/whisper.cpp
- **Hugging Face models:** https://huggingface.co/ggerganov/whisper.cpp/tree/main