docs(local-llm): add docs index, hardware specs, and whisper-cpp setup

- docs/README.md: documentation index with quick start, file structure, status table
- docs/01-hardware-and-prerequisites.md: M4 Pro 48GB specs, toolchain inventory, disk budget, network environment (Forcepoint proxy details)
- docs/03-whisper-cpp-setup.md: whisper-cpp installation, GGML model guide, ffmpeg audio conversion, CLI usage, real-time streaming, LysnrAI integration
2026-02-19 13:00:48 -08:00
commit 464ffb92ec
parent 798a85e88b
3 changed files with 378 additions and 0 deletions
--- a/__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md
+++ b/__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md
@@ -0,0 +1,110 @@
+# 01 — Hardware & Prerequisites
+
+> Machine specs, installed toolchain, and resource budgets for local LLM inference.
+
+---
+
+## Hardware Specs
+
+| Component               | Value                                    |
+| ----------------------- | ---------------------------------------- |
+| **Model**               | MacBook Pro (Mac16,7)                    |
+| **Model Number**        | Z1FU0002HLL/A                            |
+| **Chip**                | Apple M4 Pro                             |
+| **CPU Cores**           | 14 (10 Performance + 4 Efficiency)       |
+| **GPU**                 | Apple Silicon integrated (Metal backend) |
+| **Neural Engine**       | 16-core                                  |
+| **Memory**              | 48 GB LPDDR5 (unified, shared CPU/GPU)   |
+| **Memory Manufacturer** | Micron                                   |
+| **OS**                  | macOS Tahoe (arm64)                      |
+| **Serial**              | KX6VMGJWM6                               |
+
+### Why This Hardware Matters for LLMs
+
+Apple Silicon's **unified memory architecture** means the GPU and CPU share the same 48 GB pool. This is ideal for LLM inference because:
+
+1. No PCIe bottleneck copying weights between CPU RAM and VRAM
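+2. Models up to ~45 GB fit in unified memory and can run "on GPU" via Metal; by default macOS reserves part of the 48 GB for the system, so the largest models may be partly offloaded to CPU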
+3. Ollama uses `llama.cpp` under the hood, which has excellent Metal backend support
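+4. The 16-core Neural Engine can accelerate some workloads (for example, the Whisper encoder when whisper.cpp is built with Core ML support), but `llama.cpp` and Ollama run on the GPU via Metal, not on the Neural Engine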
+
+### What You Can Run
+
+| RAM Budget | Model Size      | Examples                                           |
+| ---------- | --------------- | -------------------------------------------------- |
+| 5-8 GB     | 7-8B models     | qwen2.5-coder:7b, llama3.1:8b, deepseek-coder:6.7b |
+| 10-14 GB   | 14-22B models   | deepseek-coder-v2:16b, codestral:22b, phi4:14b     |
+| 20-24 GB   | 32B models      | qwen2.5-coder:32b, deepseek-r1:32b                 |
+| 40-45 GB   | 70B models (Q4) | llama3.1:70b — tight, leaves little headroom       |
+
+**Rule of thumb:** Keep at least 6-8 GB free for macOS + dev tools (Xcode, VS Code, Docker, etc.).
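
The RAM budgets in the table can be approximated from parameter count and quantization width. The sketch below is a rough estimate only: it assumes about 4.5 bits per weight for a 4-bit quantization (an assumed average, not a figure from this document), ignores KV cache and runtime overhead, and uses a hypothetical helper name, `estimate_model_gb`.

```bash
# Rough weight footprint in GB: parameters (billions) * bits per weight / 8.
# Assumption: ~4.5 bits/weight for 4-bit quantization; KV cache is not included.
estimate_model_gb() (
  params_b=$1
  bits=${2:-4.5}
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
)

estimate_model_gb 32   # e.g. qwen2.5-coder:32b
estimate_model_gb 70   # e.g. llama3.1:70b
```

Under these assumptions a 32B model needs about 18 GB for weights and a 70B model about 39 GB, before adding a few GB for context, which is why the 70B row leaves so little headroom.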
+
+---
+
+## Installed Toolchain
+
+Verified on 2026-02-19.
+
+### Brew Packages
+
+| Package       | Version | Purpose                                    |
+| ------------- | ------- | ------------------------------------------ |
+| `ollama`      | 0.16.2  | LLM inference server (llama.cpp + Metal)   |
+| `whisper-cpp` | 1.8.3   | Local speech-to-text (Whisper GGML)        |
+| `ffmpeg`      | 8.0.1   | Audio/video format conversion              |
+| `sdl2`        | 2.32.10 | Audio I/O library (whisper-cpp dependency) |
+
+### Key Binaries
+
+```
+/opt/homebrew/bin/ollama
+/opt/homebrew/bin/whisper-cli
+/opt/homebrew/bin/whisper-server
+/opt/homebrew/bin/whisper-stream
+/opt/homebrew/bin/whisper-talk-llama
+/opt/homebrew/bin/whisper-bench
+/opt/homebrew/bin/whisper-command
+/opt/homebrew/bin/whisper-lsp
+/opt/homebrew/bin/whisper-quantize
+/opt/homebrew/bin/whisper-vad-speech-segments
+/opt/homebrew/bin/ffmpeg
+```
+
+### Storage Locations
+
+| Path                    | Content                                                   |
+| ----------------------- | --------------------------------------------------------- |
+| `~/.ollama/models/`     | Downloaded Ollama models (~24 GB currently)               |
+| `~/whisper-models/`     | Whisper GGML model files (empty — proxy blocked download) |
+| `/opt/homebrew/Cellar/` | Brew package binaries                                     |
+
+---
+
+## Network Environment
+
+This machine is on a **corporate network** with a Forcepoint proxy:
+
+- **Proxy:** `http://cso.proxy.att.com:8080/`
+- **SSL Inspection:** Forcepoint CertChecker intercepts HTTPS connections
+- **Impact:**
+  - Ollama model pulls work (Ollama handles proxy natively)
+  - Hugging Face downloads FAIL (curl, Python requests, huggingface_hub all blocked)
+  - Brew installs work (brew handles proxy)
+
+**Workaround:** Download Hugging Face models (e.g., Whisper GGML files) from a personal/home network. See [08-troubleshooting.md](08-troubleshooting.md).
+
+---
+
+## Disk Space Budget
+
+Approximate allocation for local AI tooling:
+
+| Component                                   | Disk Usage |
+| ------------------------------------------- | ---------- |
+| Ollama models (2 installed)                 | ~24 GB     |
+| Whisper models (planned)                    | ~1.6 GB    |
+| Brew packages (ollama, whisper-cpp, ffmpeg) | ~70 MB     |
+| Dashboard app (node_modules)                | ~300 MB    |
+| **Total**                                   | **~26 GB** |
+
+With 10 Ollama models (see [07-model-recommendations.md](07-model-recommendations.md)), expect **~115 GB** total disk usage for models.
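
The total in the disk budget table is the sum of its rows. A small sketch of that arithmetic, using the table's own values converted to GB (70 MB = 0.07 GB, 300 MB = 0.3 GB); `sum_gb` is a hypothetical helper name.

```bash
# Sum a list of sizes given in GB and print the total with one decimal place.
sum_gb() (
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.1f\n", total }'
)

# Ollama models, Whisper models, brew packages, dashboard node_modules
sum_gb 24 1.6 0.07 0.3
```

This comes to roughly 26 GB, matching the table's total.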
--- a/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
+++ b/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
@@ -0,0 +1,182 @@
+# 03 — Whisper.cpp Setup
+
+> Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.
+
+---
+
+## Installation
+
+```bash
+brew install whisper-cpp
+```
+
+- **Version installed:** 1.8.3
+- **Dependency installed:** sdl2 2.32.10 (audio I/O)
+- **Binary location:** `/opt/homebrew/bin/whisper-*`
+
+### Installed Binaries
+
+| Binary                        | Purpose                                          |
+| ----------------------------- | ------------------------------------------------ |
+| `whisper-cli`                 | **Main CLI** — transcribe audio files            |
+| `whisper-server`              | HTTP server — POST audio, get JSON transcription |
+| `whisper-stream`              | **Real-time** microphone streaming transcription |
+| `whisper-talk-llama`          | Voice → Whisper → LLaMA → TTS pipeline           |
+| `whisper-bench`               | Benchmark a model on your hardware               |
+| `whisper-command`             | Voice command detection                          |
+| `whisper-lsp`                 | Language Server Protocol integration             |
+| `whisper-quantize`            | Quantize models to smaller formats               |
+| `whisper-vad-speech-segments` | Voice Activity Detection — split audio by speech |
+
+> **Note:** The binary is `whisper-cli`, NOT `whisper-cpp`. The brew formula name differs from the binary name.
+
+---
+
+## GGML Model Files
+
+Whisper.cpp requires separate GGML model files (not included with brew install).
+
+### Model Size Guide
+
+| Model              | File                          | Disk Size  | Speed     | Accuracy             |
+| ------------------ | ----------------------------- | ---------- | --------- | -------------------- |
+| Tiny (English)     | `ggml-tiny.en.bin`            | 75 MB      | Blazing   | Basic                |
+| Base (English)     | `ggml-base.en.bin`            | 142 MB     | Very fast | Good                 |
+| Medium (English)   | `ggml-medium.en.bin`          | 1.5 GB     | Fast      | Great                |
+| Large v3           | `ggml-large-v3.bin`           | 3.1 GB     | Moderate  | Best                 |
+| **Large v3 Turbo** | **`ggml-large-v3-turbo.bin`** | **1.6 GB** | **Fast**  | **Near-best**        |
+
+**Recommended for M4 Pro:** `ggml-large-v3-turbo`. It stays close to large-v3 accuracy at roughly half the size (1.6 GB vs 3.1 GB) and runs Metal-accelerated.
+
+### Download Models
+
+Models are stored in `~/whisper-models/`.
+
+```bash
+mkdir -p ~/whisper-models
+
+# Recommended: Large v3 Turbo (~1.6 GB)
+curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
+
+# Alternative mirror
+curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+  https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
+```
+
+**Download sources:**
+
+- https://huggingface.co/ggerganov/whisper.cpp/tree/main
+- https://ggml.ggerganov.com/
+
+### Current Status (2026-02-19)
+
+❌ **Model download blocked by corporate proxy** (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See [08-troubleshooting.md](08-troubleshooting.md).
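
When a proxy intercepts a download, `curl` can in some setups exit successfully while saving an HTML block page under the `.bin` name. The check below is a hedged sketch: it assumes the block page is HTML (not confirmed for Forcepoint in this document), and `looks_like_html` is a hypothetical helper name.

```bash
# Return 0 if the first bytes of the file look like an HTML page, 1 otherwise.
looks_like_html() (
  head -c 512 -- "$1" | LC_ALL=C grep -a -q -i -E '<!doctype html|<html'
)

model="$HOME/whisper-models/ggml-large-v3-turbo.bin"
if [ -f "$model" ] && looks_like_html "$model"; then
  echo "warning: $model looks like an HTML page, not a model file" >&2
fi
```

A file that passes this check is not guaranteed to be valid; comparing its size against the ~1.6 GB listed in the model table is a useful second check.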
+
+---
+
+## Audio Format Requirements
+
+Whisper.cpp requires **WAV** format input (16-bit PCM, ideally 16 kHz mono).
+
+### ffmpeg Installation
+
+```bash
+brew install ffmpeg
+```
+
+Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)
+
+### Converting Audio Files
+
+```bash
+# m4a → wav (16kHz mono, optimal for Whisper)
+ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
+
+# mp3 → wav
+ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
+
+# Any format → wav
+ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
+```
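
To convert several recordings at once, the same flags can be wrapped in a small helper. This is a sketch only: `wav_path_for` and `convert_to_wav` are hypothetical names, and `-c:a pcm_s16le` is added here to make the 16-bit PCM output explicit rather than relying on ffmpeg's default WAV codec.

```bash
# Derive the output path: same directory, same base name, .wav extension.
wav_path_for() (
  dir=$(dirname -- "$1")
  base=$(basename -- "$1")
  printf '%s/%s.wav\n' "$dir" "${base%.*}"
)

# Convert one file to 16 kHz mono 16-bit PCM WAV next to the original.
convert_to_wav() (
  ffmpeg -nostdin -loglevel error -y -i "$1" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_path_for "$1")"
)

# Usage (requires ffmpeg):
#   for f in ~/Downloads/*.m4a; do convert_to_wav "$f"; done
wav_path_for "recordings/New Recording.m4a"
```

Quoting `"$1"` throughout keeps file names with spaces, such as `New Recording.m4a`, intact.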
+
+### Tested Conversion (2026-02-19)
+
+```bash
+ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
+       -ar 16000 -ac 1 \
+       '/Users/sd9235/Downloads/recording.wav'
+# Result: 181 KB, 5.80 seconds, 16kHz mono
+```
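
The reported size is consistent with uncompressed PCM arithmetic: duration × sample rate × channels × bytes per sample, plus a header. A quick check, assuming the canonical 44-byte WAV header (ffmpeg may write a slightly larger one with metadata); `wav_bytes` is a hypothetical helper name.

```bash
# Expected size of a PCM WAV file in bytes.
# Args: duration_seconds sample_rate_hz channels bits_per_sample
wav_bytes() (
  awk -v d="$1" -v r="$2" -v c="$3" -v b="$4" \
    'BEGIN { printf "%.0f\n", d * r * c * b / 8 + 44 }'
)

wav_bytes 5.80 16000 1 16
```

For 5.80 seconds of 16 kHz mono 16-bit audio this gives about 185,644 bytes, or roughly 181 KiB, in line with the tested result.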
+
+---
+
+## Usage
+
+### File Transcription
+
+```bash
+whisper-cli \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --language en \
+  --file /path/to/audio.wav
+```
+
+### Real-Time Microphone Streaming
+
+```bash
+whisper-stream \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --language en
+```
+
+This is particularly relevant for **LysnrAI** — real-time mic transcription locally, no Azure Speech SDK needed for dev/testing.
+
+### HTTP Server Mode
+
+```bash
+# Start server on port 8080
+whisper-server \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --port 8080
+
+# POST audio to get transcription
+curl -X POST http://localhost:8080/inference \
+  -F "file=@audio.wav" \
+  -F "response_format=json"
+```
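
If proxy variables such as `http_proxy` are exported in the shell (an assumption; this document does not say how the proxy is configured), `curl` may try to route the `localhost` request through the corporate proxy. A sketch that keeps local addresses off the proxy:

```bash
# Keep local requests off the proxy; preserves any existing NO_PROXY entries.
export NO_PROXY="localhost,127.0.0.1${NO_PROXY:+,$NO_PROXY}"
export no_proxy="$NO_PROXY"
```

For a single request, passing `--noproxy localhost` to `curl` has the same effect.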
+
+### Voice Activity Detection
+
+```bash
+whisper-vad-speech-segments \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --file recording.wav
+```
+
+### Benchmarking
+
+```bash
+whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
+```
+
+---
+
+## Integration with LysnrAI
+
+The local Whisper stack can serve as an **offline fallback** or **dev replacement** for Azure Speech SDK:
+
+| Component          | Azure (production) | Whisper.cpp (local dev)   |
+| ------------------ | ------------------ | ------------------------- |
+| Real-time STT      | Azure Speech SDK   | `whisper-stream`          |
+| File transcription | Azure Batch        | `whisper-cli`             |
+| HTTP API           | Azure REST API     | `whisper-server`          |
+| Cost               | Pay-per-use        | $0.00 (local)             |
+| Latency            | ~200ms (network)   | ~50ms (local, estimate)   |
+| Languages          | 100+               | 100+ (same Whisper model) |
+
+### Potential Integration Points
+
+1. **Desktop app** (`src/audio/azure_stt.py`): Add local Whisper backend option
+2. **iOS keyboard** (`LysnrKeyboard/`): Use on-device Whisper for offline dictation
+3. **Extraction service evals**: Transcribe test audio fixtures locally
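
One way to read the comparison table is as a dispatch from task to local binary. A minimal sketch in shell; `local_stt_binary` and the task names are hypothetical, and an actual desktop integration would more likely live in Python alongside `src/audio/azure_stt.py`.

```bash
# Map an STT task to the local whisper.cpp binary from the comparison table.
local_stt_binary() (
  case "$1" in
    realtime) echo "whisper-stream" ;;
    file)     echo "whisper-cli" ;;
    http)     echo "whisper-server" ;;
    *)        echo "unknown task: $1" >&2; exit 1 ;;
  esac
)

local_stt_binary file
```

Keeping the mapping in one place makes it easier to switch between the Azure and local backends with a single setting.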
--- a/__LOCAL_LLMs/docs/README.md
+++ b/__LOCAL_LLMs/docs/README.md
@@ -0,0 +1,86 @@
+# Local LLM Stack — Documentation Index
+
+> Complete guide for the local AI inference stack on the ByteLyst development machine.
+> Hardware: **Apple M4 Pro · 48 GB LPDDR5 · macOS Tahoe**
+> Last updated: 2026-02-19
+
+---
+
+## Quick Start
+
+```bash
+# 1. Start Ollama
+ollama serve                    # or: brew services start ollama
+
+# 2. Load a model
+ollama run qwen2.5-coder:32b   # best coding model for this hardware
+
+# 3. Launch Mission Control dashboard
+cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
+# Open http://localhost:3100
+```
+
+---
+
+## Documentation
+
+| #   | Document                                                     | Description                                                          |
+| --- | ------------------------------------------------------------ | -------------------------------------------------------------------- |
+| 01  | [Hardware & Prerequisites](01-hardware-and-prerequisites.md) | Machine specs, installed toolchain, disk/RAM budget                  |
+| 02  | [Ollama Setup & Models](02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior       |
+| 03  | [Whisper.cpp Setup](03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI usage, real-time streaming |
+| 04  | [Multimodal Local Stack](04-multimodal-local-stack.md)       | Vision models, audio pipeline, video understanding status            |
+| 05  | [Mission Control Dashboard](05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features, running       |
+| 06  | [Extraction Service Evals](06-extraction-service-evals.md)   | promptfoo eval suite, Ollama vs Gemini comparison, Python sidecar    |
+| 07  | [Model Recommendations](07-model-recommendations.md)         | Tiered model guide by use case, size, and quality for M4 Pro 48GB    |
+| 08  | [Troubleshooting & Corporate Proxy](08-troubleshooting.md)   | Common issues, Forcepoint proxy workarounds, MLX warnings            |
+| 09  | [Environment Variables](09-environment-variables.md)         | All config vars for Ollama, Whisper, dashboard, evals                |
+
+---
+
+## Directory Structure
+
+```
+__LOCAL_LLMs/
+├── README.md                        ← top-level README (moved from LOCAL_LLMs_setup_mac_m4_48gb.md)
+├── docs/
+│   ├── README.md                    ← this index
+│   ├── 01-hardware-and-prerequisites.md
+│   ├── 02-ollama-setup-and-models.md
+│   ├── 03-whisper-cpp-setup.md
+│   ├── 04-multimodal-local-stack.md
+│   ├── 05-mission-control-dashboard.md
+│   ├── 06-extraction-service-evals.md
+│   ├── 07-model-recommendations.md
+│   ├── 08-troubleshooting.md
+│   └── 09-environment-variables.md
+├── dashboard/                       ← Next.js Mission Control app (port 3100)
+│   ├── src/app/page.tsx             ← main dashboard UI
+│   ├── src/app/api/ollama/route.ts  ← Ollama API proxy (list, load, unload, generate)
+│   ├── src/app/api/whisper/route.ts ← Whisper binary/model discovery
+│   └── src/app/api/system/route.ts  ← System info (chip, RAM, disk, brew)
+└── LOCAL_LLMs_setup_mac_m4_48gb.md  ← original doc (preserved, see docs/ for latest)
+```
+
+---
+
+## Current Installation Status (2026-02-19)
+
+| Component                           | Version    | Status                        | Disk Usage |
+| ----------------------------------- | ---------- | ----------------------------- | ---------- |
+| Ollama                              | 0.16.2     | ✅ Installed via brew         | —          |
+| qwen2.5-coder:32b                   | —          | ✅ Downloaded                 | 19 GB      |
+| llama3.1:8b                         | —          | ✅ Downloaded                 | 4.9 GB     |
+| whisper-cpp                         | 1.8.3      | ✅ Installed via brew         | 9.6 MB     |
+| whisper model (ggml-large-v3-turbo) | —          | ❌ Blocked by corporate proxy | —          |
+| ffmpeg                              | 8.0.1      | ✅ Installed via brew         | 53.3 MB    |
+| Mission Control Dashboard           | Next.js 16 | ✅ Built, runs on :3100       | —          |
+
+---
+
+## Related Resources
+
+- **Extraction service evals:** `services/extraction-service/evals/`
+- **Ollama REST API docs:** https://github.com/ollama/ollama/blob/main/docs/api.md
+- **Whisper.cpp:** https://github.com/ggerganov/whisper.cpp
+- **Hugging Face models:** https://huggingface.co/ggerganov/whisper.cpp/tree/main