diff --git a/__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md b/__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md
new file mode 100644
index 00000000..e4175a07
--- /dev/null
+++ b/__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md
@@ -0,0 +1,110 @@
+# 01 — Hardware & Prerequisites
+
+> Machine specs, installed toolchain, and resource budgets for local LLM inference.
+
+---
+
+## Hardware Specs
+
+| Component | Value |
+| ----------------------- | ---------------------------------------- |
+| **Model** | MacBook Pro (Mac16,7) |
+| **Model Number** | Z1FU0002HLL/A |
+| **Chip** | Apple M4 Pro |
+| **CPU Cores** | 14 (10 Performance + 4 Efficiency) |
+| **GPU** | Apple Silicon integrated (Metal backend) |
+| **Neural Engine** | 16-core |
+| **Memory** | 48 GB LPDDR5 (unified, shared CPU/GPU) |
+| **Memory Manufacturer** | Micron |
+| **OS** | macOS Tahoe (arm64) |
+| **Serial** | KX6VMGJWM6 |
+
+### Why This Hardware Matters for LLMs
+
+Apple Silicon's **unified memory architecture** means the GPU and CPU share the same 48 GB pool. This is ideal for LLM inference because:
+
+1. No PCIe bottleneck copying weights between CPU RAM and VRAM
+2. Models up to ~45 GB can run entirely "on GPU" via Metal
+3. Ollama uses `llama.cpp` under the hood, which has excellent Metal backend support
+4. The M4 Pro Neural Engine further accelerates certain operations
+
+### What You Can Run
+
+| RAM Budget | Model Size | Examples |
+| ---------- | --------------- | -------------------------------------------------- |
+| 5-8 GB | 7B models | qwen2.5-coder:7b, llama3.1:8b, deepseek-coder:6.7b |
+| 10-14 GB | 13-16B models | deepseek-coder-v2:16b, codestral:22b, phi4:14b |
+| 20-24 GB | 32B models | qwen2.5-coder:32b, deepseek-r1:32b |
+| 40-45 GB | 70B models (Q4) | llama3.1:70b — tight, leaves little headroom |
+
+**Rule of thumb:** Keep at least 6-8 GB free for macOS + dev tools (Xcode, VS Code, Docker, etc.).
+
+---
+
+## Installed Toolchain
+
+Verified on 2026-02-19.
+
+### Brew Packages
+
+| Package | Version | Purpose |
+| ------------- | ------- | ------------------------------------------ |
+| `ollama` | 0.16.2 | LLM inference server (llama.cpp + Metal) |
+| `whisper-cpp` | 1.8.3 | Local speech-to-text (Whisper GGML) |
+| `ffmpeg` | 8.0.1 | Audio/video format conversion |
+| `sdl2` | 2.32.10 | Audio I/O library (whisper-cpp dependency) |
+
+### Key Binaries
+
+```
+/opt/homebrew/bin/ollama
+/opt/homebrew/bin/whisper-cli
+/opt/homebrew/bin/whisper-server
+/opt/homebrew/bin/whisper-stream
+/opt/homebrew/bin/whisper-talk-llama
+/opt/homebrew/bin/whisper-bench
+/opt/homebrew/bin/whisper-command
+/opt/homebrew/bin/whisper-lsp
+/opt/homebrew/bin/whisper-quantize
+/opt/homebrew/bin/whisper-vad-speech-segments
+/opt/homebrew/bin/ffmpeg
+```
+
+### Storage Locations
+
+| Path | Content |
+| ----------------------- | --------------------------------------------------------- |
+| `~/.ollama/models/` | Downloaded Ollama models (~24 GB currently) |
+| `~/whisper-models/` | Whisper GGML model files (empty — proxy blocked download) |
+| `/opt/homebrew/Cellar/` | Brew package binaries |
+
+---
+
+## Network Environment
+
+This machine is on a **corporate network** with a Forcepoint proxy:
+
+- **Proxy:** `http://cso.proxy.att.com:8080/`
+- **SSL Inspection:** Forcepoint CertChecker intercepts HTTPS connections
+- **Impact:**
+  - Ollama model pulls work (Ollama handles proxy natively)
+  - Hugging Face downloads FAIL (curl, Python requests, huggingface_hub all blocked)
+  - Brew installs work (brew handles proxy)
+
+**Workaround:** Download Hugging Face models (e.g., Whisper GGML files) from a personal/home network. See [08-troubleshooting.md](08-troubleshooting.md).
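When a download fails behind the proxy, it can help to see which proxy variables the shell exports and what HTTP status a given host returns. The sketch below is one possible diagnostic, not part of the original setup: the function names `proxy_env_summary` and `probe_url` are hypothetical, and it assumes `curl` and `printenv` are on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for diagnosing proxy problems (names are illustrative).

# Print each common proxy variable and whether it is set in this shell.
proxy_env_summary() {
  local name value
  for name in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
    if value="$(printenv "$name")"; then
      printf '%s=%s\n' "$name" "$value"
    else
      printf '%s is unset\n' "$name"
    fi
  done
}

# Print the HTTP status code for a URL. curl reports 000 when it never
# received an HTTP response (for example a TLS or proxy failure).
probe_url() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}\n' "$1"
}
```

A typical session would run `proxy_env_summary` first, then `probe_url https://huggingface.co` to compare the result with a host that is known to work.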
+
+---
+
+## Disk Space Budget
+
+Approximate allocation for local AI tooling:
+
+| Component | Disk Usage |
+| ------------------------------------------- | ---------- |
+| Ollama models (2 installed) | ~24 GB |
+| Whisper models (planned) | ~1.6 GB |
+| Brew packages (ollama, whisper-cpp, ffmpeg) | ~70 MB |
+| Dashboard app (node_modules) | ~300 MB |
+| **Total** | **~26 GB** |
+
+With 10 Ollama models (see [07-model-recommendations.md](07-model-recommendations.md)), expect **~115 GB** total disk usage for models.
diff --git a/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md b/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
new file mode 100644
index 00000000..7a0a9bc0
--- /dev/null
+++ b/__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
@@ -0,0 +1,182 @@
+# 03 — Whisper.cpp Setup
+
+> Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.
+
+---
+
+## Installation
+
+```bash
+brew install whisper-cpp
+```
+
+- **Version installed:** 1.8.3
+- **Dependency installed:** sdl2 2.32.10 (audio I/O)
+- **Binary location:** `/opt/homebrew/bin/whisper-*`
+
+### Installed Binaries
+
+| Binary | Purpose |
+| ----------------------------- | ------------------------------------------------ |
+| `whisper-cli` | **Main CLI** — transcribe audio files |
+| `whisper-server` | HTTP server — POST audio, get JSON transcription |
+| `whisper-stream` | **Real-time** microphone streaming transcription |
+| `whisper-talk-llama` | Voice → Whisper → LLaMA → TTS pipeline |
+| `whisper-bench` | Benchmark a model on your hardware |
+| `whisper-command` | Voice command detection |
+| `whisper-lsp` | Language Server Protocol integration |
+| `whisper-quantize` | Quantize models to smaller formats |
+| `whisper-vad-speech-segments` | Voice Activity Detection — split audio by speech |
+
+> **Note:** The binary is `whisper-cli`, NOT `whisper-cpp`. The brew formula name differs from the binary name.
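Because the formula name and the binary names differ, a quick PATH check can confirm what is actually installed. The following is a minimal sketch; `check_binaries` is a hypothetical helper name, and it relies only on the shell built-in `command -v`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given binaries resolve on PATH.
# Returns 0 when all are found, 1 when at least one is missing.
check_binaries() {
  local missing=0 bin
  for bin in "$@"; do
    if command -v "$bin" >/dev/null 2>&1; then
      printf 'ok       %s -> %s\n' "$bin" "$(command -v "$bin")"
    else
      printf 'missing  %s\n' "$bin"
      missing=1
    fi
  done
  return "$missing"
}
```

For the binaries listed in the table, a call might look like `check_binaries whisper-cli whisper-server whisper-stream whisper-bench ffmpeg ollama`.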
+
+---
+
+## GGML Model Files
+
+Whisper.cpp requires separate GGML model files (not included with brew install).
+
+### Model Size Guide
+
+| Model | File | Disk Size | Speed | Accuracy |
+| ------------------ | ----------------------------- | ---------- | --------- | -------------------- |
+| Tiny (English) | `ggml-tiny.en.bin` | 75 MB | Blazing | Basic |
+| Base (English) | `ggml-base.en.bin` | 142 MB | Very fast | Good |
+| Medium (English) | `ggml-medium.en.bin` | 1.5 GB | Fast | Great |
+| Large v3 | `ggml-large-v3.bin` | 3.1 GB | Moderate | Best |
+| **Large v3 Turbo** | **`ggml-large-v3-turbo.bin`** | **1.6 GB** | **Fast** | **Best (distilled)** |
+
+**Recommended for M4 Pro:** `ggml-large-v3-turbo` — best accuracy at half the size of large-v3, Metal-accelerated.
+
+### Download Models
+
+Models are stored in `~/whisper-models/`.
+
+```bash
+mkdir -p ~/whisper-models
+
+# Recommended: Large v3 Turbo (~1.6 GB)
+curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
+
+# Alternative mirror
+curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+  https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
+```
+
+**Download sources:**
+
+- https://huggingface.co/ggerganov/whisper.cpp/tree/main
+- https://ggml.ggerganov.com/
+
+### Current Status (2026-02-19)
+
+❌ **Model download blocked by corporate proxy** (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See [08-troubleshooting.md](08-troubleshooting.md).
+
+---
+
+## Audio Format Requirements
+
+Whisper.cpp requires **WAV** format input (16-bit PCM, ideally 16 kHz mono).
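The format requirement can be checked before transcription by reading the WAV header directly. The sketch below assumes a canonical header in which the `fmt ` chunk immediately follows the RIFF header, so channels, sample rate, and bit depth sit at byte offsets 22, 24, and 34; files with a different chunk order would need a real parser. The helper names `le_uint` and `wav_is_whisper_ready` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume a canonical WAV header ("fmt " chunk at offset 12).

# Read COUNT bytes at OFFSET from FILE and print them as a little-endian unsigned integer.
le_uint() {
  local file=$1 offset=$2 count=$3 value=0 shift_bits=0 byte
  for byte in $(od -An -tu1 -j "$offset" -N "$count" "$file"); do
    value=$(( value + (byte << shift_bits) ))
    shift_bits=$(( shift_bits + 8 ))
  done
  printf '%s\n' "$value"
}

# Print the header fields and succeed only for 16-bit, 16 kHz, mono audio.
wav_is_whisper_ready() {
  local file=$1 channels rate bits
  channels=$(le_uint "$file" 22 2)
  rate=$(le_uint "$file" 24 4)
  bits=$(le_uint "$file" 34 2)
  printf 'channels=%s rate=%s bits=%s\n' "$channels" "$rate" "$bits"
  [ "$channels" -eq 1 ] && [ "$rate" -eq 16000 ] && [ "$bits" -eq 16 ]
}
```

The check is intentionally strict: it treats the "ideally 16 kHz mono" guidance as a requirement, which may be tighter than what `whisper-cli` actually accepts.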
+
+### ffmpeg Installation
+
+```bash
+brew install ffmpeg
+```
+
+Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)
+
+### Converting Audio Files
+
+```bash
+# m4a → wav (16kHz mono, optimal for Whisper)
+ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
+
+# mp3 → wav
+ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
+
+# Any format → wav
+ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
+```
+
+### Tested Conversion (2026-02-19)
+
+```bash
+ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
+  -ar 16000 -ac 1 \
+  '/Users/sd9235/Downloads/recording.wav'
+# Result: 181 KB, 5.80 seconds, 16kHz mono
+```
+
+---
+
+## Usage
+
+### File Transcription
+
+```bash
+whisper-cli \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --language en \
+  --file /path/to/audio.wav
+```
+
+### Real-Time Microphone Streaming
+
+```bash
+whisper-stream \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --language en
+```
+
+This is particularly relevant for **LysnrAI** — real-time mic transcription locally, no Azure Speech SDK needed for dev/testing.
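The conversion and file-transcription commands above can be chained into one step. This is a rough sketch rather than a tested workflow: `WHISPER_MODEL`, `wav_path_for`, and `transcribe_any` are hypothetical names, the `ffmpeg` and `whisper-cli` flags are the ones already used in this document plus `-y` (overwrite) and `-loglevel error`, and the model file must already be present.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: convert to 16 kHz mono WAV if needed, then transcribe.

# WHISPER_MODEL is an illustrative variable name, not one defined by whisper.cpp.
WHISPER_MODEL="${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}"

# Derive the .wav path that sits next to the input file.
wav_path_for() {
  local dir base
  dir=$(dirname "$1")
  base=$(basename "$1")
  printf '%s/%s.wav\n' "$dir" "${base%.*}"
}

transcribe_any() {
  local input=$1 wav
  case "$input" in
    *.wav)
      # Passed through as-is; an existing WAV is not resampled here.
      wav=$input
      ;;
    *)
      wav=$(wav_path_for "$input")
      ffmpeg -y -loglevel error -i "$input" -ar 16000 -ac 1 "$wav" || return 1
      ;;
  esac
  whisper-cli --model "$WHISPER_MODEL" --language en --file "$wav"
}
```

Usage would be `transcribe_any ~/Downloads/recording.m4a`. Writing the WAV next to the source keeps the intermediate file easy to find, at the cost of overwriting any file that already has that name.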
+
+### HTTP Server Mode
+
+```bash
+# Start server on port 8080
+whisper-server \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --port 8080
+
+# POST audio to get transcription
+curl -X POST http://localhost:8080/inference \
+  -F "file=@audio.wav" \
+  -F "response_format=json"
+```
+
+### Voice Activity Detection
+
+```bash
+whisper-vad-speech-segments \
+  --model ~/whisper-models/ggml-large-v3-turbo.bin \
+  --file recording.wav
+```
+
+### Benchmarking
+
+```bash
+whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
+```
+
+---
+
+## Integration with LysnrAI
+
+The local Whisper stack can serve as an **offline fallback** or **dev replacement** for Azure Speech SDK:
+
+| Component | Azure (production) | Whisper.cpp (local dev) |
+| ------------------ | ------------------ | ------------------------- |
+| Real-time STT | Azure Speech SDK | `whisper-stream` |
+| File transcription | Azure Batch | `whisper-cli` |
+| HTTP API | Azure REST API | `whisper-server` |
+| Cost | Pay-per-use | $0.00 (local) |
+| Latency | ~200ms (network) | ~50ms (local Metal) |
+| Languages | 100+ | 100+ (same Whisper model) |
+
+### Potential Integration Points
+
+1. **Desktop app** (`src/audio/azure_stt.py`): Add local Whisper backend option
+2. **iOS keyboard** (`LysnrKeyboard/`): Use on-device Whisper for offline dictation
+3. **Extraction service evals**: Transcribe test audio fixtures locally
diff --git a/__LOCAL_LLMs/docs/README.md b/__LOCAL_LLMs/docs/README.md
new file mode 100644
index 00000000..95d76514
--- /dev/null
+++ b/__LOCAL_LLMs/docs/README.md
@@ -0,0 +1,86 @@
+# Local LLM Stack — Documentation Index
+
+> Complete guide for the local AI inference stack on the ByteLyst development machine.
+> Hardware: **Apple M4 Pro · 48 GB LPDDR5 · macOS Tahoe**
+> Last updated: 2026-02-19
+
+---
+
+## Quick Start
+
+```bash
+# 1. Start Ollama
+ollama serve  # or: brew services start ollama
+
+# 2. Load a model
+ollama run qwen2.5-coder:32b  # best coding model for this hardware
+
+# 3. Launch Mission Control dashboard
+cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
+# Open http://localhost:3100
+```
+
+---
+
+## Documentation
+
+| # | Document | Description |
+| --- | ------------------------------------------------------------ | -------------------------------------------------------------------- |
+| 01 | [Hardware & Prerequisites](01-hardware-and-prerequisites.md) | Machine specs, installed toolchain, disk/RAM budget |
+| 02 | [Ollama Setup & Models](02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior |
+| 03 | [Whisper.cpp Setup](03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI usage, real-time streaming |
+| 04 | [Multimodal Local Stack](04-multimodal-local-stack.md) | Vision models, audio pipeline, video understanding status |
+| 05 | [Mission Control Dashboard](05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features, running |
+| 06 | [Extraction Service Evals](06-extraction-service-evals.md) | promptfoo eval suite, Ollama vs Gemini comparison, Python sidecar |
+| 07 | [Model Recommendations](07-model-recommendations.md) | Tiered model guide by use case, size, and quality for M4 Pro 48GB |
+| 08 | [Troubleshooting & Corporate Proxy](08-troubleshooting.md) | Common issues, Forcepoint proxy workarounds, MLX warnings |
+| 09 | [Environment Variables](09-environment-variables.md) | All config vars for Ollama, Whisper, dashboard, evals |
+
+---
+
+## Directory Structure
+
+```
+__LOCAL_LLMs/
+├── README.md ← you are here (moved from LOCAL_LLMs_setup_mac_m4_48gb.md)
+├── docs/
+│   ├── README.md ← this index
+│   ├── 01-hardware-and-prerequisites.md
+│   ├── 02-ollama-setup-and-models.md
+│   ├── 03-whisper-cpp-setup.md
+│   ├── 04-multimodal-local-stack.md
+│   ├── 05-mission-control-dashboard.md
+│   ├── 06-extraction-service-evals.md
+│   ├── 07-model-recommendations.md
+│   ├── 08-troubleshooting.md
+│   └── 09-environment-variables.md
+├── dashboard/ ← Next.js Mission Control app (port 3100)
+│   ├── src/app/page.tsx ← main dashboard UI
+│   ├── src/app/api/ollama/route.ts ← Ollama API proxy (list, load, unload, generate)
+│   ├── src/app/api/whisper/route.ts ← Whisper binary/model discovery
+│   └── src/app/api/system/route.ts ← System info (chip, RAM, disk, brew)
+└── LOCAL_LLMs_setup_mac_m4_48gb.md ← original doc (preserved, see docs/ for latest)
+```
+
+---
+
+## Current Installation Status (2026-02-19)
+
+| Component | Version | Status | Disk Usage |
+| ----------------------------------- | ---------- | ----------------------------- | ---------- |
+| Ollama | 0.16.2 | ✅ Installed via brew | — |
+| qwen2.5-coder:32b | — | ✅ Downloaded | 19 GB |
+| llama3.1:8b | — | ✅ Downloaded | 4.9 GB |
+| whisper-cpp | 1.8.3 | ✅ Installed via brew | 9.6 MB |
+| whisper model (ggml-large-v3-turbo) | — | ❌ Blocked by corporate proxy | — |
+| ffmpeg | 8.0.1 | ✅ Installed via brew | 53.3 MB |
+| Mission Control Dashboard | Next.js 16 | ✅ Built, runs on :3100 | — |
+
+---
+
+## Related Resources
+
+- **Extraction service evals:** `services/extraction-service/evals/`
+- **Ollama REST API docs:** https://github.com/ollama/ollama/blob/main/docs/api.md
+- **Whisper.cpp:** https://github.com/ggerganov/whisper.cpp
+- **Hugging Face models:** https://huggingface.co/ggerganov/whisper.cpp/tree/main
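The Ollama REST API linked above exposes `GET /api/tags`, which lists locally installed models, so the status table can be cross-checked from the shell. The sketch below is an assumption-laden convenience: `OLLAMA_URL`, `model_names`, and `list_ollama_models` are hypothetical names, the default port 11434 is Ollama's standard one, and the name extraction is a rough `grep`/`sed` match rather than a real JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for listing installed Ollama models.

# OLLAMA_URL is an illustrative variable name, not an official Ollama setting.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Extract every "name" value from JSON on stdin (pattern match, not a JSON parser).
model_names() {
  grep -oE '"name"[[:space:]]*:[[:space:]]*"[^"]*"' | sed -E 's/.*"([^"]*)"$/\1/'
}

# Query the running server; prints nothing if the server is unreachable.
list_ollama_models() {
  curl --silent --fail --max-time 5 "$OLLAMA_URL/api/tags" | model_names
}
```

With `ollama serve` running, `list_ollama_models` should print one model tag per line; a tool such as `jq` would be a more robust choice where it is available.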