docs(local-llm): add docs index, hardware specs, and whisper-cpp setup

- docs/README.md: documentation index with quick start, file structure, status table
- docs/01-hardware-and-prerequisites.md: M4 Pro 48GB specs, toolchain inventory, disk budget, network environment (Forcepoint proxy details)
- docs/03-whisper-cpp-setup.md: whisper-cpp installation, GGML model guide, ffmpeg audio conversion, CLI usage, real-time streaming, LysnrAI integration
parent 798a85e88b
commit 464ffb92ec
110
__LOCAL_LLMs/docs/01-hardware-and-prerequisites.md
Normal file
@@ -0,0 +1,110 @@
# 01 — Hardware & Prerequisites

> Machine specs, installed toolchain, and resource budgets for local LLM inference.

---

## Hardware Specs

| Component               | Value                                    |
| ----------------------- | ---------------------------------------- |
| **Model**               | MacBook Pro (Mac16,7)                    |
| **Model Number**        | Z1FU0002HLL/A                            |
| **Chip**                | Apple M4 Pro                             |
| **CPU Cores**           | 14 (10 Performance + 4 Efficiency)       |
| **GPU**                 | Apple Silicon integrated (Metal backend) |
| **Neural Engine**       | 16-core                                  |
| **Memory**              | 48 GB LPDDR5X (unified, shared CPU/GPU)  |
| **Memory Manufacturer** | Micron                                   |
| **OS**                  | macOS Tahoe (arm64)                      |
| **Serial**              | KX6VMGJWM6                               |

### Why This Hardware Matters for LLMs

Apple Silicon's **unified memory architecture** means the GPU and CPU share the same 48 GB pool. This is ideal for LLM inference because:

1. No PCIe bottleneck copying weights between CPU RAM and VRAM
2. Models up to ~45 GB can in principle run entirely "on GPU" via Metal, though macOS caps GPU-wired memory below total RAM by default, so the practical ceiling may be lower
3. Ollama uses `llama.cpp` under the hood, which has excellent Metal backend support
4. The Neural Engine is not used by Ollama or `llama.cpp` (inference runs on the GPU via Metal); whisper.cpp can use it for the encoder only when built with Core ML support

### What You Can Run

| RAM Budget | Model Size      | Examples                                           |
| ---------- | --------------- | -------------------------------------------------- |
| 5-8 GB     | 7-8B models     | qwen2.5-coder:7b, llama3.1:8b, deepseek-coder:6.7b |
| 10-14 GB   | 14-22B models   | deepseek-coder-v2:16b, codestral:22b, phi4:14b     |
| 20-24 GB   | 32B models      | qwen2.5-coder:32b, deepseek-r1:32b                 |
| 40-45 GB   | 70B models (Q4) | llama3.1:70b — tight, leaves little headroom       |

**Rule of thumb:** Keep at least 6-8 GB free for macOS + dev tools (Xcode, VS Code, Docker, etc.).
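
The budgets in the table follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch, assuming about 4.5 bits per weight for a Q4-style quantization (an assumption; actual formats vary) and ignoring KV cache and runtime overhead. The function names are illustrative, not part of any tool:

```bash
# Rough weight-memory estimate in GB: params (billions) x bits per weight / 8.
# Assumption: KV cache and runtime overhead are NOT included.
estimate_model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds if the estimate fits in total RAM minus a reserve for macOS and dev tools.
# Args: params_billions bits_per_weight total_gb reserve_gb
fits_in_budget() {
  awk -v p="$1" -v b="$2" -v t="$3" -v r="$4" \
    'BEGIN { exit !((p * b / 8) <= (t - r)) }'
}
```

For example, `estimate_model_gb 32 4.5` prints `18.0`, which sits just under the 20-24 GB budget listed for 32B models once context memory is added, and `fits_in_budget 70 4.5 48 8` only just succeeds, matching the "tight" note on the 70B row.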

---

## Installed Toolchain

Verified on 2026-02-19.

### Brew Packages

| Package       | Version | Purpose                                    |
| ------------- | ------- | ------------------------------------------ |
| `ollama`      | 0.16.2  | LLM inference server (llama.cpp + Metal)   |
| `whisper-cpp` | 1.8.3   | Local speech-to-text (Whisper GGML)        |
| `ffmpeg`      | 8.0.1   | Audio/video format conversion              |
| `sdl2`        | 2.32.10 | Audio I/O library (whisper-cpp dependency) |

### Key Binaries

```
/opt/homebrew/bin/ollama
/opt/homebrew/bin/whisper-cli
/opt/homebrew/bin/whisper-server
/opt/homebrew/bin/whisper-stream
/opt/homebrew/bin/whisper-talk-llama
/opt/homebrew/bin/whisper-bench
/opt/homebrew/bin/whisper-command
/opt/homebrew/bin/whisper-lsp
/opt/homebrew/bin/whisper-quantize
/opt/homebrew/bin/whisper-vad-speech-segments
/opt/homebrew/bin/ffmpeg
```
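
To re-verify this inventory on another machine or after an upgrade, a small check along these lines may help. It is a sketch only: `check_binaries` is a hypothetical helper name, and it tests just that each command resolves on `PATH`, not its version.

```bash
# Print any of the given commands that are not on PATH.
# Returns non-zero if at least one is missing.
check_binaries() {
  missing=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing: $bin"
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use: `check_binaries ollama whisper-cli whisper-server whisper-stream ffmpeg`.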

### Storage Locations

| Path                    | Content                                                   |
| ----------------------- | --------------------------------------------------------- |
| `~/.ollama/models/`     | Downloaded Ollama models (~24 GB currently)               |
| `~/whisper-models/`     | Whisper GGML model files (empty — proxy blocked download) |
| `/opt/homebrew/Cellar/` | Brew package binaries                                     |

---

## Network Environment

This machine is on a **corporate network** with a Forcepoint proxy:

- **Proxy:** `http://cso.proxy.att.com:8080/`
- **SSL Inspection:** Forcepoint CertChecker intercepts HTTPS connections
- **Impact:**
  - Ollama model pulls work (Ollama handles proxy natively)
  - Hugging Face downloads FAIL (curl, Python requests, huggingface_hub all blocked)
  - Brew installs work (brew handles proxy)

**Workaround:** Download Hugging Face models (e.g., Whisper GGML files) from a personal/home network. See [08-troubleshooting.md](08-troubleshooting.md).

---

## Disk Space Budget

Approximate allocation for local AI tooling:

| Component                                   | Disk Usage |
| ------------------------------------------- | ---------- |
| Ollama models (2 installed)                 | ~24 GB     |
| Whisper models (planned)                    | ~1.6 GB    |
| Brew packages (ollama, whisper-cpp, ffmpeg) | ~70 MB     |
| Dashboard app (node_modules)                | ~300 MB    |
| **Total**                                   | **~26 GB** |

With 10 Ollama models (see [07-model-recommendations.md](07-model-recommendations.md)), expect **~115 GB** total disk usage for models.
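
The figures above are approximate; to measure actual usage, something like the following sketch could be used. `dir_size_mb` is a hypothetical helper, and it reports disk blocks as counted by `du`, so results can differ slightly from the sizes shown by `ollama list`.

```bash
# Print the on-disk size of a directory in whole megabytes (0 if it does not exist).
dir_size_mb() {
  if [ -d "$1" ]; then
    du -sk "$1" | awk '{ printf "%d\n", $1 / 1024 }'
  else
    echo 0
  fi
}

# Example (not run here):
#   for d in "$HOME/.ollama/models" "$HOME/whisper-models"; do
#     printf '%s\t%s MB\n' "$d" "$(dir_size_mb "$d")"
#   done
```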
182
__LOCAL_LLMs/docs/03-whisper-cpp-setup.md
Normal file
@@ -0,0 +1,182 @@
# 03 — Whisper.cpp Setup

> Local speech-to-text: installation, GGML models, CLI usage, real-time streaming, and ffmpeg.

---

## Installation

```bash
brew install whisper-cpp
```

- **Version installed:** 1.8.3
- **Dependency installed:** sdl2 2.32.10 (audio I/O)
- **Binary location:** `/opt/homebrew/bin/whisper-*`

### Installed Binaries

| Binary                        | Purpose                                          |
| ----------------------------- | ------------------------------------------------ |
| `whisper-cli`                 | **Main CLI** — transcribe audio files            |
| `whisper-server`              | HTTP server — POST audio, get JSON transcription |
| `whisper-stream`              | **Real-time** microphone streaming transcription |
| `whisper-talk-llama`          | Voice → Whisper → LLaMA → TTS pipeline           |
| `whisper-bench`               | Benchmark a model on your hardware               |
| `whisper-command`             | Voice command detection                          |
| `whisper-lsp`                 | Language Server Protocol integration             |
| `whisper-quantize`            | Quantize models to smaller formats               |
| `whisper-vad-speech-segments` | Voice Activity Detection — split audio by speech |

> **Note:** The binary is `whisper-cli`, NOT `whisper-cpp`. The brew formula name differs from the binary name.

---

## GGML Model Files

Whisper.cpp requires separate GGML model files (not included with brew install).

### Model Size Guide

| Model              | File                          | Disk Size  | Speed     | Accuracy          |
| ------------------ | ----------------------------- | ---------- | --------- | ----------------- |
| Tiny (English)     | `ggml-tiny.en.bin`            | 75 MB      | Blazing   | Basic             |
| Base (English)     | `ggml-base.en.bin`            | 142 MB     | Very fast | Good              |
| Medium (English)   | `ggml-medium.en.bin`          | 1.5 GB     | Fast      | Great             |
| Large v3           | `ggml-large-v3.bin`           | 3.1 GB     | Moderate  | Best              |
| **Large v3 Turbo** | **`ggml-large-v3-turbo.bin`** | **1.6 GB** | **Fast**  | **Near large-v3** |

**Recommended for M4 Pro:** `ggml-large-v3-turbo`, which gives close to large-v3 accuracy at about half the size and runs Metal-accelerated.

### Download Models

Models are stored in `~/whisper-models/`.

```bash
mkdir -p ~/whisper-models

# Recommended: Large v3 Turbo (~1.6 GB)
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Alternative mirror
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://ggml.ggerganov.com/whisper/ggml-large-v3-turbo.bin
```

**Download sources:**

- https://huggingface.co/ggerganov/whisper.cpp/tree/main
- https://ggml.ggerganov.com/

### Current Status (2026-02-19)

❌ **Model download blocked by corporate proxy** (Forcepoint CertChecker intercepts Hugging Face HTTPS). Download from personal/home network required. See [08-troubleshooting.md](08-troubleshooting.md).
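
An intercepted download can leave behind an empty file or a small HTML error page instead of the model (this is an assumption about how the proxy fails, not something recorded in the test above). A minimal sanity check, with `looks_like_model` as a hypothetical helper and the size bound chosen by the caller:

```bash
# Heuristic check that a downloaded file is a model and not a proxy error page.
# Args: file, minimum expected size in bytes (pick a bound well below the real size).
looks_like_model() {
  file="$1"; min_bytes="$2"
  [ -f "$file" ] || { echo "not found: $file"; return 1; }
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small: $size bytes"; return 1
  fi
  # Error pages are HTML; a binary model file should not start with '<'.
  if [ "$(head -c 1 "$file" | tr -d '\000')" = "<" ]; then
    echo "looks like HTML, not a model"; return 1
  fi
  return 0
}
```

For the ~1.6 GB turbo model, a call such as `looks_like_model ~/whisper-models/ggml-large-v3-turbo.bin 1000000000` would reject anything under 1 GB. It does not verify the file's integrity; it only catches the obvious failure cases.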

---

## Audio Format Requirements

Whisper.cpp requires **WAV** format input (16-bit PCM, ideally 16 kHz mono).

### ffmpeg Installation

```bash
brew install ffmpeg
```

Version installed: 8.0.1 (with dav1d, lame, libvpx, opus, svt-av1, x264, x265)

### Converting Audio Files

```bash
# m4a → wav (16kHz mono, optimal for Whisper)
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav

# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Any format → wav
ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
```
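
When several recordings need converting, the same command can be wrapped in a small helper. This is a sketch: `wav_name` and `to_wav16k` are hypothetical names, and `-c:a pcm_s16le` is added only to make the 16-bit PCM sample format explicit.

```bash
# Derive the output name: same path, extension replaced with .wav.
wav_name() {
  printf '%s\n' "${1%.*}.wav"
}

# Convert one file to 16 kHz mono 16-bit PCM WAV.
# -n makes ffmpeg refuse to overwrite an existing output file.
to_wav16k() {
  ffmpeg -n -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name "$1")"
}
```

Typical use: `for f in ~/Downloads/*.m4a; do to_wav16k "$f"; done`. Because of `-n`, an input that is already a `.wav` file is left untouched rather than overwritten.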

### Tested Conversion (2026-02-19)

```bash
ffmpeg -i '/Users/sd9235/Downloads/New Recording.m4a' \
  -ar 16000 -ac 1 \
  '/Users/sd9235/Downloads/recording.wav'
# Result: 181 KB, 5.80 seconds, 16kHz mono
```
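
The reported size can be cross-checked by hand: uncompressed PCM takes sample rate × channels × bytes per sample × duration. A small sketch, assuming a canonical 44-byte WAV header (ffmpeg may write extra metadata chunks, so real files can be slightly larger); `wav_bytes` is an illustrative name:

```bash
# Expected size in bytes of an uncompressed PCM WAV file.
# Args: sample_rate_hz channels bits_per_sample duration_seconds
# Assumption: a canonical 44-byte header.
wav_bytes() {
  awk -v r="$1" -v c="$2" -v b="$3" -v d="$4" \
    'BEGIN { printf "%.0f\n", r * c * (b / 8) * d + 44 }'
}
```

`wav_bytes 16000 1 16 5.80` prints `185644`, about 181 KiB, which agrees with the 181 KB result recorded above.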

---

## Usage

### File Transcription

```bash
whisper-cli \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en \
  --file /path/to/audio.wav
```

### Real-Time Microphone Streaming

```bash
whisper-stream \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --language en
```

This is particularly relevant for **LysnrAI** — real-time mic transcription locally, no Azure Speech SDK needed for dev/testing.

### HTTP Server Mode

```bash
# Start server on port 8080
whisper-server \
  --model ~/whisper-models/ggml-large-v3-turbo.bin \
  --port 8080

# POST audio to get transcription
curl -X POST http://localhost:8080/inference \
  -F "file=@audio.wav" \
  -F "response_format=json"
```
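
### Voice Activity Detection

```bash
# VAD uses a separate VAD model (e.g. Silero), not a Whisper transcription model.
# Assumption: ggml-silero-v5.1.2.bin has been downloaded to ~/whisper-models/.
whisper-vad-speech-segments \
  --vad-model ~/whisper-models/ggml-silero-v5.1.2.bin \
  --file recording.wav
```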

### Benchmarking

```bash
whisper-bench --model ~/whisper-models/ggml-large-v3-turbo.bin
```

---

## Integration with LysnrAI

The local Whisper stack can serve as an **offline fallback** or **dev replacement** for Azure Speech SDK:

| Component          | Azure (production) | Whisper.cpp (local dev)   |
| ------------------ | ------------------ | ------------------------- |
| Real-time STT      | Azure Speech SDK   | `whisper-stream`          |
| File transcription | Azure Batch        | `whisper-cli`             |
| HTTP API           | Azure REST API     | `whisper-server`          |
| Cost               | Pay-per-use        | $0.00 (local)             |
| Latency (estimate) | ~200ms (network)   | ~50ms (local Metal)       |
| Languages          | 100+               | ~99 (multilingual models) |

The latency figures are estimates, not measurements: no Whisper model has been benchmarked on this machine yet because the model download is still blocked.

### Potential Integration Points

1. **Desktop app** (`src/audio/azure_stt.py`): Add local Whisper backend option
2. **iOS keyboard** (`LysnrKeyboard/`): Use on-device Whisper for offline dictation
3. **Extraction service evals**: Transcribe test audio fixtures locally
86
__LOCAL_LLMs/docs/README.md
Normal file
@@ -0,0 +1,86 @@
# Local LLM Stack — Documentation Index

> Complete guide for the local AI inference stack on the ByteLyst development machine.
> Hardware: **Apple M4 Pro · 48 GB LPDDR5X · macOS Tahoe**
> Last updated: 2026-02-19

---

## Quick Start

```bash
# 1. Start Ollama
ollama serve  # or: brew services start ollama

# 2. Load a model
ollama run qwen2.5-coder:32b  # best coding model for this hardware

# 3. Launch Mission Control dashboard
cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
# Open http://localhost:3100
```
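
Before step 2 it can be useful to confirm that the server from step 1 is actually answering. A minimal sketch, assuming Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint (which lists local models); `ollama_up` is a hypothetical helper name:

```bash
# Return 0 if an Ollama server answers at the given base URL.
# Defaults to Ollama's standard local address.
ollama_up() {
  base="${1:-http://localhost:11434}"
  curl -fsS --max-time 2 -o /dev/null "$base/api/tags" 2>/dev/null
}
```

Typical use: `ollama_up && echo "Ollama is running" || echo "start it with: ollama serve"`.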

---

## Documentation

| #   | Document                                                     | Description                                                          |
| --- | ------------------------------------------------------------ | -------------------------------------------------------------------- |
| 01  | [Hardware & Prerequisites](01-hardware-and-prerequisites.md) | Machine specs, installed toolchain, disk/RAM budget                  |
| 02  | [Ollama Setup & Models](02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior       |
| 03  | [Whisper.cpp Setup](03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI usage, real-time streaming |
| 04  | [Multimodal Local Stack](04-multimodal-local-stack.md)       | Vision models, audio pipeline, video understanding status            |
| 05  | [Mission Control Dashboard](05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features, running       |
| 06  | [Extraction Service Evals](06-extraction-service-evals.md)   | promptfoo eval suite, Ollama vs Gemini comparison, Python sidecar    |
| 07  | [Model Recommendations](07-model-recommendations.md)         | Tiered model guide by use case, size, and quality for M4 Pro 48GB    |
| 08  | [Troubleshooting & Corporate Proxy](08-troubleshooting.md)   | Common issues, Forcepoint proxy workarounds, MLX warnings            |
| 09  | [Environment Variables](09-environment-variables.md)         | All config vars for Ollama, Whisper, dashboard, evals                |

---

## Directory Structure

```
__LOCAL_LLMs/
├── README.md                          ← top-level README (moved from LOCAL_LLMs_setup_mac_m4_48gb.md)
├── docs/
│   ├── README.md                      ← this index (you are here)
│   ├── 01-hardware-and-prerequisites.md
│   ├── 02-ollama-setup-and-models.md
│   ├── 03-whisper-cpp-setup.md
│   ├── 04-multimodal-local-stack.md
│   ├── 05-mission-control-dashboard.md
│   ├── 06-extraction-service-evals.md
│   ├── 07-model-recommendations.md
│   ├── 08-troubleshooting.md
│   └── 09-environment-variables.md
├── dashboard/                         ← Next.js Mission Control app (port 3100)
│   ├── src/app/page.tsx               ← main dashboard UI
│   ├── src/app/api/ollama/route.ts    ← Ollama API proxy (list, load, unload, generate)
│   ├── src/app/api/whisper/route.ts   ← Whisper binary/model discovery
│   └── src/app/api/system/route.ts    ← System info (chip, RAM, disk, brew)
└── LOCAL_LLMs_setup_mac_m4_48gb.md    ← original doc (preserved, see docs/ for latest)
```

---

## Current Installation Status (2026-02-19)

| Component                           | Version    | Status                        | Disk Usage |
| ----------------------------------- | ---------- | ----------------------------- | ---------- |
| Ollama                              | 0.16.2     | ✅ Installed via brew         | —          |
| qwen2.5-coder:32b                   | —          | ✅ Downloaded                 | 19 GB      |
| llama3.1:8b                         | —          | ✅ Downloaded                 | 4.9 GB     |
| whisper-cpp                         | 1.8.3      | ✅ Installed via brew         | 9.6 MB     |
| whisper model (ggml-large-v3-turbo) | —          | ❌ Blocked by corporate proxy | —          |
| ffmpeg                              | 8.0.1      | ✅ Installed via brew         | 53.3 MB    |
| Mission Control Dashboard           | Next.js 16 | ✅ Built, runs on :3100       | —          |

---

## Related Resources

- **Extraction service evals:** `services/extraction-service/evals/`
- **Ollama REST API docs:** https://github.com/ollama/ollama/blob/main/docs/api.md
- **Whisper.cpp:** https://github.com/ggerganov/whisper.cpp
- **Hugging Face models:** https://huggingface.co/ggerganov/whisper.cpp/tree/main