docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2), audio pipeline architecture, video understanding status, Kimi alternatives, complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general, reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB, use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning, JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio format conversion, proxy-corrupted downloads detection
2026-02-19 13:01:22 -08:00 · 2026-02-19 13:01:22 -08:00 · 3561deee52
commit 3561deee52
parent 80f794dee7
3 changed files with 498 additions and 0 deletions
--- a/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
+++ b/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
@@ -0,0 +1,171 @@
+# 04 — Multimodal Local Stack
+
+> Vision models, audio pipeline architecture, and video understanding status.
+
+---
+
+## Overview
+
+A fully local multimodal AI stack chains five stages:
+
+```
+Audio In  →  whisper.cpp (STT)   →  text
+Image In  →  vision LLM (Ollama) →  text description / analysis
+Video In  →  frame extraction    →  vision LLM per frame → text
+Text      →  LLM (Ollama)        →  response
+Text Out  →  TTS (optional)      →  audio
+```
+
+---
+
+## Vision Models (Image Understanding) ✅ Available
+
+These run on Ollama and accept image input alongside text prompts.
+
+### Available on Ollama
+
+| Model             | Size   | RAM Needed              | Capability                      |
+| ----------------- | ------ | ----------------------- | ------------------------------- |
+| `qwen2.5vl:72b`   | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally             |
+| `llava:34b`       | ~22 GB | ~22 GB                  | Strong image understanding, OCR |
+| `llava:13b`       | ~9 GB  | ~9 GB                   | Good balance                    |
+| `llava-llama3:8b` | ~6 GB  | ~6 GB                   | LLaVA on Llama3 base            |
+| `minicpm-v:8b`    | ~6 GB  | ~6 GB                   | Strong vision + OCR             |
+| `qwen2.5vl:7b`    | ~6 GB  | ~6 GB                   | Qwen vision — very capable      |
+| `moondream`       | ~2 GB  | ~2 GB                   | Tiny, fast, basic vision        |
+
+### Recommended for M4 Pro 48 GB
+
+- **Best quality:** `qwen2.5vl:72b` — possible but tight (~45 GB, leaves 3 GB)
+- **Safe choice:** `llava:34b` — 22 GB, leaves 26 GB free for other tools
+- **Fast:** `qwen2.5vl:7b` — 6 GB, very responsive
+
+### Pull & Use
+
+```bash
+# Pull a vision model
+ollama pull llava:34b
+
+# Use with an image (include the file path in the prompt)
+ollama run llava:34b "Describe this image: /path/to/image.png"
+```
+
+### Vision API
+
+```bash
+curl http://localhost:11434/api/generate -d '{
+  "model": "llava:34b",
+  "prompt": "What do you see in this image?",
+  "images": ["<base64-encoded-image>"]
+}'
+```
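+
+As a minimal sketch of filling in the `<base64-encoded-image>` placeholder, the helper below encodes a file and prints the request body. `build_vision_payload` is a hypothetical name, `"stream": false` is added so the server returns a single JSON object, and the prompt is assumed to contain no double quotes or backslashes (no JSON escaping is done).
+
+```bash
+# Hypothetical helper: print an /api/generate request body for one image.
+# Assumes the prompt needs no JSON escaping.
+build_vision_payload() (
+  model=$1
+  prompt=$2
+  image=$3
+  # base64 may wrap its output; strip newlines so it fits in one JSON string
+  b64=$(base64 < "$image" | tr -d '\n')
+  printf '{"model": "%s", "prompt": "%s", "stream": false, "images": ["%s"]}\n' \
+    "$model" "$prompt" "$b64"
+)
+```
+
+One possible use: `build_vision_payload llava:34b "What do you see?" /path/to/image.png | curl -s http://localhost:11434/api/generate -d @-`.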
+
+---
+
+## Audio Pipeline ✅ Available (via Whisper.cpp)
+
+No Ollama models handle audio input natively. Audio requires a separate pipeline:
+
+```
+Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
+```
+
+### Components
+
+| Stage              | Tool                                          | Status                       |
+| ------------------ | --------------------------------------------- | ---------------------------- |
+| **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
+| **LLM Processing** | Ollama (any model)                            | ✅ Ready                     |
+| **Text-to-Speech** | kokoro / piper (local TTS)                    | Not installed                |
+
+### Real-Time Voice → LLM Pipeline
+
+```bash
+# whisper-talk-llama runs speech-to-text and the LLM in one binary
+whisper-talk-llama \
+  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
+  --model-llama ~/.ollama/models/... \
+  --language en
+```
+
+> Note: `whisper-talk-llama` is installed but requires both Whisper + LLaMA GGUF models configured.
+
+### Local TTS Options (Not Yet Installed)
+
+| Tool          | Quality   | Speed     | Install                  |
+| ------------- | --------- | --------- | ------------------------ |
+| **Kokoro**    | Excellent | Fast      | `pip install kokoro`     |
+| **Piper**     | Good      | Very fast | `brew install piper`     |
+| **espeak-ng** | Basic     | Instant   | `brew install espeak-ng` |
+
+---
+
+## Video Understanding ❌ Not Available Locally
+
+**True video understanding is not yet available on local models.** Current state:
+
+- No end-to-end video model runs on Ollama
+- Workaround: Extract frames with ffmpeg → process each frame with a vision model
+- Practical for screenshots/thumbnails, not for real video understanding
+
+### Frame Extraction Workaround
+
+```bash
+# Extract 1 frame per second from the video (ffmpeg does not create the output directory)
+mkdir -p frames
+ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg
+
+# Then process each frame with a vision model; the image path goes inside the prompt
+for f in ./frames/*.jpg; do
+  ollama run llava:34b "Describe what's happening in this image: $f"
+done
+```
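+
+To tie each description back to a position in the video, the frame number can be converted to a timestamp. This is a sketch under the assumption that frames were extracted at a constant rate with the `frame_%04d.jpg` pattern above, so frame N sits roughly (N - 1) / fps seconds in; `frame_timestamp` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: map frames/frame_0075.jpg to an approximate HH:MM:SS offset.
+# $1 = frame path, $2 = extraction rate in frames per second (default 1)
+frame_timestamp() (
+  fps=${2:-1}
+  n=$(basename "$1" .jpg)
+  n=${n#frame_}
+  # Drop zero padding (0075 -> 75) so the shell does not read it as octal
+  n=$(printf '%s' "$n" | sed 's/^0*//')
+  secs=$(( (${n:-1} - 1) / $fps ))
+  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
+)
+```
+
+For example, `frame_timestamp frames/frame_0075.jpg` prints `00:01:14` at the default 1 fps.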
+
+### For Real Video Understanding
+
+Use cloud APIs:
+
+- **Google Gemini** — native video input support
+- **OpenAI GPT-4o** — accepts video frames
+- **Anthropic Claude** — accepts images (frame extraction needed)
+
+---
+
+## Kimi (Moonshot AI) — Cloud Only
+
+Kimi models are **not available on Ollama** — they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.
+
+### Closest Local Alternatives
+
+| Looking For              | Local Alternative   | Why Similar                    |
+| ------------------------ | ------------------- | ------------------------------ |
+| Kimi k1.5 (reasoning)    | `deepseek-r1:32b`   | Strong reasoning, long context |
+| Kimi long context (128k) | `qwen2.5:72b`       | Same tier, 128k context        |
+| Kimi coding              | `qwen2.5-coder:32b` | Best local coding model        |
+
+```bash
+ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused
+```
+
+---
+
+## Complete Local AI Stack Summary
+
+```
+┌─────────────────────────────────────────────────────┐
+│              Local AI Stack (M4 Pro 48GB)            │
+├─────────────────────────────────────────────────────┤
+│                                                     │
+│  Audio In ──→ whisper-cpp (STT) ──→ Text            │
+│                                      │              │
+│  Image In ──→ llava:34b (Vision) ──→ Text           │
+│                                      │              │
+│  Text ──────→ qwen2.5-coder:32b ──→ Response        │
+│                                      │              │
+│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
+│                                                     │
+│  Server: Ollama :11434                              │
+│  Dashboard: Mission Control :3100                   │
+│  Whisper Server: whisper-server :8080 (optional)    │
+│                                                     │
+└─────────────────────────────────────────────────────┘
+```
--- a/__LOCAL_LLMs/docs/07-model-recommendations.md
+++ b/__LOCAL_LLMs/docs/07-model-recommendations.md
@@ -0,0 +1,106 @@
+# 07 — Model Recommendations
+
+> Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.
+
+---
+
+## Tier 1 — Best Overall Coding Models
+
+| Model                   | Size  | RAM Used | Pull Command                    | Notes                                              |
+| ----------------------- | ----- | -------- | ------------------------------- | -------------------------------------------------- |
+| **`qwen2.5-coder:32b`** | 19 GB | ~22 GB   | `ollama pull qwen2.5-coder:32b` | **Top pick** — rivals GPT-4o on code, 128k context |
+| `deepseek-coder-v2:16b` | 10 GB | ~12 GB   | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B               |
+| `codestral:22b`         | 13 GB | ~15 GB   | `ollama pull codestral`         | Mistral's coding model, very fast completions      |
+
+## Tier 2 — Fast & Capable (Speed/Quality Balance)
+
+| Model                  | Size | RAM Used | Pull Command                      | Notes                                         |
+| ---------------------- | ---- | -------- | --------------------------------- | --------------------------------------------- |
+| **`qwen2.5-coder:7b`** | 5 GB | ~6 GB    | `ollama pull qwen2.5-coder:7b`    | Fast, surprisingly good for TS/Python/Swift   |
+| `deepseek-coder:6.7b`  | 4 GB | ~5 GB    | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding            |
+| `codegemma:7b`         | 5 GB | ~6 GB    | `ollama pull codegemma:7b`        | Google's model, decent but outclassed by Qwen |
+
+## Tier 3 — General Purpose (Also Good at Code)
+
+| Model               | Size   | RAM Used | Pull Command               | Notes                               |
+| ------------------- | ------ | -------- | -------------------------- | ----------------------------------- |
+| `llama3.1:70b` (Q4) | 40 GB  | ~42 GB   | `ollama pull llama3.1:70b` | Best general model — tight on 48 GB |
+| `llama3.1:8b`       | 4.9 GB | ~6 GB    | `ollama pull llama3.1:8b`  | Fast, good for evals                |
+| `mistral-nemo:12b`  | 7 GB   | ~9 GB    | `ollama pull mistral-nemo` | Fast reasoning                      |
+| `phi4:14b`          | 9 GB   | ~11 GB   | `ollama pull phi4`         | Strong reasoning, fits comfortably  |
+
+## Tier 4 — Reasoning & Deep Thinking
+
+| Model                 | Size  | RAM Used | Pull Command                  | Notes                                            |
+| --------------------- | ----- | -------- | ----------------------------- | ------------------------------------------------ |
+| **`deepseek-r1:32b`** | 20 GB | ~22 GB   | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning, closest to Kimi k1.5 |
+| `deepseek-r1:7b`      | 5 GB  | ~6 GB    | `ollama pull deepseek-r1:7b`  | Lightweight reasoning                            |
+
+## Tier 5 — Vision (Multimodal)
+
+| Model          | Size  | RAM Used | Pull Command               | Notes                    |
+| -------------- | ----- | -------- | -------------------------- | ------------------------ |
+| `llava:34b`    | 22 GB | ~22 GB   | `ollama pull llava:34b`    | Image understanding, OCR |
+| `qwen2.5vl:7b` | 6 GB  | ~6 GB    | `ollama pull qwen2.5vl:7b` | Qwen vision, fast        |
+| `minicpm-v:8b` | 6 GB  | ~6 GB    | `ollama pull minicpm-v:8b` | Strong OCR               |
+| `moondream`    | 2 GB  | ~2 GB    | `ollama pull moondream`    | Tiny, basic vision       |
+
+## Tier 6 — Embeddings
+
+| Model               | Size   | RAM Used | Pull Command                    | Notes                     |
+| ------------------- | ------ | -------- | ------------------------------- | ------------------------- |
+| `nomic-embed-text`  | 0.3 GB | ~0.5 GB  | `ollama pull nomic-embed-text`  | Good for semantic search  |
+| `mxbai-embed-large` | 0.7 GB | ~1 GB    | `ollama pull mxbai-embed-large` | Higher quality embeddings |
+
+---
+
+## Recommended 10-Model Stack for M4 Pro 48 GB
+
+For maximum coverage across all use cases:
+
+| #   | Model                   | Disk        | Use Case                                 |
+| --- | ----------------------- | ----------- | ---------------------------------------- |
+| 1   | `qwen2.5-coder:32b`     | 19 GB       | **Primary** — coding (TS, Python, Swift) |
+| 2   | `qwen2.5-coder:7b`      | 5 GB        | Fast coding completions                  |
+| 3   | `deepseek-coder-v2:16b` | 10 GB       | Alternative coding model                 |
+| 4   | `llama3.1:8b`           | 4.9 GB      | Eval default, general tasks              |
+| 5   | `deepseek-r1:32b`       | 20 GB       | Deep reasoning, complex triage           |
+| 6   | `codestral:22b`         | 13 GB       | Fast code completions (Mistral)          |
+| 7   | `phi4:14b`              | 9 GB        | Reasoning, structured output             |
+| 8   | `llava:34b`             | 22 GB       | Vision / image understanding             |
+| 9   | `mistral-nemo:12b`      | 7 GB        | Fast general purpose                     |
+| 10  | `nomic-embed-text`      | 0.3 GB      | Embeddings / semantic search             |
+|     | **Total**               | **~110 GB** |                                          |
+
+Models load into RAM only when they are used, so all 10 can stay on disk simultaneously. On 48 GB, plan on keeping one large model in memory at a time.
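+
+The disk total can be checked by summing the Disk column. The sketch below hard-codes the ten sizes from the table (in GB); `stack_total_gb` is a hypothetical helper name.
+
+```bash
+# Sum the Disk column of the 10-model stack (sizes in GB, in table order)
+stack_total_gb() {
+  printf '%s\n' 19 5 10 4.9 20 13 9 22 7 0.3 |
+    awk '{ total += $1 } END { printf "%.1f\n", total }'
+}
+
+stack_total_gb   # prints 110.2
+```
+
+Actual usage on a given machine is what `du -sh ~/.ollama/models/` reports, since published sizes vary by quantization tag.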
+
+---
+
+## By Use Case (Quick Reference)
+
+| Use Case                   | Best Model          | Fallback                |
+| -------------------------- | ------------------- | ----------------------- |
+| **TypeScript/ESM coding**  | `qwen2.5-coder:32b` | `qwen2.5-coder:7b`      |
+| **Python coding**          | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` |
+| **Swift/iOS coding**       | `qwen2.5-coder:32b` | `codestral:22b`         |
+| **Extraction evals**       | `llama3.1:8b`       | `qwen2.5:7b`            |
+| **JSON structured output** | `qwen2.5:7b`        | `qwen2.5-coder:7b`      |
+| **Complex reasoning**      | `deepseek-r1:32b`   | `phi4:14b`              |
+| **Image understanding**    | `llava:34b`         | `qwen2.5vl:7b`          |
+| **Embeddings**             | `nomic-embed-text`  | `mxbai-embed-large`     |
+| **Fast iteration**         | `qwen2.5-coder:7b`  | `llama3.1:8b`           |
+
+---
+
+## Hardware Guide (General)
+
+For reference if running on different hardware:
+
+| RAM    | Max Model Size | Recommendation                        |
+| ------ | -------------- | ------------------------------------- |
+| 8 GB   | 7B             | `qwen2.5-coder:7b`                    |
+| 16 GB  | 13-16B         | `deepseek-coder-v2:16b`               |
+| 24 GB  | 32B            | `qwen2.5-coder:32b`                   |
+| 32 GB  | 32B + headroom | `qwen2.5-coder:32b` (comfortable)     |
+| 48 GB  | 70B (Q4)       | `llama3.1:70b` or any 32B comfortably |
+| 64 GB+ | 70B (Q4/Q5)    | 70B with headroom for context; 70B at Q8 (~75 GB) needs more than 64 GB |
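+
+The sizes in this table follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.5 bits per weight for Q4 and 8.5 for Q8 (rough averages, not exact figures for any specific file) and ignores the KV cache and runtime overhead, which add several GB. `est_model_gb` is a hypothetical helper name.
+
+```bash
+# Rough weight size in GB. $1 = parameters in billions, $2 = assumed bits per weight
+est_model_gb() {
+  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
+}
+
+est_model_gb 32 4.5   # 32B at ~Q4
+est_model_gb 70 4.5   # 70B at ~Q4
+est_model_gb 70 8.5   # 70B at ~Q8
+```
+
+Under these assumptions a 32B model at Q4 comes to about 18 GB, and a 70B model to about 39 GB at Q4 or about 74 GB at Q8, before context memory.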
--- a/__LOCAL_LLMs/docs/08-troubleshooting.md
+++ b/__LOCAL_LLMs/docs/08-troubleshooting.md
@@ -0,0 +1,221 @@
+# 08 — Troubleshooting & Corporate Proxy
+
+> Common issues, Forcepoint proxy workarounds, MLX warnings, and fixes.
+
+---
+
+## Corporate Proxy (Forcepoint CertChecker)
+
+This machine is behind an AT&T Forcepoint proxy that performs SSL deep packet inspection.
+
+### Proxy Details
+
+| Setting   | Value                                         |
+| --------- | --------------------------------------------- |
+| Proxy URL | `http://cso.proxy.att.com:8080/`              |
+| Agent     | Forcepoint CertChecker                        |
+| Impact    | Intercepts HTTPS, replaces TLS certificates   |
+| Env vars  | `HTTP_PROXY`, `HTTPS_PROXY` set automatically |
+
+### What Works Through Proxy
+
+| Tool                       | Status     | Notes                                 |
+| -------------------------- | ---------- | ------------------------------------- |
+| `ollama pull`              | ✅ Works   | Ollama handles proxy natively         |
+| `brew install`             | ✅ Works   | Homebrew handles proxy                |
+| `npm install`              | ✅ Works   | With `NODE_TLS_REJECT_UNAUTHORIZED=0` |
+| `curl` to Hugging Face     | ❌ Blocked | Returns 19 KB HTML redirect page      |
+| `curl -k` to Hugging Face  | ❌ Blocked | Still intercepted even with `-k`      |
+| `python requests` to HF    | ❌ Blocked | `CERTIFICATE_VERIFY_FAILED` SSL error |
+| `huggingface_hub` download | ❌ Blocked | Falls back to cached (broken) files   |
+
+### Workaround: Download Off-Network
+
+For Hugging Face model downloads (e.g., Whisper GGML files):
+
+1. **Disconnect** from corporate VPN/Wi-Fi
+2. **Connect** to personal hotspot or home Wi-Fi
+3. Run the download:
+   ```bash
+   curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+     https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
+   ```
+4. **Reconnect** to the corporate network. The model file stays on disk, so the download is a one-time step.
+
+### Detecting a Proxy-Corrupted Download
+
+If a download completed but the file is suspiciously small:
+
+```bash
+# Check file size (should be ~1.6 GB for large-v3-turbo)
+ls -lh ~/whisper-models/ggml-large-v3-turbo.bin
+
+# Check file type (should NOT be HTML)
+file ~/whisper-models/ggml-large-v3-turbo.bin
+
+# If it says "HTML document text" — delete and re-download off-network
+rm ~/whisper-models/ggml-large-v3-turbo.bin
+```
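+
+The same check can be scripted. The sketch below treats a file as a proxy block page when its first 512 bytes contain an HTML doctype or `<html` tag; `looks_like_html` is a hypothetical helper, and the 512-byte window is an assumption.
+
+```bash
+# Hypothetical helper: succeed if the file starts like an HTML page
+looks_like_html() (
+  # LC_ALL=C keeps grep from choking on binary model data
+  head -c 512 "$1" | LC_ALL=C grep -qi -e '<!doctype html' -e '<html'
+)
+
+model="$HOME/whisper-models/ggml-large-v3-turbo.bin"
+if [ -f "$model" ] && looks_like_html "$model"; then
+  echo "HTML block page, not a model: delete and re-download off-network"
+fi
+```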
+
+---
+
+## Ollama Issues
+
+### `MLX dynamic library not available`
+
+```
+WARN MLX dynamic library not available error="failed to load MLX dynamic library"
+```
+
+**Severity:** Harmless
+**Cause:** Ollama looks for the Apple MLX framework, which is not installed
+**Impact:** None. Ollama falls back to the Metal backend, which is fully functional on M4 Pro
+**Fix:** None needed. Ignore the warning.
+
+### Model Pull Fails (SSL / Proxy)
+
+```bash
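+# The Ollama server performs the download, so set NO_PROXY where it runs (stop any running instance first)
+NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
+# Then pull from a second terminal
+ollama pull llama3.1:8b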
+```
+
+### Ollama Not Responding
+
+```bash
+# Check if running
+curl http://localhost:11434/api/tags
+
+# Restart
+brew services restart ollama
+# or
+pkill ollama && ollama serve
+```
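+
+After a restart the API can take a moment to come up. A small polling loop, sketched below, waits until `/api/tags` answers; `wait_for_ollama` is a hypothetical helper, and the retry count and 2-second timeout are arbitrary choices.
+
+```bash
+# Hypothetical helper: poll the Ollama API until it responds or tries run out.
+# $1 = base URL (default http://localhost:11434), $2 = max attempts (default 10)
+wait_for_ollama() (
+  url=${1:-http://localhost:11434}
+  tries=${2:-10}
+  i=1
+  while [ "$i" -le "$tries" ]; do
+    if curl -fsS --max-time 2 "$url/api/tags" >/dev/null 2>&1; then
+      exit 0   # server answered
+    fi
+    if [ "$i" -lt "$tries" ]; then sleep 1; fi
+    i=$((i + 1))
+  done
+  exit 1       # still unreachable
+)
+```
+
+For example: `brew services restart ollama && wait_for_ollama && ollama list`.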
+
+### JSON Parse Errors in Evals
+
+Model returned markdown-wrapped JSON (` ```json ... ``` `). Fix by adding to your prompt:
+
+```
+Return ONLY a valid JSON object — no markdown, no backticks, no explanation.
+```
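+
+Prompting alone is not always reliable, so evals can also strip the fence before parsing. The sketch below deletes any line that begins with three backticks; `strip_json_fence` is a hypothetical helper and assumes the JSON itself contains no such line. Ollama's API also accepts `"format": "json"` in the request body as another way to constrain output.
+
+```bash
+# Hypothetical helper: remove Markdown fence lines from model output on stdin
+strip_json_fence() {
+  sed -e '/^[[:space:]]*`\{3\}/d'
+}
+```
+
+For example, `ollama run llama3.1:8b "$PROMPT" | strip_json_fence` leaves only the JSON body when the model wraps its answer (`$PROMPT` stands in for your eval prompt).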
+
+### Slow Inference
+
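+Check Activity Monitor or `ollama ps`: Ollama should be running on the GPU (Metal). If inference is still slow, restart the server with flash attention and a quantized KV cache: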
+
+```bash
+# Restart with performance flags
+OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+```
+
+### Model Won't Unload from RAM
+
+```bash
+# Force unload via API
+curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "prompt": "", "keep_alive": "0"}'
+
+# Or restart Ollama entirely
+brew services restart ollama
+```
+
+### Disk Space Running Low
+
+```bash
+# Check Ollama disk usage
+du -sh ~/.ollama/models/
+
+# List models with sizes
+ollama list
+
+# Remove models you don't need
+ollama rm <model-name>
+```
+
+---
+
+## Whisper.cpp Issues
+
+### `command not found: whisper-cpp`
+
+The binary is named `whisper-cli`, NOT `whisper-cpp`:
+
+```bash
+# Wrong
+whisper-cpp --model ...
+
+# Correct
+whisper-cli --model ...
+```
+
+Full list of binaries: `ls /opt/homebrew/bin/whisper-*`
+
+### Audio Format Not Supported
+
+`whisper-cli` expects 16 kHz WAV input, so convert other formats first:
+
+```bash
+# m4a → wav (16kHz mono)
+ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
+
+# mp3 → wav
+ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
+```
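+
+For a folder of recordings, the conversion can be looped. This is a sketch: `to_wav_all` is a hypothetical helper, and the `FFMPEG` variable is an assumption added here so the command can be swapped out for a dry run.
+
+```bash
+# Hypothetical helper: convert every .m4a/.mp3 in a directory to 16 kHz mono WAV.
+# $1 = directory (default: current directory)
+to_wav_all() (
+  dir=${1:-.}
+  for src in "$dir"/*.m4a "$dir"/*.mp3; do
+    [ -e "$src" ] || continue            # glob matched nothing
+    dst="${src%.*}.wav"
+    if [ -e "$dst" ]; then continue; fi  # keep existing output
+    # -nostdin stops ffmpeg from reading the terminal inside the loop
+    "${FFMPEG:-ffmpeg}" -nostdin -i "$src" -ar 16000 -ac 1 "$dst"
+  done
+)
+```
+
+Running `to_wav_all ~/recordings` converts that folder in place and skips files that already have a `.wav` next to them.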
+
+### Model File Is HTML (Proxy-Corrupted)
+
+```bash
+file ~/whisper-models/ggml-large-v3-turbo.bin
+# If output says "HTML document text" — it's corrupted
+rm ~/whisper-models/ggml-large-v3-turbo.bin
+# Re-download off corporate network
+```
+
+### `ffmpeg: command not found`
+
+```bash
+brew install ffmpeg
+```
+
+---
+
+## Dashboard Issues
+
+### Port Conflict
+
+```bash
+# Default port 3100, change if needed
+npm run dev -- -p 3101
+```
+
+### Lockfile Warning
+
+```
+Warning: Next.js inferred your workspace root, but it may not be correct.
+We detected multiple lockfiles...
+```
+
+This is harmless: the dashboard has its own `package-lock.json` inside the pnpm monorepo. Silence it by setting `turbopack.root` in `next.config.ts`.
+
+### API Routes Return Empty Data
+
+- **Ollama offline:** Start with `ollama serve` or `brew services start ollama`
+- **Whisper not installed:** Run `brew install whisper-cpp`
+- **No models:** Check `ollama list` and `ls ~/whisper-models/`
+
+---
+
+## General macOS Issues
+
+### Accessibility / Permissions
+
+Some tools (e.g., `whisper-stream` for mic access) need explicit macOS permissions:
+
+System Settings → Privacy & Security → Microphone → enable Terminal / your IDE
+
+### Node.js TLS Warning
+
+```
+Warning: Setting NODE_TLS_REJECT_UNAUTHORIZED to '0' makes TLS connections insecure
+```
+
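+This variable is set in the corporate environment so Node.js accepts the certificates substituted by the Forcepoint proxy. The warning is expected in local dev, but the setting disables certificate verification, so keep it out of production.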