docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2), audio pipeline architecture, video understanding status, Kimi alternatives, complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general, reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB, use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning, JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio format conversion, proxy-corrupted downloads detection
2026-02-19 13:01:22 -08:00
commit 3561deee52
parent 80f794dee7
3 changed files with 498 additions and 0 deletions
--- a/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
+++ b/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
@@ -0,0 +1,171 @@
 # 04 — Multimodal Local Stack
 > Vision models, audio pipeline architecture, and video understanding status.
 ---
 ## Overview
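 A fully local multimodal AI stack combines three input pipelines (audio, image, video) that all reduce to text, a text LLM that produces the response, and an optional TTS stage for audio output: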
 ```
 Audio In  →  whisper.cpp (STT)   →  text
 Image In  →  vision LLM (Ollama) →  text description / analysis
 Video In  →  frame extraction    →  vision LLM per frame → text
 Text      →  LLM (Ollama)        →  response
 Text Out  →  TTS (optional)      →  audio
 ```
 ---
 ## Vision Models (Image Understanding) ✅ Available
 These run on Ollama and accept image input alongside text prompts.
 ### Available on Ollama
 | Model             | Size   | RAM Needed              | Capability                      |
 | ----------------- | ------ | ----------------------- | ------------------------------- |
 | `qwen2.5vl:72b`   | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally             |
 | `llava:34b`       | ~22 GB | ~22 GB                  | Strong image understanding, OCR |
 | `llava:13b`       | ~9 GB  | ~9 GB                   | Good balance                    |
 | `llava-llama3:8b` | ~6 GB  | ~6 GB                   | LLaVA on Llama3 base            |
 | `minicpm-v:8b`    | ~6 GB  | ~6 GB                   | Strong vision + OCR             |
 | `qwen2.5vl:7b`    | ~6 GB  | ~6 GB                   | Qwen vision — very capable      |
 | `moondream`       | ~2 GB  | ~2 GB                   | Tiny, fast, basic vision        |
 ### Recommended for M4 Pro 48 GB
 - **Best quality:** `qwen2.5vl:72b` — possible but tight (~45 GB, leaves 3 GB)
 - **Safe choice:** `llava:34b` — 22 GB, leaves 26 GB free for other tools
 - **Fast:** `qwen2.5vl:7b` — 6 GB, very responsive
 ### Pull & Use
 ```bash
 # Pull a vision model
 ollama pull llava:34b
 # Use with an image (ollama run has no --images flag; put the file path in the prompt)
 ollama run llava:34b "Describe this image: /path/to/image.png"
 ```
 ### Vision API
 ```bash
 curl http://localhost:11434/api/generate -d '{
  "model": "llava:34b",
  "prompt": "What do you see in this image?",
  "images": ["<base64-encoded-image>"]
 }'
 ```
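The `images` field expects the raw file contents encoded as base64. A minimal sketch of building that request body in shell, assuming the model name and prompt contain no double quotes or backslashes (no JSON escaping is done); `build_vision_payload` is a hypothetical helper name, not part of Ollama:

```bash
# Hypothetical helper: print a /api/generate request body for one image.
# Assumes model and prompt contain no characters that need JSON escaping.
build_vision_payload() {
  model=$1
  prompt=$2
  image=$3
  # base64 may wrap its output; strip newlines so the JSON string stays on one line
  encoded=$(base64 < "$image" | tr -d '\n')
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$model" "$prompt" "$encoded"
}

# Usage (requires a running Ollama server):
#   build_vision_payload llava:34b "What do you see in this image?" /path/to/image.png |
#     curl -s http://localhost:11434/api/generate -d @-
```

Setting `"stream": false` asks the server for a single JSON object instead of a stream of partial responses, which is usually easier to handle in scripts.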
 ---
 ## Audio Pipeline ✅ Available (via Whisper.cpp)
 No Ollama models handle audio input natively. Audio requires a separate pipeline:
 ```
 Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
 ```
 ### Components
 | Stage              | Tool                                          | Status                       |
 | ------------------ | --------------------------------------------- | ---------------------------- |
 | **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
 | **LLM Processing** | Ollama (any model)                            | ✅ Ready                     |
 | **Text-to-Speech** | kokoro / piper (local TTS)                    | Not installed                |
 ### Real-Time Voice → LLM Pipeline
 ```bash
 # whisper-talk-llama does this in one binary!
 whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en
 ```
 > Note: `whisper-talk-llama` is installed but requires both Whisper + LLaMA GGUF models configured.
 ### Local TTS Options (Not Yet Installed)
 | Tool          | Quality   | Speed     | Install                  |
 | ------------- | --------- | --------- | ------------------------ |
 | **Kokoro**    | Excellent | Fast      | `pip install kokoro`     |
 | **Piper**     | Good      | Very fast | `brew install piper`     |
 | **espeak-ng** | Basic     | Instant   | `brew install espeak-ng` |
 ---
 ## Video Understanding ❌ Not Available Locally
 **True video understanding is not yet available on local models.** Current state:
 - No end-to-end video model runs on Ollama
 - Workaround: Extract frames with ffmpeg → process each frame with a vision model
 - Practical for screenshots/thumbnails, not for real video understanding
 ### Frame Extraction Workaround
 ```bash
 # Extract 1 frame per second from video (the output directory must exist)
 mkdir -p frames
 ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg
 # Then process each frame with vision model (image path goes in the prompt)
 for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening in this image: $f"
 done
 ```
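Per-frame descriptions are easier to use when each one is tied back to a position in the video. A small sketch, assuming the `frame_%04d.jpg` naming and the `fps=1` rate from the command above, so that frame N sits at roughly N-1 seconds; `frame_timestamp` is a hypothetical helper, not part of ffmpeg or Ollama:

```bash
# Map a frame file name from the fps=1 extraction to an HH:MM:SS timestamp.
# Assumes the only digits in the file name are the frame counter.
frame_timestamp() {
  name=$(basename "$1")
  digits=$(printf '%s' "$name" | tr -cd '0-9')
  # Drop leading zeros so the shell does not read the number as octal
  index=$(printf '%s' "$digits" | sed 's/^0*//')
  secs=$(( ${index:-1} - 1 ))   # frame_0001 corresponds to t = 0
  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Usage inside the frame loop:
#   printf '[%s] ' "$(frame_timestamp "$f")"
```

Prefixing each description with its timestamp makes it possible to skim the output as a rough timeline of the video.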
 ### For Real Video Understanding
 Use cloud APIs:
 - **Google Gemini** — native video input support
 - **OpenAI GPT-4o** — accepts video frames
 - **Anthropic Claude** — accepts images (frame extraction needed)
 ---
 ## Kimi (Moonshot AI) — Cloud Only
 Kimi models are **not available on Ollama** — they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.
 ### Closest Local Alternatives
 | Looking For              | Local Alternative   | Why Similar                    |
 | ------------------------ | ------------------- | ------------------------------ |
 | Kimi k1.5 (reasoning)    | `deepseek-r1:32b`   | Strong reasoning, long context |
 | Kimi long context (128k) | `qwen2.5:72b`       | Same tier, 128k context        |
 | Kimi coding              | `qwen2.5-coder:32b` | Best local coding model        |
 ```bash
 ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused
 ```
 ---
 ## Complete Local AI Stack Summary
 ```
 ┌─────────────────────────────────────────────────────┐
 │              Local AI Stack (M4 Pro 48GB)            │
 ├─────────────────────────────────────────────────────┤
 │                                                     │
 │  Audio In ──→ whisper-cpp (STT) ──→ Text            │
 │                                      │              │
 │  Image In ──→ llava:34b (Vision) ──→ Text           │
 │                                      │              │
 │  Text ──────→ qwen2.5-coder:32b ──→ Response        │
 │                                      │              │
 │  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
 │                                                     │
 │  Server: Ollama :11434                              │
 │  Dashboard: Mission Control :3100                   │
 │  Whisper Server: whisper-server :8080 (optional)    │
 │                                                     │
 └─────────────────────────────────────────────────────┘
 ```
--- a/__LOCAL_LLMs/docs/07-model-recommendations.md
+++ b/__LOCAL_LLMs/docs/07-model-recommendations.md
@@ -0,0 +1,106 @@
 # 07 — Model Recommendations
 > Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.
 ---
 ## Tier 1 — Best Overall Coding Models
 | Model                   | Size  | RAM Used | Pull Command                    | Notes                                              |
 | ----------------------- | ----- | -------- | ------------------------------- | -------------------------------------------------- |
 | **`qwen2.5-coder:32b`** | 19 GB | ~22 GB   | `ollama pull qwen2.5-coder:32b` | **Top pick** — rivals GPT-4o on code, 128k context |
 | `deepseek-coder-v2:16b` | 10 GB | ~12 GB   | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B               |
 | `codestral:22b`         | 13 GB | ~15 GB   | `ollama pull codestral`         | Mistral's coding model, very fast completions      |
 ## Tier 2 — Fast & Capable (Speed/Quality Balance)
 | Model                  | Size | RAM Used | Pull Command                      | Notes                                         |
 | ---------------------- | ---- | -------- | --------------------------------- | --------------------------------------------- |
 | **`qwen2.5-coder:7b`** | 5 GB | ~6 GB    | `ollama pull qwen2.5-coder:7b`    | Fast, surprisingly good for TS/Python/Swift   |
 | `deepseek-coder:6.7b`  | 4 GB | ~5 GB    | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding            |
 | `codegemma:7b`         | 5 GB | ~6 GB    | `ollama pull codegemma:7b`        | Google's model, decent but outclassed by Qwen |
 ## Tier 3 — General Purpose (Also Good at Code)
 | Model               | Size   | RAM Used | Pull Command               | Notes                               |
 | ------------------- | ------ | -------- | -------------------------- | ----------------------------------- |
 | `llama3.1:70b` (Q4) | 40 GB  | ~42 GB   | `ollama pull llama3.1:70b` | Best general model — tight on 48 GB |
 | `llama3.1:8b`       | 4.9 GB | ~6 GB    | `ollama pull llama3.1:8b`  | Fast, good for evals                |
 | `mistral-nemo:12b`  | 7 GB   | ~9 GB    | `ollama pull mistral-nemo` | Fast reasoning                      |
 | `phi4:14b`          | 9 GB   | ~11 GB   | `ollama pull phi4`         | Strong reasoning, fits comfortably  |
 ## Tier 4 — Reasoning & Deep Thinking
 | Model                 | Size  | RAM Used | Pull Command                  | Notes                                            |
 | --------------------- | ----- | -------- | ----------------------------- | ------------------------------------------------ |
 | **`deepseek-r1:32b`** | 20 GB | ~22 GB   | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning, closest to Kimi k1.5 |
 | `deepseek-r1:7b`      | 5 GB  | ~6 GB    | `ollama pull deepseek-r1:7b`  | Lightweight reasoning                            |
 ## Tier 5 — Vision (Multimodal)
 | Model          | Size  | RAM Used | Pull Command               | Notes                    |
 | -------------- | ----- | -------- | -------------------------- | ------------------------ |
 | `llava:34b`    | 22 GB | ~22 GB   | `ollama pull llava:34b`    | Image understanding, OCR |
 | `qwen2.5vl:7b` | 6 GB  | ~6 GB    | `ollama pull qwen2.5vl:7b` | Qwen vision, fast        |
 | `minicpm-v:8b` | 6 GB  | ~6 GB    | `ollama pull minicpm-v:8b` | Strong OCR               |
 | `moondream`    | 2 GB  | ~2 GB    | `ollama pull moondream`    | Tiny, basic vision       |
 ## Tier 6 — Embeddings
 | Model               | Size   | RAM Used | Pull Command                    | Notes                     |
 | ------------------- | ------ | -------- | ------------------------------- | ------------------------- |
 | `nomic-embed-text`  | 0.3 GB | ~0.5 GB  | `ollama pull nomic-embed-text`  | Good for semantic search  |
 | `mxbai-embed-large` | 0.7 GB | ~1 GB    | `ollama pull mxbai-embed-large` | Higher quality embeddings |
 ---
 ## Recommended 10-Model Stack for M4 Pro 48 GB
 For maximum coverage across all use cases:
 | #   | Model                   | Disk        | Use Case                                 |
 | --- | ----------------------- | ----------- | ---------------------------------------- |
 | 1   | `qwen2.5-coder:32b`     | 19 GB       | **Primary** — coding (TS, Python, Swift) |
 | 2   | `qwen2.5-coder:7b`      | 5 GB        | Fast coding completions                  |
 | 3   | `deepseek-coder-v2:16b` | 10 GB       | Alternative coding model                 |
 | 4   | `llama3.1:8b`           | 4.9 GB      | Eval default, general tasks              |
 | 5   | `deepseek-r1:32b`       | 20 GB       | Deep reasoning, complex triage           |
 | 6   | `codestral:22b`         | 13 GB       | Fast code completions (Mistral)          |
 | 7   | `phi4:14b`              | 9 GB        | Reasoning, structured output             |
 | 8   | `llava:34b`             | 22 GB       | Vision / image understanding             |
 | 9   | `mistral-nemo:12b`      | 7 GB        | Fast general purpose                     |
 | 10  | `nomic-embed-text`      | 0.3 GB      | Embeddings / semantic search             |
 |     | **Total**               | **~110 GB** |                                          |
 Models load into RAM on demand and are unloaded after an idle period (5 minutes by default), so all 10 can sit on disk while only the models currently in use occupy memory.
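The disk total is worth rechecking whenever the stack changes. A minimal sketch that sums the Disk column, with the sizes copied by hand from the table above (for installed models, `ollama list` reports the actual sizes); `total_disk_gb` is a hypothetical helper:

```bash
# Sum a list of model sizes given in GB (one argument per model).
total_disk_gb() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

# Disk column of the 10-model stack, in table order
total_disk_gb 19 5 10 4.9 20 13 9 22 7 0.3
```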
 ---
 ## By Use Case (Quick Reference)
 | Use Case                   | Best Model          | Fallback                |
 | -------------------------- | ------------------- | ----------------------- |
 | **TypeScript/ESM coding**  | `qwen2.5-coder:32b` | `qwen2.5-coder:7b`      |
 | **Python coding**          | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` |
 | **Swift/iOS coding**       | `qwen2.5-coder:32b` | `codestral:22b`         |
 | **Extraction evals**       | `llama3.1:8b`       | `qwen2.5:7b`            |
 | **JSON structured output** | `qwen2.5:7b`        | `qwen2.5-coder:7b`      |
 | **Complex reasoning**      | `deepseek-r1:32b`   | `phi4:14b`              |
 | **Image understanding**    | `llava:34b`         | `qwen2.5vl:7b`          |
 | **Embeddings**             | `nomic-embed-text`  | `mxbai-embed-large`     |
 | **Fast iteration**         | `qwen2.5-coder:7b`  | `llama3.1:8b`           |
 ---
 ## Hardware Guide (General)
 For reference if running on different hardware:
 | RAM    | Max Model Size | Recommendation                        |
 | ------ | -------------- | ------------------------------------- |
 | 8 GB   | 7B             | `qwen2.5-coder:7b`                    |
 | 16 GB  | 13-16B         | `deepseek-coder-v2:16b`               |
 | 24 GB  | 32B            | `qwen2.5-coder:32b`                   |
 | 32 GB  | 32B + headroom | `qwen2.5-coder:32b` (comfortable)     |
 | 48 GB  | 70B (Q4)       | `llama3.1:70b` or any 32B comfortably |
 | 64 GB+ | 70B (Q5/Q6)    | 70B at Q5/Q6; Q8 needs 75 GB or more  |
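The table follows from a rough rule of thumb: weight size in GB is about the parameter count (in billions) times bits per weight, divided by 8, with extra RAM needed on top for the KV cache and the OS. A sketch of that arithmetic; the helper name and the bits-per-weight values passed to it are illustrative assumptions, and real quantized files are usually somewhat larger because common quantization formats use slightly more than their nominal bit width:

```bash
# Rough weight size in GB: params (billions) * bits per weight / 8.
# Ignores KV cache and runtime overhead, so treat the result as a lower bound.
estimate_model_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_model_gb 32 4   # 32B at 4 bits per weight, prints 16.0
estimate_model_gb 70 4   # 70B at 4 bits per weight, prints 35.0
estimate_model_gb 70 8   # 70B at 8 bits per weight, prints 70.0
```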
--- a/__LOCAL_LLMs/docs/08-troubleshooting.md
+++ b/__LOCAL_LLMs/docs/08-troubleshooting.md
@@ -0,0 +1,221 @@
 # 08 — Troubleshooting & Corporate Proxy
 > Common issues, Forcepoint proxy workarounds, MLX warnings, and fixes.
 ---
 ## Corporate Proxy (Forcepoint CertChecker)
 This machine is behind an AT&T Forcepoint proxy that performs SSL deep packet inspection.
 ### Proxy Details
 | Setting   | Value                                         |
 | --------- | --------------------------------------------- |
 | Proxy URL | `http://cso.proxy.att.com:8080/`              |
 | Agent     | Forcepoint CertChecker                        |
 | Impact    | Intercepts HTTPS, replaces TLS certificates   |
 | Env vars  | `HTTP_PROXY`, `HTTPS_PROXY` set automatically |
 ### What Works Through Proxy
 | Tool                       | Status     | Notes                                 |
 | -------------------------- | ---------- | ------------------------------------- |
 | `ollama pull`              | ✅ Works   | Ollama handles proxy natively         |
 | `brew install`             | ✅ Works   | Homebrew handles proxy                |
 | `npm install`              | ✅ Works   | With `NODE_TLS_REJECT_UNAUTHORIZED=0` |
 | `curl` to Hugging Face     | ❌ Blocked | Returns 19 KB HTML redirect page      |
 | `curl -k` to Hugging Face  | ❌ Blocked | Still intercepted even with `-k`      |
 | `python requests` to HF    | ❌ Blocked | `CERTIFICATE_VERIFY_FAILED` error     |
 | `huggingface_hub` download | ❌ Blocked | Falls back to cached (broken) files   |
 ### Workaround: Download Off-Network
 For Hugging Face model downloads (e.g., Whisper GGML files):
 1. **Disconnect** from corporate VPN/Wi-Fi
 2. **Connect** to personal hotspot or home Wi-Fi
 3. Run the download:
   ```bash
   curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
     https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
   ```
 4. **Reconnect** to the corporate network. The downloaded model stays on disk, so this only has to be done once per model.
 ### Detecting a Proxy-Corrupted Download
 If a download completed but the file is suspiciously small:
 ```bash
 # Check file size (should be ~1.6 GB for large-v3-turbo)
 ls -lh ~/whisper-models/ggml-large-v3-turbo.bin
 # Check file type (should NOT be HTML)
 file ~/whisper-models/ggml-large-v3-turbo.bin
 # If it says "HTML document text" — delete and re-download off-network
 rm ~/whisper-models/ggml-large-v3-turbo.bin
 ```
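The same check can be scripted so a bad download is caught before whisper.cpp tries to load it. A minimal sketch that looks for HTML markers in the first bytes of the file instead of relying on `file`; the helper name `looks_like_html` and the 512-byte window are arbitrary choices for illustration:

```bash
# Return 0 if the start of the file looks like an HTML page, 1 otherwise.
looks_like_html() {
  head -c 512 "$1" | LC_ALL=C grep -qiE '<!doctype html|<html'
}

# Usage:
#   if looks_like_html ~/whisper-models/ggml-large-v3-turbo.bin; then
#     echo "Proxy page, not a model: delete and re-download off-network"
#   fi
```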
 ---
 ## Ollama Issues
 ### `MLX dynamic library not available`
 ```
 WARN MLX dynamic library not available error="failed to load MLX dynamic library"
 ```
 **Severity:** Harmless
 **Cause:** Ollama searches for Apple MLX framework but it's not installed
 **Impact:** None — falls back to Metal backend which is fully functional on M4 Pro
 **Fix:** None needed. Ignore the warning.
 ### Model Pull Fails (SSL / Proxy)
 ```bash
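 # Try bypassing proxy for Ollama registry. The Ollama server performs the
 # download, so set the variable where `ollama serve` starts, not on the client.
 NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
 # Then, in a second terminal:
 ollama pull llama3.1:8b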
 ```
 ### Ollama Not Responding
 ```bash
 # Check if running
 curl http://localhost:11434/api/tags
 # Restart
 brew services restart ollama
 # or
 pkill ollama && ollama serve
 ```
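After a restart the server takes a moment to start listening, so scripts that call the API right away can fail. A small sketch that polls the `/api/tags` endpoint used above until it answers; the function name, retry count, and delay are illustrative defaults, not Ollama settings:

```bash
# Poll the Ollama API until it responds or the retry budget runs out.
# Arguments: base URL, number of attempts, seconds between attempts.
wait_for_ollama() {
  url=${1:-http://localhost:11434}
  tries=${2:-10}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS --max-time 2 "$url/api/tags" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   brew services restart ollama && wait_for_ollama || echo "Ollama did not come up"
```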
 ### JSON Parse Errors in Evals
 The model wrapped its JSON in a markdown fence (` ```json ... ``` `), which breaks strict parsers. Setting `"format": "json"` in the Ollama API request helps; in the prompt itself, add:
 ```
 Return ONLY a valid JSON object — no markdown, no backticks, no explanation.
 ```
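Prompt instructions are not always obeyed, so it can help to strip a stray fence before parsing. A minimal sketch of that cleanup step; `strip_json_fences` is a hypothetical helper, and it assumes the fence markers sit on their own lines:

```bash
# Remove markdown fence lines from stdin and print the remainder.
# The fence is built from octal 140 (a backtick) so this snippet
# contains no literal fence of its own.
strip_json_fences() {
  fence=$(printf '\140\140\140')
  sed "/^[[:space:]]*${fence}/d"
}

# Usage (response.txt is a placeholder for saved model output):
#   strip_json_fences < response.txt > response.json
```

This only removes the fence lines; any explanatory prose the model adds around the JSON would still need separate handling.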
 ### Slow Inference
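 Check Activity Monitor or the PROCESSOR column of `ollama ps`: Ollama should be running on the GPU (Metal). If generation is slow, or a model is partly on CPU, restart with flash attention and a quantized KV cache to cut memory use: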
 ```bash
 # Restart with performance flags
 OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
 ```
 ### Model Won't Unload from RAM
 ```bash
 # Force unload via API
 curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": 0}'
 # Or restart Ollama entirely
 brew services restart ollama
 ```
 ### Disk Space Running Low
 ```bash
 # Check Ollama disk usage
 du -sh ~/.ollama/models/
 # List models with sizes
 ollama list
 # Remove models you don't need
 ollama rm <model-name>
 ```
 ---
 ## Whisper.cpp Issues
 ### `command not found: whisper-cpp`
 The binary is named `whisper-cli`, NOT `whisper-cpp`:
 ```bash
 # Wrong
 whisper-cpp --model ...
 # Correct
 whisper-cli --model ...
 ```
 Full list of binaries: `ls /opt/homebrew/bin/whisper-*`
 ### Audio Format Not Supported
 Whisper.cpp expects 16 kHz, 16-bit PCM WAV input. Convert first:
 ```bash
 # m4a → wav (16 kHz mono, 16-bit PCM)
 ffmpeg -i input.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav
 # mp3 → wav
 ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
 ```
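For a folder of recordings, the conversion can be wrapped in a loop. A sketch with two hypothetical helpers: `wav_path` derives the output name, and `convert_to_wav` runs ffmpeg at 16 kHz mono with the 16-bit PCM codec set explicitly:

```bash
# Derive the .wav output path by replacing the input's extension.
wav_path() {
  printf '%s.wav\n' "${1%.*}"
}

# Convert one file to 16 kHz mono 16-bit PCM WAV next to the original.
convert_to_wav() {
  ffmpeg -loglevel error -y -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_path "$1")"
}

# Usage:
#   for f in recordings/*.m4a; do convert_to_wav "$f"; done
```

The `-y` flag overwrites an existing output file without asking, which suits batch runs but should be dropped if earlier conversions must be preserved.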
 ### Model File Is HTML (Proxy-Corrupted)
 ```bash
 file ~/whisper-models/ggml-large-v3-turbo.bin
 # If output says "HTML document text" — it's corrupted
 rm ~/whisper-models/ggml-large-v3-turbo.bin
 # Re-download off corporate network
 ```
 ### `ffmpeg: command not found`
 ```bash
 brew install ffmpeg
 ```
 ---
 ## Dashboard Issues
 ### Port Conflict
 ```bash
 # Default port 3100, change if needed
 npm run dev -- -p 3101
 ```
 ### Lockfile Warning
 ```
 Warning: Next.js inferred your workspace root, but it may not be correct.
 We detected multiple lockfiles...
 ```
 This is harmless: the dashboard has its own `package-lock.json` inside the pnpm monorepo, so Next.js sees more than one lockfile. It can be silenced by setting `turbopack.root` in `next.config.ts`.
 ### API Routes Return Empty Data
 - **Ollama offline:** Start with `ollama serve` or `brew services start ollama`
 - **Whisper not installed:** Run `brew install whisper-cpp`
 - **No models:** Check `ollama list` and `ls ~/whisper-models/`
 ---
 ## General macOS Issues
 ### Accessibility / Permissions
 Some tools (e.g., `whisper-stream` for mic access) need explicit macOS permissions:
 System Settings → Privacy & Security → Microphone → enable Terminal / your IDE
 ### Node.js TLS Warning
 ```
 Warning: Setting NODE_TLS_REJECT_UNAUTHORIZED to '0' makes TLS connections insecure
 ```
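 This is set in the corporate environment so Node.js accepts the certificates re-signed by the Forcepoint proxy. It disables certificate verification for every Node.js TLS connection, so limit it to local development; pointing `NODE_EXTRA_CA_CERTS` at the corporate root CA file is the safer alternative where that certificate is available.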