docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2),
  audio pipeline architecture, video understanding status, Kimi alternatives,
  complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general,
  reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB,
  use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning,
  JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio
  format conversion, proxy-corrupted downloads detection
saravanakumardb1 2026-02-19 13:01:22 -08:00
parent 80f794dee7
commit 3561deee52
3 changed files with 498 additions and 0 deletions

@ -0,0 +1,171 @@
# 04 — Multimodal Local Stack
> Vision models, audio pipeline architecture, and video understanding status.
---
## Overview
A fully local multimodal AI stack combines three input pipelines (audio, image, video) with a text LLM and optional speech output:
```
Audio In → whisper.cpp (STT) → text
Image In → vision LLM (Ollama) → text description / analysis
Video In → frame extraction → vision LLM per frame → text
Text → LLM (Ollama) → response
Text Out → TTS (optional) → audio
```
---
## Vision Models (Image Understanding) ✅ Available
These run on Ollama and accept image input alongside text prompts.
### Available on Ollama
| Model | Size | RAM Needed | Capability |
| ----------------- | ------ | ----------------------- | ------------------------------- |
| `qwen2.5vl:72b` | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally |
| `llava:34b` | ~22 GB | ~22 GB | Strong image understanding, OCR |
| `llava:13b` | ~9 GB | ~9 GB | Good balance |
| `llava-llama3:8b` | ~6 GB | ~6 GB | LLaVA on Llama3 base |
| `minicpm-v:8b` | ~6 GB | ~6 GB | Strong vision + OCR |
| `qwen2.5vl:7b` | ~6 GB | ~6 GB | Qwen vision — very capable |
| `moondream` | ~2 GB | ~2 GB | Tiny, fast, basic vision (moondream2) |
### Recommended for M4 Pro 48 GB
- **Best quality:** `qwen2.5vl:72b` — possible but tight (~45 GB, leaves 3 GB)
- **Safe choice:** `llava:34b` — 22 GB, leaves 26 GB free for other tools
- **Fast:** `qwen2.5vl:7b` — 6 GB, very responsive
### Pull & Use
```bash
# Pull a vision model
ollama pull llava:34b
# Use with an image: put the file path in the prompt (there is no --images flag)
ollama run llava:34b "Describe this image: /path/to/image.png"
```
### Vision API
```bash
curl http://localhost:11434/api/generate -d '{
"model": "llava:34b",
"prompt": "What do you see in this image?",
"images": ["<base64-encoded-image>"]
}'
```
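
The `images` field takes the base64 text of the image file. The sketch below shows one way to build that request body in the shell. `vision_payload` is a hypothetical helper name, and it assumes the prompt contains no double quotes, backslashes, or newlines, because it does no JSON escaping. Setting `"stream": false` asks Ollama for a single JSON response instead of a stream of chunks.

```bash
# Build a /api/generate request body for a vision model.
# Usage: vision_payload MODEL PROMPT IMAGE_PATH
# Assumption: PROMPT needs no JSON escaping (no quotes, backslashes, newlines).
vision_payload() {
  local model="$1" prompt="$2" image="$3"
  local b64
  # Remove line wraps so the base64 text is one line on both macOS and GNU base64.
  b64=$(base64 < "$image" | tr -d '\n')
  printf '{"model": "%s", "prompt": "%s", "stream": false, "images": ["%s"]}\n' \
    "$model" "$prompt" "$b64"
}
```

For example, `vision_payload llava:34b "What do you see in this image?" photo.png | curl -s http://localhost:11434/api/generate -d @-` sends the request to a local server (`photo.png` is a placeholder).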
---
## Audio Pipeline ✅ Available (via Whisper.cpp)
No Ollama models handle audio input natively. Audio requires a separate pipeline:
```
Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
```
### Components
| Stage | Tool | Status |
| ------------------ | --------------------------------------------- | ---------------------------- |
| **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
| **LLM Processing** | Ollama (any model) | ✅ Ready |
| **Text-to-Speech** | kokoro / piper (local TTS) | Not installed |
### Real-Time Voice → LLM Pipeline
```bash
# whisper-talk-llama does this in one binary!
whisper-talk-llama \
--model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
--model-llama ~/.ollama/models/... \
--language en
```
> Note: `whisper-talk-llama` is installed but requires both Whisper + LLaMA GGUF models configured.
### Local TTS Options (Not Yet Installed)
| Tool | Quality | Speed | Install |
| ------------- | --------- | --------- | ------------------------ |
| **Kokoro** | Excellent | Fast | `pip install kokoro` |
| **Piper** | Good | Very fast | `brew install piper` |
| **espeak-ng** | Basic | Instant | `brew install espeak-ng` |
---
## Video Understanding ❌ Not Available Locally
**True video understanding is not yet available on local models.** Current state:
- No end-to-end video model runs on Ollama
- Workaround: Extract frames with ffmpeg → process each frame with a vision model
- Practical for screenshots/thumbnails, not for real video understanding
### Frame Extraction Workaround
```bash
# Extract 1 frame per second from video (the output directory must exist)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg
# Then process each frame with a vision model (the image path goes in the prompt)
for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening: $f"
done
```
### For Real Video Understanding
Use cloud APIs:
- **Google Gemini** — native video input support
- **OpenAI GPT-4o** — accepts video frames
- **Anthropic Claude** — accepts images (frame extraction needed)
---
## Kimi (Moonshot AI) — Cloud Only
Kimi models are **not available on Ollama** — they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.
### Closest Local Alternatives
| Looking For | Local Alternative | Why Similar |
| ------------------------ | ------------------- | ------------------------------ |
| Kimi k1.5 (reasoning) | `deepseek-r1:32b` | Strong reasoning, long context |
| Kimi long context (128k) | `qwen2.5:72b` | Same tier, 128k context |
| Kimi coding | `qwen2.5-coder:32b` | Best local coding model |
```bash
ollama pull deepseek-r1:32b # 20 GB, reasoning-focused
```
---
## Complete Local AI Stack Summary
```
┌─────────────────────────────────────────────────────┐
│ Local AI Stack (M4 Pro 48GB) │
├─────────────────────────────────────────────────────┤
│ │
│ Audio In ──→ whisper-cpp (STT) ──→ Text │
│ │ │
│ Image In ──→ llava:34b (Vision) ──→ Text │
│ │ │
│ Text ──────→ qwen2.5-coder:32b ──→ Response │
│ │ │
│ Response ──→ kokoro/piper (TTS) ──→ Audio Out │
│ │
│ Server: Ollama :11434 │
│ Dashboard: Mission Control :3100 │
│ Whisper Server: whisper-server :8080 (optional) │
│ │
└─────────────────────────────────────────────────────┘
```

@ -0,0 +1,106 @@
# 07 — Model Recommendations
> Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.
---
## Tier 1 — Best Overall Coding Models
| Model | Size | RAM Used | Pull Command | Notes |
| ----------------------- | ----- | -------- | ------------------------------- | -------------------------------------------------- |
| **`qwen2.5-coder:32b`** | 19 GB | ~22 GB | `ollama pull qwen2.5-coder:32b` | **Top pick** — rivals GPT-4o on code, 128k context |
| `deepseek-coder-v2:16b` | 10 GB | ~12 GB | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B |
| `codestral:22b` | 13 GB | ~15 GB | `ollama pull codestral` | Mistral's coding model, very fast completions |
## Tier 2 — Fast & Capable (Speed/Quality Balance)
| Model | Size | RAM Used | Pull Command | Notes |
| ---------------------- | ---- | -------- | --------------------------------- | --------------------------------------------- |
| **`qwen2.5-coder:7b`** | 5 GB | ~6 GB | `ollama pull qwen2.5-coder:7b` | Fast, surprisingly good for TS/Python/Swift |
| `deepseek-coder:6.7b` | 4 GB | ~5 GB | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding |
| `codegemma:7b` | 5 GB | ~6 GB | `ollama pull codegemma:7b` | Google's model, decent but outclassed by Qwen |
## Tier 3 — General Purpose (Also Good at Code)
| Model | Size | RAM Used | Pull Command | Notes |
| ------------------- | ------ | -------- | -------------------------- | ----------------------------------- |
| `llama3.1:70b` (Q4) | 40 GB | ~42 GB | `ollama pull llama3.1:70b` | Best general model — tight on 48 GB |
| `llama3.1:8b` | 4.9 GB | ~6 GB | `ollama pull llama3.1:8b` | Fast, good for evals |
| `mistral-nemo:12b` | 7 GB | ~9 GB | `ollama pull mistral-nemo` | Fast reasoning |
| `phi4:14b` | 9 GB | ~11 GB | `ollama pull phi4` | Strong reasoning, fits comfortably |
## Tier 4 — Reasoning & Deep Thinking
| Model | Size | RAM Used | Pull Command | Notes |
| --------------------- | ----- | -------- | ----------------------------- | ------------------------------------------------ |
| **`deepseek-r1:32b`** | 20 GB | ~22 GB | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning, closest to Kimi k1.5 |
| `deepseek-r1:7b` | 5 GB | ~6 GB | `ollama pull deepseek-r1:7b` | Lightweight reasoning |
## Tier 5 — Vision (Multimodal)
| Model | Size | RAM Used | Pull Command | Notes |
| -------------- | ----- | -------- | -------------------------- | ------------------------ |
| `llava:34b` | 22 GB | ~22 GB | `ollama pull llava:34b` | Image understanding, OCR |
| `qwen2.5vl:7b` | 6 GB | ~6 GB | `ollama pull qwen2.5vl:7b` | Qwen vision, fast |
| `minicpm-v:8b` | 6 GB | ~6 GB | `ollama pull minicpm-v:8b` | Strong OCR |
| `moondream` | 2 GB | ~2 GB | `ollama pull moondream` | Tiny, basic vision (moondream2) |
## Tier 6 — Embeddings
| Model | Size | RAM Used | Pull Command | Notes |
| ------------------- | ------ | -------- | ------------------------------- | ------------------------- |
| `nomic-embed-text` | 0.3 GB | ~0.5 GB | `ollama pull nomic-embed-text` | Good for semantic search |
| `mxbai-embed-large` | 0.7 GB | ~1 GB | `ollama pull mxbai-embed-large` | Higher quality embeddings |
---
## Recommended 10-Model Stack for M4 Pro 48 GB
For maximum coverage across all use cases:
| # | Model | Disk | Use Case |
| --- | ----------------------- | ----------- | ---------------------------------------- |
| 1 | `qwen2.5-coder:32b` | 19 GB | **Primary** — coding (TS, Python, Swift) |
| 2 | `qwen2.5-coder:7b` | 5 GB | Fast coding completions |
| 3 | `deepseek-coder-v2:16b` | 10 GB | Alternative coding model |
| 4 | `llama3.1:8b` | 4.9 GB | Eval default, general tasks |
| 5 | `deepseek-r1:32b` | 20 GB | Deep reasoning, complex triage |
| 6 | `codestral:22b` | 13 GB | Fast code completions (Mistral) |
| 7 | `phi4:14b` | 9 GB | Reasoning, structured output |
| 8 | `llava:34b` | 22 GB | Vision / image understanding |
| 9 | `mistral-nemo:12b` | 7 GB | Fast general purpose |
| 10 | `nomic-embed-text` | 0.3 GB | Embeddings / semantic search |
| | **Total** | **~110 GB** | |
Models are loaded into RAM on demand, not all at once, so all 10 can live on disk simultaneously. On 48 GB, expect to keep only one of the 19 GB+ models loaded at a time.
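
Pulling the whole stack can be scripted. The sketch below is one way to do it and is not part of the original setup: `pull_models` and `pull_stack` are hypothetical helper names, and the only command they rely on is `ollama pull`. A failed pull is recorded and the loop moves on, which helps when a connection drops partway through.

```bash
# Pull each model given as an argument; keep going on failure and report at the end.
pull_models() {
  local model failed=""
  for model in "$@"; do
    echo "Pulling $model"
    ollama pull "$model" || failed="$failed $model"
  done
  if [ -n "$failed" ]; then
    echo "Failed to pull:$failed" >&2
    return 1
  fi
}

# The ten models from the table above.
pull_stack() {
  pull_models \
    qwen2.5-coder:32b qwen2.5-coder:7b deepseek-coder-v2:16b llama3.1:8b \
    deepseek-r1:32b codestral:22b phi4:14b llava:34b mistral-nemo:12b \
    nomic-embed-text
}
```

Run `pull_stack` once, then confirm the result with `ollama list`.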
---
## By Use Case (Quick Reference)
| Use Case | Best Model | Fallback |
| -------------------------- | ------------------- | ----------------------- |
| **TypeScript/ESM coding** | `qwen2.5-coder:32b` | `qwen2.5-coder:7b` |
| **Python coding** | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` |
| **Swift/iOS coding** | `qwen2.5-coder:32b` | `codestral:22b` |
| **Extraction evals** | `llama3.1:8b` | `qwen2.5:7b` |
| **JSON structured output** | `qwen2.5:7b` | `qwen2.5-coder:7b` |
| **Complex reasoning** | `deepseek-r1:32b` | `phi4:14b` |
| **Image understanding** | `llava:34b` | `qwen2.5vl:7b` |
| **Embeddings** | `nomic-embed-text` | `mxbai-embed-large` |
| **Fast iteration** | `qwen2.5-coder:7b` | `llama3.1:8b` |
---
## Hardware Guide (General)
For reference if running on different hardware:
| RAM | Max Model Size | Recommendation |
| ------ | -------------- | ------------------------------------- |
| 8 GB | 7B | `qwen2.5-coder:7b` |
| 16 GB | 13-16B | `deepseek-coder-v2:16b` |
| 24 GB | 32B | `qwen2.5-coder:32b` |
| 32 GB | 32B + headroom | `qwen2.5-coder:32b` (comfortable) |
| 48 GB | 70B (Q4) | `llama3.1:70b` or any 32B comfortably |
| 64 GB+ | 70B (Q5/Q6) | Higher-precision 70B quantizations; Q8 (~75 GB) needs 96 GB+ |
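
The sizes in this guide follow a rough rule of thumb: weight size in GB is about parameters (in billions) times bits per weight, divided by 8. The sketch below applies it. `estimate_gb` is a hypothetical helper, and the figure covers weights only, so actual RAM use is higher once the KV cache and runtime overhead are added. As a reference point, the Q4_0 format stores about 4.5 bits per weight and Q8_0 about 8.5.

```bash
# Estimate weight size in GB from parameter count (billions) and bits per weight.
# Usage: estimate_gb PARAMS_BILLIONS BITS_PER_WEIGHT
estimate_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
```

`estimate_gb 70 4.5` prints `39.4`, in line with the ~40 GB listed for `llama3.1:70b`, while `estimate_gb 70 8.5` prints `74.4`, which is why an 8-bit 70B model does not fit in 64 GB.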

@ -0,0 +1,221 @@
# 08 — Troubleshooting & Corporate Proxy
> Common issues, Forcepoint proxy workarounds, MLX warnings, and fixes.
---
## Corporate Proxy (Forcepoint CertChecker)
This machine is behind an AT&T Forcepoint proxy that performs SSL deep packet inspection.
### Proxy Details
| Setting | Value |
| --------- | --------------------------------------------- |
| Proxy URL | `http://cso.proxy.att.com:8080/` |
| Agent | Forcepoint CertChecker |
| Impact | Intercepts HTTPS, replaces TLS certificates |
| Env vars | `HTTP_PROXY`, `HTTPS_PROXY` set automatically |
### What Works Through Proxy
| Tool | Status | Notes |
| -------------------------- | ---------- | ------------------------------------- |
| `ollama pull` | ✅ Works | Ollama handles proxy natively |
| `brew install` | ✅ Works | Homebrew handles proxy |
| `npm install` | ✅ Works | With `NODE_TLS_REJECT_UNAUTHORIZED=0` |
| `curl` to Hugging Face | ❌ Blocked | Returns 19 KB HTML redirect page |
| `curl -k` to Hugging Face | ❌ Blocked | Still intercepted even with `-k` |
| `python requests` to HF | ❌ Blocked | `SSL: CERTIFICATE_VERIFY_FAILED` |
| `huggingface_hub` download | ❌ Blocked | Falls back to cached (broken) files |
### Workaround: Download Off-Network
For Hugging Face model downloads (e.g., Whisper GGML files):
1. **Disconnect** from corporate VPN/Wi-Fi
2. **Connect** to personal hotspot or home Wi-Fi
3. Run the download:
```bash
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
```
4. **Reconnect** to the corporate network. The model file stays on disk, so no further download is needed.
### Detecting a Proxy-Corrupted Download
If a download completed but the file is suspiciously small:
```bash
# Check file size (should be ~1.6 GB for large-v3-turbo)
ls -lh ~/whisper-models/ggml-large-v3-turbo.bin
# Check file type (should NOT be HTML)
file ~/whisper-models/ggml-large-v3-turbo.bin
# If it says "HTML document text" — delete and re-download off-network
rm ~/whisper-models/ggml-large-v3-turbo.bin
```
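
The same check can be wrapped in a small function for use in scripts. This is a heuristic sketch, not a full validator: `looks_like_html` is a hypothetical name, and it only inspects the first 512 bytes for an HTML doctype or `<html` tag, which is what an intercepted download typically starts with.

```bash
# Return 0 if FILE looks like an HTML page rather than a model binary.
# Heuristic: search the first 512 bytes for an HTML doctype or <html tag.
looks_like_html() {
  head -c 512 "$1" | LC_ALL=C grep -qiE '<!doctype html|<html'
}
```

For example, `looks_like_html ~/whisper-models/ggml-large-v3-turbo.bin && echo "HTML page: re-download off-network"`.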
---
## Ollama Issues
### `MLX dynamic library not available`
```
WARN MLX dynamic library not available error="failed to load MLX dynamic library"
```
**Severity:** Harmless
**Cause:** Ollama searches for Apple MLX framework but it's not installed
**Impact:** None — falls back to Metal backend which is fully functional on M4 Pro
**Fix:** None needed. Ignore the warning.
### Model Pull Fails (SSL / Proxy)
```bash
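# The download is done by the Ollama server, not the `ollama pull` client,
# so set NO_PROXY on the server process, then pull from another terminal
NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
ollama pull llama3.1:8b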
```
### Ollama Not Responding
```bash
# Check if running
curl http://localhost:11434/api/tags
# Restart
brew services restart ollama
# or
pkill ollama && ollama serve
```
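
After a restart the server can take a moment to come up. The sketch below polls the same `/api/tags` endpoint until it answers. `wait_for_ollama` is a hypothetical helper, and the default of 10 attempts one second apart is an arbitrary choice.

```bash
# Poll the Ollama API until it answers, or give up after N attempts.
# Usage: wait_for_ollama [ATTEMPTS] [DELAY_SECONDS]
wait_for_ollama() {
  local attempts="${1:-10}" delay="${2:-1}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -s and -o /dev/null keep it quiet.
    if curl -sf -o /dev/null http://localhost:11434/api/tags; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Ollama did not respond after $attempts attempts" >&2
  return 1
}
```

A typical use is `brew services restart ollama && wait_for_ollama && ollama list`.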
### JSON Parse Errors in Evals
This usually means the model wrapped its JSON in a markdown fence (` ```json ... ``` `). Fix by adding to your prompt:
```
Return ONLY a valid JSON object — no markdown, no backticks, no explanation.
```
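
Models do not always obey that instruction, so it can help to strip fences before parsing as well. The sketch below assumes the fence markers sit on their own lines; `strip_json_fence` is a hypothetical helper name. When calling the HTTP API directly, the request field `"format": "json"` is another option.

```bash
# Read model output on stdin and delete any line that opens or closes a
# markdown fence, leaving the JSON body untouched.
strip_json_fence() {
  sed -e '/^[[:space:]]*```/d'
}
```

For example, `ollama run llama3.1:8b "$PROMPT" | strip_json_fence`, where `$PROMPT` stands for your eval prompt.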
### Slow Inference
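Check Activity Monitor or run `ollama ps`: the PROCESSOR column should show the model on the GPU (Metal). If part of it runs on CPU, the model likely does not fit in GPU memory, so try a smaller model. Enabling flash attention and a quantized KV cache also reduces memory use: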
```bash
# Restart with performance flags
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```
### Model Won't Unload from RAM
```bash
# Force unload via API
curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "prompt": "", "keep_alive": "0"}'
# Or restart Ollama entirely
brew services restart ollama
```
### Disk Space Running Low
```bash
# Check Ollama disk usage
du -sh ~/.ollama/models/
# List models with sizes
ollama list
# Remove models you don't need
ollama rm <model-name>
```
---
## Whisper.cpp Issues
### `command not found: whisper-cpp`
The binary is named `whisper-cli`, NOT `whisper-cpp`:
```bash
# Wrong
whisper-cpp --model ...
# Correct
whisper-cli --model ...
```
Full list of binaries: `ls /opt/homebrew/bin/whisper-*`
### Audio Format Not Supported
Whisper.cpp works most reliably with 16 kHz mono 16-bit WAV input. Convert other formats first:
```bash
# m4a → wav (16kHz mono)
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
# mp3 → wav
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```
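
For more than a couple of files, the conversion can be wrapped in a loop. This is a sketch under a few assumptions: `convert_to_wav` is a hypothetical helper, every input name has a file extension, and existing `.wav` outputs should be left alone. It spells out `-c:a pcm_s16le` so the output is 16-bit PCM.

```bash
# Convert each input file to 16 kHz mono 16-bit WAV next to the original.
# Usage: convert_to_wav FILE...
convert_to_wav() {
  local src out
  for src in "$@"; do
    out="${src%.*}.wav"   # swap the extension for .wav
    if [ -e "$out" ]; then
      echo "skip: $out already exists" >&2
      continue
    fi
    ffmpeg -nostdin -loglevel error -i "$src" \
      -ar 16000 -ac 1 -c:a pcm_s16le "$out" || return 1
  done
}
```

For example, `convert_to_wav recordings/*.m4a recordings/*.mp3` (the `recordings/` directory is a placeholder).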
### Model File Is HTML (Proxy-Corrupted)
```bash
file ~/whisper-models/ggml-large-v3-turbo.bin
# If output says "HTML document text" — it's corrupted
rm ~/whisper-models/ggml-large-v3-turbo.bin
# Re-download off corporate network
```
### `ffmpeg: command not found`
```bash
brew install ffmpeg
```
---
## Dashboard Issues
### Port Conflict
```bash
# Default port 3100, change if needed
npm run dev -- -p 3101
```
### Lockfile Warning
```
Warning: Next.js inferred your workspace root, but it may not be correct.
We detected multiple lockfiles...
```
This is harmless: the dashboard has its own `package-lock.json` inside the pnpm monorepo. It can be silenced by setting `turbopack.root` in `next.config.ts`.
### API Routes Return Empty Data
- **Ollama offline:** Start with `ollama serve` or `brew services start ollama`
- **Whisper not installed:** Run `brew install whisper-cpp`
- **No models:** Check `ollama list` and `ls ~/whisper-models/`
---
## General macOS Issues
### Accessibility / Permissions
Some tools (e.g., `whisper-stream` for mic access) need explicit macOS permissions:
System Settings → Privacy & Security → Microphone → enable Terminal / your IDE
### Node.js TLS Warning
```
Warning: Setting NODE_TLS_REJECT_UNAUTHORIZED to '0' makes TLS connections insecure
```
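This is set in the corporate environment so Node.js accepts the certificates that the Forcepoint proxy substitutes. It disables TLS certificate verification for every Node.js HTTPS connection, so keep it confined to local development. If the proxy's CA certificate is available, pointing `NODE_EXTRA_CA_CERTS` at it is a narrower alternative.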