From 3561deee52166658e0eb3ddd26241f6cc38dadd8 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Thu, 19 Feb 2026 13:01:22 -0800
Subject: [PATCH] docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2), audio pipeline architecture, video understanding status, Kimi alternatives, complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general, reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB, use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning, JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio format conversion, proxy-corrupted downloads detection
---
 .../docs/04-multimodal-local-stack.md         | 171 ++++++++++++++
 __LOCAL_LLMs/docs/07-model-recommendations.md | 106 +++++++++
 __LOCAL_LLMs/docs/08-troubleshooting.md       | 221 ++++++++++++++++++
 3 files changed, 498 insertions(+)
 create mode 100644 __LOCAL_LLMs/docs/04-multimodal-local-stack.md
 create mode 100644 __LOCAL_LLMs/docs/07-model-recommendations.md
 create mode 100644 __LOCAL_LLMs/docs/08-troubleshooting.md

diff --git a/__LOCAL_LLMs/docs/04-multimodal-local-stack.md b/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
new file mode 100644
index 00000000..c4d3f4d3
--- /dev/null
+++ b/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
@@ -0,0 +1,171 @@
+# 04 — Multimodal Local Stack
+
+> Vision models, audio pipeline architecture, and video understanding status.
+
+---
+
+## Overview
+
+A fully local multimodal AI stack combines the following pipelines:
+
+```
+Audio In → whisper.cpp (STT) → text
+Image In → vision LLM (Ollama) → text description / analysis
+Video In → frame extraction → vision LLM per frame → text
+Text → LLM (Ollama) → response
+Text Out → TTS (optional) → audio
+```
+
+---
+
+## Vision Models (Image Understanding) ✅ Available
+
+These run on Ollama and accept image input alongside text prompts.
+
+### Available on Ollama
+
+| Model             | Size   | RAM Needed              | Capability                      |
+| ----------------- | ------ | ----------------------- | ------------------------------- |
+| `qwen2.5vl:72b`   | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally             |
+| `llava:34b`       | ~22 GB | ~22 GB                  | Strong image understanding, OCR |
+| `llava:13b`       | ~9 GB  | ~9 GB                   | Good balance                    |
+| `llava-llama3:8b` | ~6 GB  | ~6 GB                   | LLaVA on Llama3 base            |
+| `minicpm-v:8b`    | ~6 GB  | ~6 GB                   | Strong vision + OCR             |
+| `qwen2.5vl:7b`    | ~6 GB  | ~6 GB                   | Qwen vision — very capable      |
+| `moondream`       | ~2 GB  | ~2 GB                   | Moondream 2: tiny, fast, basic  |
+
+### Recommended for M4 Pro 48 GB
+
+- **Best quality:** `qwen2.5vl:72b` — possible but tight (~45 GB, leaves 3 GB)
+- **Safe choice:** `llava:34b` — 22 GB, leaves 26 GB free for other tools
+- **Fast:** `qwen2.5vl:7b` — 6 GB, very responsive
+
+### Pull & Use
+
+```bash
+# Pull a vision model
+ollama pull llava:34b
+
+# Use with an image (the CLI reads the image from a file path inside the prompt)
+ollama run llava:34b "Describe this image: /path/to/image.png"
+```
+
+### Vision API
+
+```bash
+curl http://localhost:11434/api/generate -d '{
+  "model": "llava:34b",
+  "prompt": "What do you see in this image?",
+  "images": ["BASE64_ENCODED_IMAGE"]
+}'
+```
+
+---
+
+## Audio Pipeline ✅ Available (via Whisper.cpp)
+
+No Ollama models handle audio input natively. 
Audio requires a separate pipeline:
+
+```
+Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
+```
+
+### Components
+
+| Stage              | Tool                                          | Status                       |
+| ------------------ | --------------------------------------------- | ---------------------------- |
+| **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
+| **LLM Processing** | Ollama (any model)                            | ✅ Ready                     |
+| **Text-to-Speech** | kokoro / piper (local TTS)                    | Not installed                |
+
+### Real-Time Voice → LLM Pipeline
+
+```bash
+# whisper-talk-llama does this in one binary!
+whisper-talk-llama \
+  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
+  --model-llama ~/.ollama/models/... \
+  --language en
+```
+
+> Note: `whisper-talk-llama` is installed but requires both Whisper + LLaMA GGUF models configured.
+
+### Local TTS Options (Not Yet Installed)
+
+| Tool          | Quality   | Speed     | Install                  |
+| ------------- | --------- | --------- | ------------------------ |
+| **Kokoro**    | Excellent | Fast      | `pip install kokoro`     |
+| **Piper**     | Good      | Very fast | `brew install piper`     |
+| **espeak-ng** | Basic     | Instant   | `brew install espeak-ng` |
+
+---
+
+## Video Understanding ❌ Not Available Locally
+
+**True video understanding is not yet available on local models.** Current state:
+
+- No end-to-end video model runs on Ollama
+- Workaround: Extract frames with ffmpeg → process each frame with a vision model
+- Practical for screenshots/thumbnails, not for real video understanding
+
+### Frame Extraction Workaround
+
+```bash
+# Extract 1 frame per second from video (ffmpeg does not create the output directory)
+mkdir -p frames && ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg
+
+# Then process each frame with vision model
+for f in frames/*.jpg; do
+  ollama run llava:34b "Describe what's happening in this image: $f"
+done
+```
+
+### For Real Video Understanding
+
+Use cloud APIs:
+
+- **Google Gemini** — native video input support
+- **OpenAI GPT-4o** — accepts video frames
+- **Anthropic Claude** — accepts images (frame 
extraction needed)
+
+---
+
+## Kimi (Moonshot AI) — Cloud Only
+
+Kimi models are **not available on Ollama** — they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.
+
+### Closest Local Alternatives
+
+| Looking For              | Local Alternative   | Why Similar                    |
+| ------------------------ | ------------------- | ------------------------------ |
+| Kimi k1.5 (reasoning)    | `deepseek-r1:32b`   | Strong reasoning, long context |
+| Kimi long context (128k) | `qwen2.5:72b`       | Same tier, 128k context        |
+| Kimi coding              | `qwen2.5-coder:32b` | Best local coding model        |
+
+```bash
+ollama pull deepseek-r1:32b # 20 GB, reasoning-focused
+```
+
+---
+
+## Complete Local AI Stack Summary
+
+```
+┌─────────────────────────────────────────────────────┐
+│            Local AI Stack (M4 Pro 48GB)             │
+├─────────────────────────────────────────────────────┤
+│                                                     │
+│  Audio In ──→ whisper-cpp (STT)  ──→ Text           │
+│                                      │              │
+│  Image In ──→ llava:34b (Vision) ──→ Text           │
+│                                      │              │
+│  Text ──────→ qwen2.5-coder:32b  ──→ Response       │
+│                                      │              │
+│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
+│                                                     │
+│  Server: Ollama :11434                              │
+│  Dashboard: Mission Control :3100                   │
+│  Whisper Server: whisper-server :8080 (optional)    │
+│                                                     │
+└─────────────────────────────────────────────────────┘
+```
diff --git a/__LOCAL_LLMs/docs/07-model-recommendations.md b/__LOCAL_LLMs/docs/07-model-recommendations.md
new file mode 100644
index 00000000..c64ea1c7
--- /dev/null
+++ b/__LOCAL_LLMs/docs/07-model-recommendations.md
@@ -0,0 +1,106 @@
+# 07 — Model Recommendations
+
+> Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.
+
+---
+
+## Tier 1 — Best Overall Coding Models
+
+| Model                   | Size  | RAM Used | Pull Command                    | Notes                                              |
+| ----------------------- | ----- | -------- | ------------------------------- | -------------------------------------------------- |
+| **`qwen2.5-coder:32b`** | 19 GB | ~22 GB   | `ollama pull qwen2.5-coder:32b` | **Top pick** — rivals GPT-4o on code, 128k context |
+| `deepseek-coder-v2:16b` | 10 GB | ~12 GB   | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B               |
+| `codestral:22b`         | 13 GB | ~15 GB   | `ollama pull codestral`         | Mistral's coding model, very fast completions      |
+
+## Tier 2 — Fast & Capable (Speed/Quality Balance)
+
+| Model                  | Size | RAM Used | Pull Command                      | Notes                                         |
+| ---------------------- | ---- | -------- | --------------------------------- | --------------------------------------------- |
+| **`qwen2.5-coder:7b`** | 5 GB | ~6 GB    | `ollama pull qwen2.5-coder:7b`    | Fast, surprisingly good for TS/Python/Swift   |
+| `deepseek-coder:6.7b`  | 4 GB | ~5 GB    | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding            |
+| `codegemma:7b`         | 5 GB | ~6 GB    | `ollama pull codegemma:7b`        | Google's model, decent but outclassed by Qwen |
+
+## Tier 3 — General Purpose (Also Good at Code)
+
+| Model               | Size   | RAM Used | Pull Command               | Notes                               |
+| ------------------- | ------ | -------- | -------------------------- | ----------------------------------- |
+| `llama3.1:70b` (Q4) | 40 GB  | ~42 GB   | `ollama pull llama3.1:70b` | Best general model — tight on 48 GB |
+| `llama3.1:8b`       | 4.9 GB | ~6 GB    | `ollama pull llama3.1:8b`  | Fast, good for evals                |
+| `mistral-nemo:12b`  | 7 GB   | ~9 GB    | `ollama pull mistral-nemo` | Fast reasoning                      |
+| `phi4:14b`          | 9 GB   | ~11 GB   | `ollama pull phi4`         | Strong reasoning, fits comfortably  |
+
+## Tier 4 — Reasoning & Deep Thinking
+
+| Model                 | Size  | RAM Used | Pull Command                  | Notes                                            |
+| --------------------- | ----- | -------- | ----------------------------- | ------------------------------------------------ 
|
+| **`deepseek-r1:32b`** | 20 GB | ~22 GB   | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning, closest to Kimi k1.5 |
+| `deepseek-r1:7b`      | 5 GB  | ~6 GB    | `ollama pull deepseek-r1:7b`  | Lightweight reasoning                            |
+
+## Tier 5 — Vision (Multimodal)
+
+| Model          | Size  | RAM Used | Pull Command               | Notes                    |
+| -------------- | ----- | -------- | -------------------------- | ------------------------ |
+| `llava:34b`    | 22 GB | ~22 GB   | `ollama pull llava:34b`    | Image understanding, OCR |
+| `qwen2.5vl:7b` | 6 GB  | ~6 GB    | `ollama pull qwen2.5vl:7b` | Qwen vision, fast        |
+| `minicpm-v:8b` | 6 GB  | ~6 GB    | `ollama pull minicpm-v:8b` | Strong OCR               |
+| `moondream`    | 2 GB  | ~2 GB    | `ollama pull moondream`    | Moondream 2, tiny, basic |
+
+## Tier 6 — Embeddings
+
+| Model               | Size   | RAM Used | Pull Command                    | Notes                     |
+| ------------------- | ------ | -------- | ------------------------------- | ------------------------- |
+| `nomic-embed-text`  | 0.3 GB | ~0.5 GB  | `ollama pull nomic-embed-text`  | Good for semantic search  |
+| `mxbai-embed-large` | 0.7 GB | ~1 GB    | `ollama pull mxbai-embed-large` | Higher quality embeddings |
+
+---
+
+## Recommended 10-Model Stack for M4 Pro 48 GB
+
+For maximum coverage across all use cases:
+
+| #   | Model                   | Disk        | Use Case                                 |
+| --- | ----------------------- | ----------- | ---------------------------------------- |
+| 1   | `qwen2.5-coder:32b`     | 19 GB       | **Primary** — coding (TS, Python, Swift) |
+| 2   | `qwen2.5-coder:7b`      | 5 GB        | Fast coding completions                  |
+| 3   | `deepseek-coder-v2:16b` | 10 GB       | Alternative coding model                 |
+| 4   | `llama3.1:8b`           | 4.9 GB      | Eval default, general tasks              |
+| 5   | `deepseek-r1:32b`       | 20 GB       | Deep reasoning, complex triage           |
+| 6   | `codestral:22b`         | 13 GB       | Fast code completions (Mistral)          |
+| 7   | `phi4:14b`              | 9 GB        | Reasoning, structured output             |
+| 8   | `llava:34b`             | 22 GB       | Vision / image understanding             |
+| 9   | `mistral-nemo:12b`      | 7 GB        | Fast general purpose                     |
+| 10  | `nomic-embed-text`      | 0.3 GB      | Embeddings / semantic 
search |
+|     | **Total**               | **~110 GB** |                                          |
+
+Models load into RAM on demand rather than all at once, so all 10 can sit on disk simultaneously. On 48 GB, plan on one large (32B-class) model being resident at a time.
+
+---
+
+## By Use Case (Quick Reference)
+
+| Use Case                   | Best Model          | Fallback                |
+| -------------------------- | ------------------- | ----------------------- |
+| **TypeScript/ESM coding**  | `qwen2.5-coder:32b` | `qwen2.5-coder:7b`      |
+| **Python coding**          | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` |
+| **Swift/iOS coding**       | `qwen2.5-coder:32b` | `codestral:22b`         |
+| **Extraction evals**       | `llama3.1:8b`       | `qwen2.5:7b`            |
+| **JSON structured output** | `qwen2.5:7b`        | `qwen2.5-coder:7b`      |
+| **Complex reasoning**      | `deepseek-r1:32b`   | `phi4:14b`              |
+| **Image understanding**    | `llava:34b`         | `qwen2.5vl:7b`          |
+| **Embeddings**             | `nomic-embed-text`  | `mxbai-embed-large`     |
+| **Fast iteration**         | `qwen2.5-coder:7b`  | `llama3.1:8b`           |
+
+---
+
+## Hardware Guide (General)
+
+For reference if running on different hardware:
+
+| RAM    | Max Model Size | Recommendation                        |
+| ------ | -------------- | ------------------------------------- |
+| 8 GB   | 7B             | `qwen2.5-coder:7b`                    |
+| 16 GB  | 13-16B         | `deepseek-coder-v2:16b`               |
+| 24 GB  | 32B            | `qwen2.5-coder:32b`                   |
+| 32 GB  | 32B + headroom | `qwen2.5-coder:32b` (comfortable)     |
+| 48 GB  | 70B (Q4)       | `llama3.1:70b` or any 32B comfortably |
+| 64 GB+ | 70B (Q4-Q5)    | 70B with headroom (Q8 needs ~75 GB)   |
diff --git a/__LOCAL_LLMs/docs/08-troubleshooting.md b/__LOCAL_LLMs/docs/08-troubleshooting.md
new file mode 100644
index 00000000..b70bdade
--- /dev/null
+++ b/__LOCAL_LLMs/docs/08-troubleshooting.md
@@ -0,0 +1,221 @@
+# 08 — Troubleshooting & Corporate Proxy
+
+> Common issues, Forcepoint proxy workarounds, MLX warnings, and fixes.
+
+---
+
+## Corporate Proxy (Forcepoint CertChecker)
+
+This machine is behind an AT&T Forcepoint proxy that performs SSL deep packet inspection.
+
+### Proxy Details
+
+| Setting   | Value                                         |
+| --------- | --------------------------------------------- |
+| Proxy URL | `http://cso.proxy.att.com:8080/`              |
+| Agent     | Forcepoint CertChecker                        |
+| Impact    | Intercepts HTTPS, replaces TLS certificates   |
+| Env vars  | `HTTP_PROXY`, `HTTPS_PROXY` set automatically |
+
+### What Works Through Proxy
+
+| Tool                       | Status     | Notes                                 |
+| -------------------------- | ---------- | ------------------------------------- |
+| `ollama pull`              | ✅ Works   | Ollama handles proxy natively         |
+| `brew install`             | ✅ Works   | Homebrew handles proxy                |
+| `npm install`              | ✅ Works   | With `NODE_TLS_REJECT_UNAUTHORIZED=0` |
+| `curl` to Hugging Face     | ❌ Blocked | Returns 19 KB HTML redirect page      |
+| `curl -k` to Hugging Face  | ❌ Blocked | Still intercepted even with `-k`      |
+| `python requests` to HF    | ❌ Blocked | `CERTIFICATE_VERIFY_FAILED` SSL error |
+| `huggingface_hub` download | ❌ Blocked | Falls back to cached (broken) files   |
+
+### Workaround: Download Off-Network
+
+For Hugging Face model downloads (e.g., Whisper GGML files):
+
+1. **Disconnect** from corporate VPN/Wi-Fi
+2. **Connect** to personal hotspot or home Wi-Fi
+3. Run the download:
+   ```bash
+   curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
+     https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
+   ```
+4. **Reconnect** to the corporate network. The model file stays on disk, so no further download is needed.
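
The download step above can be guarded so that a proxy-substituted HTML page is caught before it is used as a model. The following is a minimal sketch, not part of the original setup: the function name `verify_model_file` and the 100 MB default size threshold are assumptions chosen for illustration.

```shell
# Hypothetical helper (an illustration, not an existing tool): returns 0 when
# the file looks like a real model binary, 1 when it looks proxy-corrupted.
verify_model_file() {
  local path="$1"
  local min_bytes="${2:-100000000}"  # assumed threshold: 100 MB

  # The file must exist.
  [ -f "$path" ] || { echo "missing: $path" >&2; return 1; }

  # A proxy redirect page is tiny (about 19 KB here); a model file is not.
  local size
  size=$(wc -c < "$path" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small: $size bytes" >&2
    return 1
  fi

  # Redirect pages begin with HTML markup; model binaries do not.
  if head -c 512 "$path" | LC_ALL=C grep -qiE '<!doctype html|<html'; then
    echo "looks like HTML: $path" >&2
    return 1
  fi
  return 0
}
```

One possible use is to run `verify_model_file ~/whisper-models/ggml-large-v3-turbo.bin` right after the `curl` command and delete the file when the check fails.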
+
+### Detecting a Proxy-Corrupted Download
+
+If a download completed but the file is suspiciously small:
+
+```bash
+# Check file size (should be ~1.6 GB for large-v3-turbo)
+ls -lh ~/whisper-models/ggml-large-v3-turbo.bin
+
+# Check file type (should NOT be HTML)
+file ~/whisper-models/ggml-large-v3-turbo.bin
+
+# If it says "HTML document text" — delete and re-download off-network
+rm ~/whisper-models/ggml-large-v3-turbo.bin
+```
+
+---
+
+## Ollama Issues
+
+### `MLX dynamic library not available`
+
+```
+WARN MLX dynamic library not available error="failed to load MLX dynamic library"
+```
+
+**Severity:** Harmless
+**Cause:** Ollama searches for Apple MLX framework but it's not installed
+**Impact:** None — falls back to Metal backend which is fully functional on M4 Pro
+**Fix:** None needed. Ignore the warning.
+
+### Model Pull Fails (SSL / Proxy)
+
+```bash
+# The Ollama server performs the download: stop any running instance, start it with NO_PROXY set, then run `ollama pull llama3.1:8b` in another terminal
+NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
+```
+
+### Ollama Not Responding
+
+```bash
+# Check if running
+curl http://localhost:11434/api/tags
+
+# Restart
+brew services restart ollama
+# or
+pkill ollama && ollama serve
+```
+
+### JSON Parse Errors in Evals
+
+Model returned markdown-wrapped JSON (` ```json ... ``` `). Fix by adding to your prompt:
+
+```
+Return ONLY a valid JSON object — no markdown, no backticks, no explanation.
+```
+
+### Slow Inference
+
+Check Activity Monitor — Ollama should be using GPU (Metal). 
If CPU-only:
+
+```bash
+# Restart with flash attention and a quantized KV cache (lowers memory use so more of the model fits on the GPU)
+OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+```
+
+### Model Won't Unload from RAM
+
+```bash
+# Force unload via API
+curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": 0}'
+
+# Or restart Ollama entirely
+brew services restart ollama
+```
+
+### Disk Space Running Low
+
+```bash
+# Check Ollama disk usage
+du -sh ~/.ollama/models/
+
+# List models with sizes
+ollama list
+
+# Remove models you don't need
+ollama rm MODEL_NAME
+```
+
+---
+
+## Whisper.cpp Issues
+
+### `command not found: whisper-cpp`
+
+The binary is named `whisper-cli`, NOT `whisper-cpp`:
+
+```bash
+# Wrong
+whisper-cpp --model ...
+
+# Correct
+whisper-cli --model ...
+```
+
+Full list of binaries: `ls /opt/homebrew/bin/whisper-*`
+
+### Audio Format Not Supported
+
+Whisper.cpp requires WAV format. Convert first:
+
+```bash
+# m4a → wav (16kHz mono, 16-bit PCM)
+ffmpeg -i input.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav
+
+# mp3 → wav
+ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
+```
+
+### Model File Is HTML (Proxy-Corrupted)
+
+```bash
+file ~/whisper-models/ggml-large-v3-turbo.bin
+# If output says "HTML document text" — it's corrupted
+rm ~/whisper-models/ggml-large-v3-turbo.bin
+# Re-download off corporate network
+```
+
+### `ffmpeg: command not found`
+
+```bash
+brew install ffmpeg
+```
+
+---
+
+## Dashboard Issues
+
+### Port Conflict
+
+```bash
+# Default port 3100, change if needed
+npm run dev -- -p 3101
+```
+
+### Lockfile Warning
+
+```
+Warning: Next.js inferred your workspace root, but it may not be correct.
+We detected multiple lockfiles...
+```
+
+This is harmless — the dashboard has its own `package-lock.json` inside the pnpm monorepo. Can be silenced by adding `turbopack.root` to `next.config.ts`.
+
+### API Routes Return Empty Data
+
+- **Ollama offline:** Start with `ollama serve` or `brew services start ollama`
+- **Whisper not installed:** Run `brew install whisper-cpp`
+- **No models:** Check `ollama list` and `ls ~/whisper-models/`
+
+---
+
+## General macOS Issues
+
+### Accessibility / Permissions
+
+Some tools (e.g., `whisper-stream` for mic access) need explicit macOS permissions:
+
+System Settings → Privacy & Security → Microphone → enable Terminal / your IDE
+
+### Node.js TLS Warning
+
+```
+Warning: Setting NODE_TLS_REJECT_UNAUTHORIZED to '0' makes TLS connections insecure
+```
+
+The corporate environment sets this so that Node.js accepts the certificates re-signed by the Forcepoint proxy. The warning is expected during local dev, but the setting turns off certificate verification for every Node.js connection, so keep it out of anything deployed.
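
A narrower alternative to disabling verification is to have Node.js trust one additional CA through its `NODE_EXTRA_CA_CERTS` environment variable. The sketch below assumes the Forcepoint root certificate has already been exported to a PEM file, which the original setup does not describe; the function name `use_corp_ca` is hypothetical.

```shell
# Hypothetical helper: trust a corporate root CA instead of disabling TLS checks.
use_corp_ca() {
  local pem="$1"
  # Only accept a file that actually contains a PEM certificate.
  if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$pem" 2>/dev/null; then
    echo "not a PEM certificate: $pem" >&2
    return 1
  fi
  export NODE_EXTRA_CA_CERTS="$pem"   # read by Node.js at process startup
  unset NODE_TLS_REJECT_UNAUTHORIZED  # restore normal certificate verification
}
```

Calling `use_corp_ca /path/to/exported-root.pem` in a shell profile before starting the dashboard would apply it to that session; the path shown is a placeholder.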