saravanakumardb1 3561deee52 docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2),
  audio pipeline architecture, video understanding status, Kimi alternatives,
  complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general,
  reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB,
  use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning,
  JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio
  format conversion, proxy-corrupted downloads detection

2026-02-19 13:01:22 -08:00

6.4 KiB


04 — Multimodal Local Stack

Vision models, audio pipeline architecture, and video understanding status.

Overview

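A fully local multimodal AI stack needs three input pipelines (audio, image, video) that each reduce their input to text, plus a text LLM and an optional TTS stage: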

Audio In  →  whisper.cpp (STT)   →  text
Image In  →  vision LLM (Ollama) →  text description / analysis
Video In  →  frame extraction    →  vision LLM per frame → text
Text      →  LLM (Ollama)        →  response
Text Out  →  TTS (optional)      →  audio
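The routing implied by the diagram above can be sketched as a small dispatcher that maps a file to the pipeline that should handle it. This is a minimal sketch: `route_input` is a hypothetical helper name, and the extension lists are illustrative assumptions rather than a complete set.

```shell
# Hypothetical helper: decide which pipeline a file belongs to.
# The extension lists below are assumptions for illustration only.
route_input() {
  # Lowercase the extension so "TALK.WAV" and "talk.wav" route the same way.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    wav|mp3|m4a|flac)  echo "audio" ;;   # goes to whisper.cpp (STT)
    png|jpg|jpeg|webp) echo "image" ;;   # goes to a vision LLM
    mp4|mov|mkv|webm)  echo "video" ;;   # goes to frame extraction first
    *)                 echo "text"  ;;   # goes straight to the text LLM
  esac
}
```

For example, `route_input clip.mp4` selects the video path, while anything unrecognized falls through to the text LLM.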

Vision Models (Image Understanding) ✅ Available

These run on Ollama and accept image input alongside text prompts.

Available on Ollama

Model	Size	RAM Needed	Capability
`qwen2.5vl:72b`	~45 GB	~45 GB ⚠️ tight on 48GB	Best vision locally
`llava:34b`	~22 GB	~22 GB	Strong image understanding, OCR
`llava:13b`	~9 GB	~9 GB	Good balance
`llava-llama3:8b`	~6 GB	~6 GB	LLaVA on Llama3 base
`minicpm-v:8b`	~6 GB	~6 GB	Strong vision + OCR
`qwen2.5vl:7b`	~6 GB	~6 GB	Qwen vision — very capable
`moondream` (Moondream 2)	~2 GB	~2 GB	Tiny, fast, basic vision

Recommended for M4 Pro 48 GB

Best quality: qwen2.5vl:72b — possible but tight (~45 GB, leaves 3 GB)
Safe choice: llava:34b — 22 GB, leaves 26 GB free for other tools
Fast: qwen2.5vl:7b — 6 GB, very responsive
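The headroom figures above are simple subtraction from the 48 GB total. A minimal sketch of that arithmetic follows, assuming the listed model size is the whole footprint (context cache and other overhead are ignored). `headroom_gb`, `fits_with_reserve`, and the reserve value are hypothetical names and numbers chosen for illustration.

```shell
# Total unified memory on the machine described in this guide.
TOTAL_GB=48

# headroom_gb MODEL_GB -> GB left over after loading the model
headroom_gb() {
  echo $(( TOTAL_GB - $1 ))
}

# fits_with_reserve MODEL_GB RESERVE_GB -> succeeds if RESERVE_GB stays free.
# RESERVE_GB is an assumed budget for the OS, context cache, and other tools.
fits_with_reserve() {
  [ "$(headroom_gb "$1")" -ge "$2" ]
}

headroom_gb 45   # qwen2.5vl:72b
headroom_gb 22   # llava:34b
headroom_gb 6    # qwen2.5vl:7b
```

With an assumed 16 GB reserve, `fits_with_reserve 22 16` succeeds for llava:34b and `fits_with_reserve 45 16` fails for qwen2.5vl:72b, which matches the "safe" versus "tight" labels above.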

Pull & Use

# Pull a vision model
ollama pull llava:34b

# Use with an image: include the file path in the prompt (ollama run has no --images flag)
ollama run llava:34b "Describe this image: /path/to/image.png"

Vision API

curl http://localhost:11434/api/generate -d '{
  "model": "llava:34b",
  "prompt": "What do you see in this image?",
  "images": ["<base64-encoded-image>"]
}'
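The `<base64-encoded-image>` placeholder in the request above has to be filled with the image bytes encoded as a single unwrapped base64 string. A minimal sketch of building that payload is shown here. `build_vision_payload` is a hypothetical helper, and it assumes the prompt contains no double quotes or backslashes, since it does no JSON escaping. Setting `"stream": false` asks Ollama for one JSON object instead of a stream of chunks.

```shell
# build_vision_payload MODEL PROMPT IMAGE_PATH -> prints the JSON request body
build_vision_payload() {
  # Some base64 implementations wrap long output; strip newlines so the JSON stays valid.
  img_b64=$(base64 < "$3" | tr -d '\n')
  # Assumption: PROMPT needs no JSON escaping (no quotes or backslashes).
  printf '{"model":"%s","prompt":"%s","stream":false,"images":["%s"]}' \
    "$1" "$2" "$img_b64"
}

# Usage (needs a running Ollama server, so it is left commented out):
#   build_vision_payload llava:34b "What do you see in this image?" /path/to/image.png \
#     | curl -s http://localhost:11434/api/generate -d @-
```

Piping the body through `-d @-` keeps a large base64 string off the command line.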

Audio Pipeline ✅ Available (via whisper.cpp)

No Ollama models handle audio input natively. Audio requires a separate pipeline:

Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS

Components

Stage	Tool	Status
Speech-to-Text	whisper-cpp (`whisper-cli`, `whisper-stream`)	✅ Installed, awaiting model
LLM Processing	Ollama (any model)	✅ Ready
Text-to-Speech	kokoro / piper (local TTS)	Not installed
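The first two stages in the table can be chained by hand for a recorded file. The sketch below assumes `whisper-cli` accepts `-m`, `-f`, `-nt` (no timestamps), and `-np` (no progress output), and that the input is already a WAV file. `transcribe_and_ask` and the two `*_BIN` variables are hypothetical names introduced here so the binaries can be swapped out.

```shell
# Binary names are overridable; the defaults match the tools named in the table.
WHISPER_BIN=${WHISPER_BIN:-whisper-cli}
OLLAMA_BIN=${OLLAMA_BIN:-ollama}

# transcribe_and_ask WHISPER_MODEL WAV_FILE OLLAMA_MODEL QUESTION
transcribe_and_ask() {
  # Stage 1 (STT): print only the transcript on stdout; stop if transcription fails.
  transcript=$("$WHISPER_BIN" -m "$1" -f "$2" -nt -np) || return 1
  # Stage 2 (LLM): send the question and the transcript as a single prompt.
  "$OLLAMA_BIN" run "$3" "$4 Transcript: $transcript"
}

# Usage (needs a downloaded Whisper model and a pulled Ollama model):
#   ffmpeg -i input.m4a -ar 16000 -ac 1 -c:a pcm_s16le input.wav
#   transcribe_and_ask ~/whisper-models/ggml-large-v3-turbo.bin input.wav \
#     qwen2.5-coder:32b "Summarize this recording."
```

The ffmpeg line in the usage comment converts other formats to 16 kHz mono WAV, the input format whisper.cpp documents.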

Real-Time Voice → LLM Pipeline

# whisper-talk-llama runs STT and the LLM in a single binary
whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en

Note: whisper-talk-llama is installed, but it will not run until both a Whisper model and a LLaMA-family GGUF model file are configured.

Local TTS Options (Not Yet Installed)

Tool	Quality	Speed	Install
Kokoro	Excellent	Fast	`pip install kokoro`
Piper	Good	Very fast	`brew install piper`
espeak-ng	Basic	Instant	`brew install espeak-ng`

Video Understanding ❌ Not Available Locally

True video understanding is not yet available on local models. Current state:

No end-to-end video model runs on Ollama
Workaround: Extract frames with ffmpeg → process each frame with a vision model
Practical for screenshots and thumbnails, but per-frame descriptions carry no motion or temporal context, so this is not real video understanding

Frame Extraction Workaround

# Extract 1 frame per second from video (ffmpeg does not create the output directory)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg

# Then process each frame with a vision model (the image path goes in the prompt)
for f in frames/*.jpg; do
  ollama run llava:34b "Describe what's happening in this image: $f"
done
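At 1 frame per second, a long video yields hundreds of frames, and each one costs a full vision-model call. One way to bound that cost is to lower the sampling rate so the total stays under a cap. This is a sketch under assumptions: `sampling_fps` is a hypothetical helper, and the cap of 60 frames in the usage comment is an arbitrary example value.

```shell
# sampling_fps DURATION_SECONDS MAX_FRAMES -> fps value for ffmpeg's fps filter
sampling_fps() {
  awk -v d="$1" -v m="$2" 'BEGIN {
    if (d <= 0) { print "1.0000"; exit }   # unknown duration: keep the 1 fps baseline
    fps = m / d                            # rate that yields about MAX_FRAMES frames
    if (fps > 1) fps = 1                   # never sample faster than 1 fps
    printf "%.4f\n", fps
  }'
}

# Usage (requires ffmpeg and ffprobe, so it is left commented out):
#   dur=$(ffprobe -v error -show_entries format=duration \
#         -of default=noprint_wrappers=1:nokey=1 video.mp4)
#   mkdir -p frames
#   ffmpeg -i video.mp4 -vf "fps=$(sampling_fps "$dur" 60)" frames/frame_%04d.jpg
```

A 10-minute video capped at 60 frames is sampled at 0.1 fps, one frame every 10 seconds, while clips shorter than the cap keep the 1 fps rate.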

For Real Video Understanding

Use cloud APIs:

Google Gemini — native video input support
OpenAI GPT-4o — accepts video frames
Anthropic Claude — accepts images (frame extraction needed)

Kimi (Moonshot AI) — Cloud Only

Kimi models are not available on Ollama — they're cloud-only via the Moonshot API (api.moonshot.cn). No official Ollama-compatible release exists.

Closest Local Alternatives

Looking For	Local Alternative	Why Similar
Kimi k1.5 (reasoning)	`deepseek-r1:32b`	Strong reasoning, long context
Kimi long context (128k)	`qwen2.5:72b`	Same tier, 128k context
Kimi coding	`qwen2.5-coder:32b`	Best local coding model

ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused

Complete Local AI Stack Summary

┌─────────────────────────────────────────────────────┐
│             Local AI Stack (M4 Pro 48GB)            │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Audio In ──→ whisper-cpp (STT) ──→ Text            │
│                                      │              │
│  Image In ──→ llava:34b (Vision) ──→ Text           │
│                                      │              │
│  Text ──────→ qwen2.5-coder:32b ──→ Response        │
│                                      │              │
│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
│                                                     │
│  Server: Ollama :11434                              │
│  Dashboard: Mission Control :3100                   │
│  Whisper Server: whisper-server :8080 (optional)    │
│                                                     │
└─────────────────────────────────────────────────────┘
