learning_ai_common_plat/__LOCAL_LLMs/docs/04-multimodal-local-stack.md
saravanakumardb1 3561deee52 docs(local-llm): add multimodal stack, model recommendations, and troubleshooting
2026-02-19 13:01:22 -08:00

04 — Multimodal Local Stack

Vision models, audio pipeline architecture, and video understanding status.


Overview

A fully local multimodal AI stack combines three input pipelines (audio, image, video) with a text LLM and an optional text-to-speech output stage:

Audio In  →  whisper.cpp (STT)   →  text
Image In  →  vision LLM (Ollama) →  text description / analysis
Video In  →  frame extraction    →  vision LLM per frame → text
Text      →  LLM (Ollama)        →  response
Text Out  →  TTS (optional)      →  audio

Vision Models (Image Understanding): Available

These run on Ollama and accept image input alongside text prompts.

Available on Ollama

| Model | Size | RAM Needed | Capability |
|---|---|---|---|
| qwen2.5vl:72b | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally |
| llava:34b | ~22 GB | ~22 GB | Strong image understanding, OCR |
| llava:13b | ~9 GB | ~9 GB | Good balance |
| llava-llama3:8b | ~6 GB | ~6 GB | LLaVA on Llama3 base |
| minicpm-v:8b | ~6 GB | ~6 GB | Strong vision + OCR |
| qwen2.5vl:7b | ~6 GB | ~6 GB | Qwen vision, very capable |
| moondream2 | ~2 GB | ~2 GB | Tiny, fast, basic vision |

  • Best quality: qwen2.5vl:72b — possible but tight (~45 GB, leaves 3 GB)
  • Safe choice: llava:34b — 22 GB, leaves 26 GB free for other tools
  • Fast: qwen2.5vl:7b — 6 GB, very responsive
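
The headroom figures in these recommendations are plain subtraction from the 48 GB total. A minimal sketch of that arithmetic follows; the helper names and the 8 GB reserve for the OS and other applications are assumptions for illustration, and the model sizes are the approximate values from the table (context and KV-cache overhead are not included).

```shell
# headroom_gb MODEL_GB TOTAL_GB -> prints the memory left after loading the model
headroom_gb() {
  echo $(( $2 - $1 ))
}

# fits_with_reserve MODEL_GB TOTAL_GB RESERVE_GB -> succeeds if the model
# leaves at least RESERVE_GB free. The reserve is a hypothetical safety margin.
fits_with_reserve() {
  [ "$(headroom_gb "$1" "$2")" -ge "$3" ]
}

# Approximate sizes taken from the table above; 48 GB total, 8 GB assumed reserve.
while read -r model gb; do
  if fits_with_reserve "$gb" 48 8; then verdict="ok"; else verdict="tight"; fi
  echo "$model: $(headroom_gb "$gb" 48) GB free ($verdict)"
done <<EOF
qwen2.5vl:72b 45
llava:34b 22
qwen2.5vl:7b 6
EOF
```

With these inputs only qwen2.5vl:72b is flagged as tight, since it leaves 3 GB free against the assumed 8 GB reserve.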

Pull & Use

# Pull a vision model
ollama pull llava:34b

# Use with an image: include the file path in the prompt
# (use an absolute path or one starting with ./)
ollama run llava:34b "Describe this image: /path/to/image.png"

Vision API

curl http://localhost:11434/api/generate -d '{
  "model": "llava:34b",
  "prompt": "What do you see in this image?",
  "images": ["<base64-encoded-image>"]
}'
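
The `images` field expects the raw file contents as a single base64 string. A minimal sketch of building that request body from a file is shown here; `build_vision_payload` is a hypothetical helper name, and the sketch assumes the prompt contains no double quotes or backslashes, since it is inserted into the JSON verbatim. Setting `"stream": false` asks Ollama for one JSON response instead of a stream of chunks.

```shell
# build_vision_payload MODEL PROMPT IMAGE_FILE -> prints a JSON request body
# (hypothetical helper, not part of Ollama)
build_vision_payload() {
  model="$1"; prompt="$2"; image="$3"
  # Encode the image and strip line wraps (GNU base64 wraps at 76 columns).
  b64=$(base64 < "$image" | tr -d '\n')
  # NOTE: the prompt is not JSON-escaped; keep it free of quotes and backslashes.
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$model" "$prompt" "$b64"
}

# Usage (assumes Ollama is listening on localhost:11434):
#   build_vision_payload llava:34b "What do you see in this image?" image.png |
#     curl -s http://localhost:11434/api/generate -d @-
```

Piping the payload through `-d @-` avoids putting a multi-megabyte base64 string on the command line.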

Audio Pipeline: Available (via whisper.cpp)

No Ollama models handle audio input natively. Audio requires a separate pipeline:

Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS

Components

| Stage | Tool | Status |
|---|---|---|
| Speech-to-Text | whisper-cpp (whisper-cli, whisper-stream) | Installed, awaiting model |
| LLM Processing | Ollama (any model) | Ready |
| Text-to-Speech | kokoro / piper (local TTS) | Not installed |
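
For recorded audio, the first two stages can be chained by hand. The sketch below is one possible wiring, not a tested setup: the function names are hypothetical, the Whisper model path is an assumption (no model is downloaded yet), and ffmpeg is assumed to be installed. whisper.cpp expects 16 kHz mono 16-bit WAV input, so the file is converted first.

```shell
# Assumed model location; adjust to wherever the Whisper model is downloaded.
WHISPER_MODEL="${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}"

# wav_name_for INPUT -> path of the converted WAV, written next to the input
wav_name_for() {
  printf '%s.16k.wav\n' "$1"
}

# transcribe INPUT -> prints the transcript on stdout
transcribe() {
  wav=$(wav_name_for "$1")
  # Convert to 16 kHz, mono, 16-bit PCM, the format whisper.cpp reads.
  ffmpeg -loglevel error -y -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$wav" &&
    # -nt drops timestamps, -np suppresses everything except the transcript.
    whisper-cli -m "$WHISPER_MODEL" -f "$wav" -nt -np
}

# ask_about_audio INPUT MODEL -> sends the transcript to an Ollama model
ask_about_audio() {
  text=$(transcribe "$1") || return 1
  ollama run "$2" "Summarize this transcript: $text"
}

# Usage:
#   ask_about_audio meeting.m4a deepseek-r1:32b
```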

Real-Time Voice → LLM Pipeline

# whisper-talk-llama runs the whole voice loop in one binary
whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en

Note: whisper-talk-llama is installed, but it will not run until both a Whisper model and a LLaMA GGUF model are downloaded and passed to it.

Local TTS Options (Not Yet Installed)

| Tool | Quality | Speed | Install |
|---|---|---|---|
| Kokoro | Excellent | Fast | pip install kokoro |
| Piper | Good | Very fast | brew install piper |
| espeak-ng | Basic | Instant | brew install espeak-ng |

Video Understanding: Not Available Locally

True video understanding is not yet available on local models. Current state:

  • No end-to-end video model runs on Ollama
  • Workaround: Extract frames with ffmpeg → process each frame with a vision model
  • Practical for screenshots/thumbnails, not for real video understanding

Frame Extraction Workaround

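# Extract 1 frame per second from video (the output directory must exist)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg

# Then process each frame with a vision model; the image path goes in the prompt
# and needs a ./ or absolute prefix to be picked up
for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening in this image: $f"
done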
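
Per-frame descriptions are easier to use when each one is tied back to its position in the video. The sketch below assumes the `frame_%04d.jpg` naming from the ffmpeg command and a known extraction rate; `frame_time` and `describe_frames` are hypothetical helper names. ffmpeg numbers output frames from 1, so frame 1 maps to 00:00.

```shell
# frame_time FRAME_NUMBER FPS -> prints the mm:ss offset of that frame
frame_time() {
  # Drop zero padding so "0008" is not parsed as an invalid octal number.
  n=$(printf '%s' "$1" | sed 's/^0*//')
  secs=$(( (${n:-0} - 1) / $2 ))
  printf '%02d:%02d\n' $(( secs / 60 )) $(( secs % 60 ))
}

# describe_frames DIR MODEL FPS -> prints "[mm:ss] description" for each frame.
# DIR should be absolute or start with ./ so the image path is recognised.
describe_frames() {
  for f in "$1"/frame_*.jpg; do
    idx=${f##*_}      # "./frames/frame_0075.jpg" -> "0075.jpg"
    idx=${idx%.jpg}   # "0075.jpg" -> "0075"
    printf '[%s] ' "$(frame_time "$idx" "$3")"
    ollama run "$2" "Describe what's happening in this image: $f"
  done
}

# Usage:
#   describe_frames ./frames llava:34b 1 > timeline.txt
```

The resulting timeline is plain text, so it can be passed to any text model for a summary of the whole clip.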

For Real Video Understanding

Use cloud APIs:

  • Google Gemini — native video input support
  • OpenAI GPT-4o — accepts video frames
  • Anthropic Claude — accepts images (frame extraction needed)

Kimi (Moonshot AI) — Cloud Only

Kimi models are not available on Ollama — they're cloud-only via the Moonshot API (api.moonshot.cn). No official Ollama-compatible release exists.

Closest Local Alternatives

| Looking For | Local Alternative | Why Similar |
|---|---|---|
| Kimi k1.5 (reasoning) | deepseek-r1:32b | Strong reasoning, long context |
| Kimi long context (128k) | qwen2.5:72b | Same tier, 128k context |
| Kimi coding | qwen2.5-coder:32b | Best local coding model |

ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused

Complete Local AI Stack Summary

┌─────────────────────────────────────────────────────┐
│              Local AI Stack (M4 Pro 48GB)            │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Audio In ──→ whisper-cpp (STT) ──→ Text            │
│                                      │              │
│  Image In ──→ llava:34b (Vision) ──→ Text           │
│                                      │              │
│  Text ──────→ qwen2.5-coder:32b ──→ Response        │
│                                      │              │
│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
│                                                     │
│  Server: Ollama :11434                              │
│  Dashboard: Mission Control :3100                   │
│  Whisper Server: whisper-server :8080 (optional)    │
│                                                     │
└─────────────────────────────────────────────────────┘