04 — Multimodal Local Stack
Vision models, audio pipeline architecture, and video understanding status.
Overview
A fully local multimodal AI stack needs three input pipelines (audio, image, video) that all converge on a text LLM, plus an optional speech output stage:
Audio In → whisper.cpp (STT) → text
Image In → vision LLM (Ollama) → text description / analysis
Video In → frame extraction → vision LLM per frame → text
Text → LLM (Ollama) → response
Text Out → TTS (optional) → audio
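The routing above can be sketched as a small dispatch on file extension. This is only an illustration, not part of any tool in the stack: `pipeline_for` is a hypothetical helper name, and the extension lists are assumptions rather than an exhaustive mapping.

```shell
# Hypothetical helper: report which local pipeline would handle a file.
# The extension lists below are illustrative assumptions.
pipeline_for() {
  local ext
  # Lowercase the extension so "PHOTO.JPG" and "photo.jpg" match alike.
  ext="$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    wav|mp3|m4a|flac)  echo "audio: whisper.cpp (STT) -> text" ;;
    png|jpg|jpeg|webp) echo "image: vision LLM (Ollama) -> text" ;;
    mp4|mov|mkv)       echo "video: frame extraction -> vision LLM per frame -> text" ;;
    *)                 echo "text: LLM (Ollama) -> response" ;;
  esac
}
```

Anything that is not recognized as audio, image, or video falls through to the plain text path, which matches the diagram: every pipeline ends in text that the LLM consumes.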
Vision Models (Image Understanding) ✅ Available
These run on Ollama and accept image input alongside text prompts.
Available on Ollama
| Model | Size | RAM Needed | Capability |
|---|---|---|---|
| qwen2.5vl:72b | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally |
| llava:34b | ~22 GB | ~22 GB | Strong image understanding, OCR |
| llava:13b | ~9 GB | ~9 GB | Good balance |
| llava-llama3:8b | ~6 GB | ~6 GB | LLaVA on Llama3 base |
| minicpm-v:8b | ~6 GB | ~6 GB | Strong vision + OCR |
| qwen2.5vl:7b | ~6 GB | ~6 GB | Qwen vision, very capable |
| moondream2 | ~2 GB | ~2 GB | Tiny, fast, basic vision |
Recommended for M4 Pro 48 GB
- Best quality: qwen2.5vl:72b (possible but tight: ~45 GB, leaves 3 GB)
- Safe choice: llava:34b (22 GB, leaves 26 GB free for other tools)
- Fast: qwen2.5vl:7b (6 GB, very responsive)
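The headroom arithmetic behind these recommendations (total RAM minus model size) can be sketched as a small selection helper. This is a rough sketch under stated assumptions: `pick_vision_model` and `VISION_MODELS` are hypothetical names, the sizes are the approximate figures from the table above, and real memory use also depends on context length and other running processes.

```shell
# Approximate model sizes in GB, copied from the table above, largest first.
VISION_MODELS="qwen2.5vl:72b=45 llava:34b=22 llava:13b=9 llava-llama3:8b=6 minicpm-v:8b=6 qwen2.5vl:7b=6 moondream2=2"

# Hypothetical helper: print the largest model that still leaves
# at least reserve_gb of RAM free on a machine with total_gb of RAM.
pick_vision_model() {
  local total_gb="$1" reserve_gb="$2" entry name size
  for entry in $VISION_MODELS; do
    name="${entry%=*}"    # text before the "="
    size="${entry##*=}"   # text after the "="
    if [ $(( total_gb - size )) -ge "$reserve_gb" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1                # nothing fits with the requested reserve
}
```

For example, `pick_vision_model 48 8` skips the 72B model (only 3 GB would remain) and settles on llava:34b, while `pick_vision_model 48 2` accepts the tight 72B fit.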
Pull & Use
# Pull a vision model
ollama pull llava:34b
# Ask about an image by including its path in the prompt
ollama run llava:34b "Describe this image: /path/to/image.png"
Vision API
curl http://localhost:11434/api/generate -d '{
"model": "llava:34b",
"prompt": "What do you see in this image?",
"images": ["<base64-encoded-image>"]
}'
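The `images` field expects the raw file contents as a base64 string, so the request body has to be assembled from the image file first. A minimal sketch of that step follows; `build_vision_payload` is a hypothetical helper name, the JSON escaping only covers backslashes and double quotes in the prompt, and `"stream": false` is added so the server returns a single JSON object instead of a stream.

```shell
# Hypothetical helper: print the JSON body for POST /api/generate.
# Usage: build_vision_payload MODEL PROMPT IMAGE_PATH
build_vision_payload() {
  local model="$1" prompt="$2" image_path="$3"
  local image_b64 escaped_prompt
  # Reading from stdin and stripping newlines behaves the same with
  # the BSD (macOS) and GNU base64 implementations.
  image_b64="$(base64 < "$image_path" | tr -d '\n')"
  # Minimal JSON escaping: backslashes first, then double quotes.
  escaped_prompt="$(printf '%s' "$prompt" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$model" "$escaped_prompt" "$image_b64"
}
```

The output can be piped straight into curl, for example `build_vision_payload llava:34b "What do you see?" photo.png | curl -s http://localhost:11434/api/generate -d @-`. Prompts containing newlines or control characters would need a proper JSON encoder instead of the sed call.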
Audio Pipeline ✅ Available (via Whisper.cpp)
No Ollama models handle audio input natively. Audio requires a separate pipeline:
Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
Components
| Stage | Tool | Status |
|---|---|---|
| Speech-to-Text | whisper-cpp (whisper-cli, whisper-stream) | ✅ Installed, awaiting model |
| LLM Processing | Ollama (any model) | ✅ Ready |
| Text-to-Speech | kokoro / piper (local TTS) | Not installed |
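For pre-recorded audio, the first two stages can be chained by hand. The sketch below assumes the audio is first converted to 16 kHz mono 16-bit WAV (the input format whisper.cpp documents) and that `whisper-cli` writes its transcript next to the input via `-otxt -of`. `run` and `voice_to_llm` are hypothetical helper names, and setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: audio file -> transcript -> LLM response.
# Usage: voice_to_llm AUDIO_FILE WHISPER_MODEL OLLAMA_MODEL
voice_to_llm() {
  local audio="$1" whisper_model="$2" llm="$3"
  local base="${audio%.*}"   # path without its extension
  # Convert to 16 kHz mono 16-bit PCM WAV for whisper.cpp.
  run ffmpeg -y -i "$audio" -ar 16000 -ac 1 -c:a pcm_s16le "$base.16k.wav"
  # -nt drops timestamps; -otxt with -of writes the transcript to "$base.txt".
  run whisper-cli -m "$whisper_model" -f "$base.16k.wav" -nt -otxt -of "$base"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'ollama run %s <contents of %s.txt>\n' "$llm" "$base"
  else
    ollama run "$llm" "$(cat "$base.txt")"
  fi
}
```

Running `DRY_RUN=1 voice_to_llm memo.m4a ~/whisper-models/ggml-large-v3-turbo.bin qwen2.5:72b` shows the three commands without needing a Whisper model on disk, which is useful while the model download is still pending.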
Real-Time Voice → LLM Pipeline
# whisper-talk-llama handles the whole loop in one binary
whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en
Note: whisper-talk-llama is installed but requires both a Whisper model and a LLaMA GGUF model to be configured.
Local TTS Options (Not Yet Installed)
| Tool | Quality | Speed | Install |
|---|---|---|---|
| Kokoro | Excellent | Fast | pip install kokoro |
| Piper | Good | Very fast | brew install piper |
| espeak-ng | Basic | Instant | brew install espeak-ng |
Video Understanding ❌ Not Available Locally
True video understanding is not yet available on local models. Current state:
- No end-to-end video model runs on Ollama
- Workaround: Extract frames with ffmpeg → process each frame with a vision model
- Practical for screenshots/thumbnails, not for real video understanding
Frame Extraction Workaround
# Extract 1 frame per second from the video (the output directory must exist)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg
# Then describe each frame with a vision model (the image path goes in the prompt)
for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening in this image: $f"
done
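Per-frame descriptions are easier to read back when each one carries an approximate position in the video. A minimal sketch of that mapping, assuming the `frame_%04d.jpg` naming used above (ffmpeg numbers frames from 1) and an integer sampling rate; `frame_timestamp` is a hypothetical helper name.

```shell
# Hypothetical helper: map an extracted frame file to an approximate
# HH:MM:SS position, given the sampling rate used with ffmpeg's fps filter.
# Usage: frame_timestamp FRAME_FILE [FPS]
frame_timestamp() {
  local frame_file="$1" fps="${2:-1}"
  local name index seconds
  name="$(basename "$frame_file" .jpg)"                     # e.g. frame_0005
  # Strip leading zeros so the shell does not read the number as octal.
  index="$(printf '%s' "${name##*_}" | sed 's/^0*//')"
  index="${index:-0}"
  # Frame 1 sits at roughly t=0, so subtract one before dividing.
  seconds=$(( (index - 1) / fps ))
  printf '%02d:%02d:%02d\n' \
    $(( seconds / 3600 )) $(( seconds % 3600 / 60 )) $(( seconds % 60 ))
}
```

Inside the loop, something like `echo "[$(frame_timestamp "$f")]"` before each `ollama run` call yields a rough timeline of the video. The timestamps are approximate because the fps filter picks the nearest source frame for each output slot.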
For Real Video Understanding
Use cloud APIs:
- Google Gemini — native video input support
- OpenAI GPT-4o — accepts video frames
- Anthropic Claude — accepts images (frame extraction needed)
Kimi (Moonshot AI) — Cloud Only
Kimi models are not available on Ollama — they're cloud-only via the Moonshot API (api.moonshot.cn). No official Ollama-compatible release exists.
Closest Local Alternatives
| Looking For | Local Alternative | Why Similar |
|---|---|---|
| Kimi k1.5 (reasoning) | deepseek-r1:32b | Strong reasoning, long context |
| Kimi long context (128k) | qwen2.5:72b | Same tier, 128k context |
| Kimi coding | qwen2.5-coder:32b | Best local coding model |
ollama pull deepseek-r1:32b # 20 GB, reasoning-focused
Complete Local AI Stack Summary
┌─────────────────────────────────────────────────────┐
│            Local AI Stack (M4 Pro 48GB)             │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Audio In ──→ whisper-cpp (STT)  ──→ Text           │
│                                      │              │
│  Image In ──→ llava:34b (Vision) ──→ Text           │
│                                      │              │
│  Text ──────→ qwen2.5-coder:32b  ──→ Response       │
│                                      │              │
│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
│                                                     │
│  Server: Ollama :11434                              │
│  Dashboard: Mission Control :3100                   │
│  Whisper Server: whisper-server :8080 (optional)    │
│                                                     │
└─────────────────────────────────────────────────────┘
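A quick way to see which of the three services in the diagram is listening is to probe each port. This is a sketch under assumptions: `service_url` and `check_stack` are hypothetical helper names, the services are assumed to run on localhost, and "up" only means that something answered on the port, not that the service is healthy.

```shell
# Hypothetical helper: map a service name from the diagram to its local URL.
service_url() {
  case "$1" in
    ollama)         echo "http://localhost:11434" ;;
    dashboard)      echo "http://localhost:3100" ;;
    whisper-server) echo "http://localhost:8080" ;;
    *)              return 1 ;;
  esac
}

# Probe each service; curl exits non-zero when the connection is refused
# or when no response arrives within two seconds.
check_stack() {
  local name
  for name in ollama dashboard whisper-server; do
    if curl -s -o /dev/null --max-time 2 "$(service_url "$name")"; then
      echo "$name: up"
    else
      echo "$name: down"
    fi
  done
}
```

Calling `check_stack` prints one line per service. Since whisper-server is optional in this stack, a "down" result for it is not necessarily a problem.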