# 04 - Multimodal Local Stack

> Vision models, audio pipeline architecture, and video understanding status.

---

## Overview

A fully local multimodal AI stack requires three input pipelines (audio, image, video) that feed a text LLM, plus an optional speech output stage:

```
Audio In  → whisper.cpp (STT)   → text
Image In  → vision LLM (Ollama) → text description / analysis
Video In  → frame extraction    → vision LLM per frame → text
Text      → LLM (Ollama)        → response
Text Out  → TTS (optional)      → audio
```

---

## Vision Models (Image Understanding) ✅ Available

These run on Ollama and accept image input alongside text prompts.

### Available on Ollama

| Model             | Size   | RAM Needed               | Capability                                |
| ----------------- | ------ | ------------------------ | ----------------------------------------- |
| `qwen2.5vl:72b`   | ~45 GB | ~45 GB ⚠️ tight on 48 GB | Best vision locally                       |
| `llava:34b`       | ~22 GB | ~22 GB                   | Strong image understanding, OCR           |
| `llava:13b`       | ~9 GB  | ~9 GB                    | Good balance                              |
| `llava-llama3:8b` | ~6 GB  | ~6 GB                    | LLaVA on Llama3 base                      |
| `minicpm-v:8b`    | ~6 GB  | ~6 GB                    | Strong vision + OCR                       |
| `qwen2.5vl:7b`    | ~6 GB  | ~6 GB                    | Qwen vision, very capable                 |
| `moondream`       | ~2 GB  | ~2 GB                    | Tiny, fast, basic vision (Moondream 2)    |

### Recommended for M4 Pro 48 GB

- **Best quality:** `qwen2.5vl:72b` is possible but tight (~45 GB, leaves only ~3 GB for everything else)
- **Safe choice:** `llava:34b` uses 22 GB and leaves 26 GB free for other tools
- **Fast:** `qwen2.5vl:7b` uses 6 GB and is very responsive

### Pull & Use

```bash
# Pull a vision model
ollama pull llava:34b

# Use with an image: include the file path in the prompt
ollama run llava:34b "Describe this image: /path/to/image.png"
```

### Vision API

```bash
# "images" takes a list of base64-encoded image strings
curl http://localhost:11434/api/generate -d '{
  "model": "llava:34b",
  "prompt": "What do you see in this image?",
  "images": ["<base64-encoded image>"]
}'
```

---

## Audio Pipeline ✅ Available (via whisper.cpp)

No Ollama models handle audio input natively.
Audio requires a separate pipeline:

```
Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
```

### Components

| Stage              | Tool                                          | Status                       |
| ------------------ | --------------------------------------------- | ---------------------------- |
| **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
| **LLM Processing** | Ollama (any model)                            | ✅ Ready                     |
| **Text-to-Speech** | kokoro / piper (local TTS)                    | Not installed                |

### Real-Time Voice → LLM Pipeline

```bash
# whisper-talk-llama does this in one binary!
whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en
```

> Note: `whisper-talk-llama` is installed but requires both a Whisper model and a LLaMA GGUF model to be configured.

### Local TTS Options (Not Yet Installed)

| Tool          | Quality   | Speed     | Install                  |
| ------------- | --------- | --------- | ------------------------ |
| **Kokoro**    | Excellent | Fast      | `pip install kokoro`     |
| **Piper**     | Good      | Very fast | `brew install piper`     |
| **espeak-ng** | Basic     | Instant   | `brew install espeak-ng` |

---

## Video Understanding ❌ Not Available Locally

**True video understanding is not yet available on local models.**

Current state:

- No end-to-end video model runs on Ollama
- Workaround: extract frames with ffmpeg, then process each frame with a vision model
- Practical for screenshots/thumbnails, not for real video understanding

### Frame Extraction Workaround

```bash
# Extract 1 frame per second from video (ffmpeg needs the output directory to exist)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg

# Then process each frame with vision model
# (the ./ prefix keeps each frame recognizable as a file path inside the prompt)
for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening: $f"
done
```

### For Real Video Understanding

Use cloud APIs:

- **Google Gemini**: native video input support
- **OpenAI GPT-4o**: accepts video frames
- **Anthropic Claude**: accepts images (frame extraction needed)

---

## Kimi (Moonshot AI): Cloud Only

Kimi models are **not
available on Ollama**; they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.

### Closest Local Alternatives

| Looking For              | Local Alternative   | Why Similar                    |
| ------------------------ | ------------------- | ------------------------------ |
| Kimi k1.5 (reasoning)    | `deepseek-r1:32b`   | Strong reasoning, long context |
| Kimi long context (128k) | `qwen2.5:72b`       | Same tier, 128k context        |
| Kimi coding              | `qwen2.5-coder:32b` | Best local coding model        |

```bash
ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused
```

---

## Complete Local AI Stack Summary

```
┌─────────────────────────────────────────────────────┐
│            Local AI Stack (M4 Pro 48GB)             │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Audio In ──→ whisper-cpp (STT)  ──→ Text           │
│                                       │             │
│  Image In ──→ llava:34b (Vision) ──→ Text           │
│                                       │             │
│  Text ──────→ qwen2.5-coder:32b  ──→ Response       │
│                                       │             │
│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
│                                                     │
│  Server: Ollama :11434                              │
│  Dashboard: Mission Control :3100                   │
│  Whisper Server: whisper-server :8080 (optional)    │
│                                                     │
└─────────────────────────────────────────────────────┘
```
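
The image path of this stack (Image In → vision model → text) goes through Ollama's `/api/generate` endpoint, which takes each image as a base64-encoded string in the `images` array. As a minimal sketch, the helper below builds that request body from a file on disk. The function name `build_vision_payload` is hypothetical and not part of Ollama, and the sketch assumes the model name and prompt contain no double quotes or backslashes, because it does no JSON escaping.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the JSON body for Ollama's /api/generate.
# Arguments: model name, prompt text, path to an image file.
build_vision_payload() {
  local model="$1" prompt="$2" image_path="$3"
  local image_b64

  # Fail early if the image cannot be read.
  if [ ! -r "$image_path" ]; then
    echo "cannot read image: $image_path" >&2
    return 1
  fi

  # base64 reads stdin on both macOS and Linux; GNU base64 wraps its
  # output, so strip the newlines to get a single string.
  image_b64="$(base64 < "$image_path" | tr -d '\n')"

  # Assumption: model and prompt need no JSON escaping.
  # "stream": false asks Ollama for one JSON response instead of a stream.
  printf '{"model": "%s", "prompt": "%s", "images": ["%s"], "stream": false}\n' \
    "$model" "$prompt" "$image_b64"
}

# Usage (requires a running Ollama server with the model pulled):
# build_vision_payload llava:34b "What do you see in this image?" /path/to/image.png \
#   | curl -s http://localhost:11434/api/generate -d @-
```

Building the body in a function keeps the base64 step out of the `curl` command line, which matters for large images that would otherwise exceed the shell's argument length limit.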