# 04 — Multimodal Local Stack

> Vision models, audio pipeline architecture, and video understanding status.

---

## Overview

A fully local multimodal AI stack combines three input pipelines (audio, image, video) with a text LLM and optional speech output:

```
Audio In  →  whisper.cpp (STT)   →  text
Image In  →  vision LLM (Ollama) →  text description / analysis
Video In  →  frame extraction    →  vision LLM per frame → text
Text      →  LLM (Ollama)        →  response
Text Out  →  TTS (optional)      →  audio
```

---

## Vision Models (Image Understanding) ✅ Available

These run on Ollama and accept image input alongside text prompts.

### Available on Ollama

| Model             | Size   | RAM Needed              | Capability                      |
| ----------------- | ------ | ----------------------- | ------------------------------- |
| `qwen2.5vl:72b`   | ~45 GB | ~45 GB ⚠️ tight on 48GB | Best vision locally             |
| `llava:34b`       | ~22 GB | ~22 GB                  | Strong image understanding, OCR |
| `llava:13b`       | ~9 GB  | ~9 GB                   | Good balance                    |
| `llava-llama3:8b` | ~6 GB  | ~6 GB                   | LLaVA on Llama3 base            |
| `minicpm-v:8b`    | ~6 GB  | ~6 GB                   | Strong vision + OCR             |
| `qwen2.5vl:7b`    | ~6 GB  | ~6 GB                   | Qwen vision — very capable      |
| `moondream`       | ~2 GB  | ~2 GB                   | Tiny, fast, basic vision        |

### Recommended for M4 Pro 48 GB

- **Best quality:** `qwen2.5vl:72b` — possible but tight (~45 GB, leaves 3 GB)
- **Safe choice:** `llava:34b` — 22 GB, leaves 26 GB free for other tools
- **Fast:** `qwen2.5vl:7b` — 6 GB, very responsive
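
The headroom figures above are plain subtraction, which a small helper can make explicit. This is a minimal sketch: the function name `fits_in_ram` and the 8 GB default reserve are assumptions for illustration, not values taken from Ollama, and real memory use also grows with context length.

```bash
# Sketch: does a model of a given size leave enough unified memory free?
# fits_in_ram and the 8 GB default reserve are illustrative assumptions.
fits_in_ram() {
  local model_gb="$1" total_gb="$2" reserve_gb="${3:-8}"
  local free_gb=$(( total_gb - model_gb ))
  if [ "$free_gb" -ge "$reserve_gb" ]; then
    echo "ok: ${free_gb} GB left"
  else
    echo "tight: only ${free_gb} GB left"
    return 1
  fi
}

# Usage (whole gigabytes only):
#   fits_in_ram 22 48    # llava:34b on a 48 GB machine
#   fits_in_ram 45 48    # qwen2.5vl:72b on the same machine
```

With the sizes from the table, `llava:34b` passes the check and `qwen2.5vl:72b` is flagged as tight.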

### Pull & Use

```bash
# Pull a vision model
ollama pull llava:34b

# Use with an image: the CLI has no --images flag, so put the path inside the prompt
ollama run llava:34b "Describe this image: /path/to/image.png"
```

### Vision API

```bash
# "stream": false returns a single JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{
  "model": "llava:34b",
  "prompt": "What do you see in this image?",
  "images": ["<base64-encoded-image>"],
  "stream": false
}'
```
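
The `<base64-encoded-image>` placeholder in the request has to be filled with the raw base64 of the file. One way to do that is sketched below; `build_vision_payload` is a hypothetical helper name, not part of Ollama, and the prompt escaping only covers backslashes and double quotes.

```bash
# Sketch: build the /api/generate request body from an image file.
# build_vision_payload is a hypothetical helper name, not part of Ollama.
build_vision_payload() {
  local model="$1" prompt="$2" image="$3"
  local b64 esc
  # Encode the image and strip the line wraps some base64 builds insert.
  b64="$(base64 < "$image" | tr -d '\n')"
  # Escape backslashes and double quotes so the prompt stays valid JSON
  # (newlines and control characters in the prompt are not handled here).
  esc="$(printf '%s' "$prompt" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$model" "$esc" "$b64"
}

# Usage: pipe the payload into curl, which reads the body from stdin via @-
#   build_vision_payload llava:34b "What do you see in this image?" /path/to/image.png \
#     | curl -s http://localhost:11434/api/generate -d @-
```

Reading the image through `base64 < file | tr -d '\n'` avoids the flag differences between the macOS and GNU versions of `base64`.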

---

## Audio Pipeline ✅ Available (via Whisper.cpp)

No Ollama models handle audio input natively. Audio requires a separate pipeline:

```
Microphone → whisper.cpp (local STT) → LLM (Ollama) → optional TTS
```

### Components

| Stage              | Tool                                          | Status                       |
| ------------------ | --------------------------------------------- | ---------------------------- |
| **Speech-to-Text** | whisper-cpp (`whisper-cli`, `whisper-stream`) | ✅ Installed, awaiting model |
| **LLM Processing** | Ollama (any model)                            | ✅ Ready                     |
| **Text-to-Speech** | kokoro / piper (local TTS)                    | Not installed                |

### Real-Time Voice → LLM Pipeline

```bash
# whisper-talk-llama runs STT and the LLM in a single binary
whisper-talk-llama \
  --model-whisper ~/whisper-models/ggml-large-v3-turbo.bin \
  --model-llama ~/.ollama/models/... \
  --language en
```

> Note: `whisper-talk-llama` is installed but needs both a Whisper model and a LLaMA-family GGUF model configured before it will run.

### Local TTS Options (Not Yet Installed)

| Tool          | Quality   | Speed     | Install                  |
| ------------- | --------- | --------- | ------------------------ |
| **Kokoro**    | Excellent | Fast      | `pip install kokoro`     |
| **Piper**     | Good      | Very fast | `brew install piper`     |
| **espeak-ng** | Basic     | Instant   | `brew install espeak-ng` |

---

## Video Understanding ❌ Not Available Locally

**True video understanding is not yet available on local models.** Current state:

- No end-to-end video model runs on Ollama
- Workaround: Extract frames with ffmpeg → process each frame with a vision model
- Practical for screenshots/thumbnails, not for real video understanding: each frame is described in isolation, so motion, ordering across frames, and audio are lost

### Frame Extraction Workaround

```bash
# Extract 1 frame per second from video (ffmpeg will not create the output directory)
mkdir -p frames
ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.jpg

# Then process each frame with a vision model; the image path goes inside the prompt
for f in ./frames/*.jpg; do
  ollama run llava:34b "Describe what's happening: $f"
done
```
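
Per-frame descriptions are easier to read back when each one carries its position in the video. The sketch below assumes 1 fps extraction and the `frame_%04d.jpg` naming used above, so frame N sits at roughly N-1 seconds. `frame_timestamp` and `describe_frames` are hypothetical helper names, and the leading `./` or absolute path requirement is an assumption about how the CLI detects image paths in a prompt.

```bash
# Sketch: label each frame description with its approximate video time.
# Assumes frames were extracted at 1 fps and named frame_0001.jpg, frame_0002.jpg, ...
# frame_timestamp and describe_frames are hypothetical helper names.
frame_timestamp() {
  local name n secs
  name="$(basename "$1" .jpg)"             # frame_0061
  n="${name#frame_}"                       # 0061
  n="$(printf '%s' "$n" | sed 's/^0*//')"  # 61 (leading zeros would parse as octal)
  secs=$(( ${n:-1} - 1 ))                  # frame 1 is roughly t = 0 s
  printf '%02d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

describe_frames() {
  # Pass dir as an absolute path or with a leading ./ so the image path is recognized.
  local dir="$1" model="$2" f
  for f in "$dir"/frame_*.jpg; do
    [ -e "$f" ] || continue                # skip the literal pattern if no frames exist
    printf '[%s] ' "$(frame_timestamp "$f")"
    "${OLLAMA:-ollama}" run "$model" "Describe what's happening: $f"
  done
}

# Usage:
#   describe_frames ./frames llava:34b > timeline.txt
```

The result is a rough text timeline, not video understanding: each line still describes a single frame in isolation.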

### For Real Video Understanding

Use cloud APIs:

- **Google Gemini** — native video input support
- **OpenAI GPT-4o** — accepts video frames
- **Anthropic Claude** — accepts images (frame extraction needed)

---

## Kimi (Moonshot AI) — Cloud Only

Kimi models are **not available on Ollama** — they're cloud-only via the Moonshot API (`api.moonshot.cn`). No official Ollama-compatible release exists.

### Closest Local Alternatives

| Looking For              | Local Alternative   | Why Similar                    |
| ------------------------ | ------------------- | ------------------------------ |
| Kimi k1.5 (reasoning)    | `deepseek-r1:32b`   | Strong reasoning, long context |
| Kimi long context (128k) | `qwen2.5:72b`       | Same tier, 128k context        |
| Kimi coding              | `qwen2.5-coder:32b` | Best local coding model        |

```bash
ollama pull deepseek-r1:32b   # 20 GB, reasoning-focused
```

---

## Complete Local AI Stack Summary

```
┌─────────────────────────────────────────────────────┐
│              Local AI Stack (M4 Pro 48GB)            │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Audio In ──→ whisper-cpp (STT) ──→ Text            │
│                                      │              │
│  Image In ──→ llava:34b (Vision) ──→ Text           │
│                                      │              │
│  Text ──────→ qwen2.5-coder:32b ──→ Response        │
│                                      │              │
│  Response ──→ kokoro/piper (TTS) ──→ Audio Out      │
│                                                     │
│  Server: Ollama :11434                              │
│  Dashboard: Mission Control :3100                   │
│  Whisper Server: whisper-server :8080 (optional)    │
│                                                     │
└─────────────────────────────────────────────────────┘
```
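
A quick way to see which pieces of this diagram are actually in place is to check for the binaries and the Ollama port. This is a sketch only: `check_tool` and `stack_status` are hypothetical helper names, and the tool list covers just the command-line components named in this document.

```bash
# Sketch: report which command-line pieces of the stack are on PATH.
# check_tool and stack_status are hypothetical helper names.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: ok"
  else
    echo "$1: missing"
    return 1
  fi
}

stack_status() {
  local tool missing=0
  for tool in ollama whisper-cli ffmpeg; do
    check_tool "$tool" || missing=$(( missing + 1 ))
  done
  # /api/tags lists locally pulled models; a failed request means the server is not reachable.
  if curl -s --max-time 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    echo "ollama server :11434: up"
  else
    echo "ollama server :11434: down"
    missing=$(( missing + 1 ))
  fi
  [ "$missing" -eq 0 ]
}

# Usage:
#   stack_status && echo "stack ready"
```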