# 07 — Model Recommendations

> Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.

---

## Tier 1 — Best Overall Coding Models

| Model                   | Size  | RAM Used | Pull Command                    | Notes                                              |
| ----------------------- | ----- | -------- | ------------------------------- | -------------------------------------------------- |
| **`qwen2.5-coder:32b`** | 19 GB | ~22 GB   | `ollama pull qwen2.5-coder:32b` | **Top pick** — rivals GPT-4o on code, 128k context |
| `deepseek-coder-v2:16b` | 10 GB | ~12 GB   | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B               |
| `codestral:22b`         | 13 GB | ~15 GB   | `ollama pull codestral`         | Mistral's coding model, very fast completions      |

## Tier 2 — Fast & Capable (Speed/Quality Balance)

| Model                  | Size | RAM Used | Pull Command                      | Notes                                         |
| ---------------------- | ---- | -------- | --------------------------------- | --------------------------------------------- |
| **`qwen2.5-coder:7b`** | 5 GB | ~6 GB    | `ollama pull qwen2.5-coder:7b`    | Fast, surprisingly good for TS/Python/Swift   |
| `deepseek-coder:6.7b`  | 4 GB | ~5 GB    | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding            |
| `codegemma:7b`         | 5 GB | ~6 GB    | `ollama pull codegemma:7b`        | Google's model, decent but outclassed by Qwen |

## Tier 3 — General Purpose (Also Good at Code)

| Model               | Size   | RAM Used | Pull Command               | Notes                               |
| ------------------- | ------ | -------- | -------------------------- | ----------------------------------- |
| `llama3.1:70b` (Q4) | 40 GB  | ~42 GB   | `ollama pull llama3.1:70b` | Best general model — tight on 48 GB |
| `llama3.1:8b`       | 4.9 GB | ~6 GB    | `ollama pull llama3.1:8b`  | Fast, good for evals                |
| `mistral-nemo:12b`  | 7 GB   | ~9 GB    | `ollama pull mistral-nemo` | Fast reasoning                      |
| `phi4:14b`          | 9 GB   | ~11 GB   | `ollama pull phi4`         | Strong reasoning, fits comfortably  |

## Tier 4 — Reasoning & Deep Thinking

| Model                 | Size  | Parameters | Quant  | RAM Used | Pull Command                  | Notes                                                                                                                             |
| --------------------- | ----- | ---------- | ------ | -------- | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **`deepseek-r1:32b`** | 20 GB | 32B        | Q4_K_M | ~22 GB   | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning — emits `<think>` blocks before JSON output; ~75–80% of llama3.3:70b reasoning quality at half the RAM |
| `deepseek-r1:7b`      | 5 GB  | 7B         | Q4_K_M | ~6 GB    | `ollama pull deepseek-r1:7b`  | Lightweight reasoning, good for quick triage                                                                                      |

> **⚠️ JSON output note:** DeepSeek R1 models emit `<think>...</think>` reasoning traces before the JSON response. Strip these before `JSON.parse()` — see [06-extraction-service-evals.md](06-extraction-service-evals.md) for the transform pattern.
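
The strip step in the note above can be sketched as follows. This is a minimal illustration, not necessarily the transform used in `06-extraction-service-evals.md`; the helper names `stripThinkBlocks` and `parseModelJson` are hypothetical, and the sample response is made up.

```typescript
// Hypothetical helper: removes DeepSeek R1 reasoning traces from a raw response.
// Assumes the model wraps its reasoning in <think>...</think> tags.
function stripThinkBlocks(raw: string): string {
  // Remove every complete <think>...</think> block, including newlines inside it.
  const withoutClosed = raw.replace(/<think>[\s\S]*?<\/think>/g, "");
  // If generation was cut off mid-thought, drop the unterminated block as well.
  const withoutOpen = withoutClosed.replace(/<think>[\s\S]*$/, "");
  return withoutOpen.trim();
}

// Hypothetical wrapper: strip first, then parse. JSON.parse still throws on invalid JSON.
function parseModelJson<T>(raw: string): T {
  return JSON.parse(stripThinkBlocks(raw)) as T;
}

// Made-up sample response, for illustration only.
const sample = '<think>\nThe user wants a label.\n</think>\n{"label": "bug", "priority": 2}';
const parsed = parseModelJson<{ label: string; priority: number }>(sample);
```

The cast to `T` does no runtime validation, so a schema check after parsing is still worthwhile for eval pipelines.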

## Tier 5 — Vision (Multimodal)

| Model          | Size  | RAM Used | Pull Command               | Notes                    |
| -------------- | ----- | -------- | -------------------------- | ------------------------ |
| `llava:34b`    | 22 GB | ~22 GB   | `ollama pull llava:34b`    | Image understanding, OCR |
| `qwen2.5vl:7b` | 6 GB  | ~6 GB    | `ollama pull qwen2.5vl:7b` | Qwen vision, fast        |
| `minicpm-v:8b` | 6 GB  | ~6 GB    | `ollama pull minicpm-v:8b` | Strong OCR               |
| `moondream`    | 2 GB  | ~2 GB    | `ollama pull moondream`    | Tiny, basic vision       |

## Tier 6 — Embeddings

| Model               | Size   | RAM Used | Pull Command                    | Notes                     |
| ------------------- | ------ | -------- | ------------------------------- | ------------------------- |
| `nomic-embed-text`  | 0.3 GB | ~0.5 GB  | `ollama pull nomic-embed-text`  | Good for semantic search  |
| `mxbai-embed-large` | 0.7 GB | ~1 GB    | `ollama pull mxbai-embed-large` | Higher quality embeddings |

---

## Recommended 10-Model Stack for M4 Pro 48 GB

For maximum coverage across all use cases:

| #   | Model                   | Disk        | Use Case                                 |
| --- | ----------------------- | ----------- | ---------------------------------------- |
| 1   | `qwen2.5-coder:32b`     | 19 GB       | **Primary** — coding (TS, Python, Swift) |
| 2   | `qwen2.5-coder:7b`      | 5 GB        | Fast coding completions                  |
| 3   | `deepseek-coder-v2:16b` | 10 GB       | Alternative coding model                 |
| 4   | `llama3.1:8b`           | 4.9 GB      | Eval default, general tasks              |
| 5   | `deepseek-r1:32b`       | 20 GB       | Deep reasoning, complex triage           |
| 6   | `codestral:22b`         | 13 GB       | Fast code completions (Mistral)          |
| 7   | `phi4:14b`              | 9 GB        | Reasoning, structured output             |
| 8   | `llava:34b`             | 22 GB       | Vision / image understanding             |
| 9   | `mistral-nemo:12b`      | 7 GB        | Fast general purpose                     |
| 10  | `nomic-embed-text`      | 0.3 GB      | Embeddings / semantic search             |
|     | **Total**               | **~110 GB** |                                          |

All 10 can sit on disk simultaneously. Models load into RAM on demand, so memory only has to hold the model in use, not the whole stack.
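
As a quick sanity check on storage, the Disk column can be summed directly. The sketch below copies the sizes from the stack table; the variable names are illustrative only.

```typescript
// Disk sizes in GB, copied from the stack table above.
const stack: { model: string; diskGb: number }[] = [
  { model: "qwen2.5-coder:32b", diskGb: 19 },
  { model: "qwen2.5-coder:7b", diskGb: 5 },
  { model: "deepseek-coder-v2:16b", diskGb: 10 },
  { model: "llama3.1:8b", diskGb: 4.9 },
  { model: "deepseek-r1:32b", diskGb: 20 },
  { model: "codestral:22b", diskGb: 13 },
  { model: "phi4:14b", diskGb: 9 },
  { model: "llava:34b", diskGb: 22 },
  { model: "mistral-nemo:12b", diskGb: 7 },
  { model: "nomic-embed-text", diskGb: 0.3 },
];

// Total disk needed to keep the whole stack pulled (about 110 GB).
const totalDiskGb = stack.reduce((sum, m) => sum + m.diskGb, 0);

// Largest single download, i.e. the biggest model that would be loaded on its own.
const largestDiskGb = stack.reduce((max, m) => Math.max(max, m.diskGb), 0);
```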

---

## By Use Case (Quick Reference)

| Use Case                       | Best Model          | Fallback                |
| ------------------------------ | ------------------- | ----------------------- |
| **TypeScript/ESM coding**      | `qwen2.5-coder:32b` | `qwen2.5-coder:7b`      |
| **Python coding**              | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` |
| **Swift/iOS coding**           | `qwen2.5-coder:32b` | `codestral:22b`         |
| **Extraction evals**           | `llama3.1:8b`       | `qwen2.5-coder:32b`     |
| **JSON structured output**     | `qwen2.5-coder:32b` | `qwen2.5:7b`            |
| **Complex reasoning / triage** | `deepseek-r1:32b`   | `phi4:14b`              |
| **Brain signal routing**       | `deepseek-r1:32b`   | `qwen2.5-coder:32b`     |
| **Image understanding**        | `llava:34b`         | `qwen2.5vl:7b`          |
| **Embeddings**                 | `nomic-embed-text`  | `mxbai-embed-large`     |
| **Fast iteration / dev evals** | `llama3.1:8b`       | `qwen2.5-coder:7b`      |
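
The best/fallback pairs above map naturally onto a small lookup. The sketch below is one possible shape, assuming the caller already knows which models are installed; the use-case keys and the `pickModel` function are hypothetical names, and only a subset of the table is included.

```typescript
type Route = { best: string; fallback: string };

// A subset of the quick-reference table; the keys are hypothetical identifiers.
const routes: Record<string, Route> = {
  "typescript": { best: "qwen2.5-coder:32b", fallback: "qwen2.5-coder:7b" },
  "extraction-evals": { best: "llama3.1:8b", fallback: "qwen2.5-coder:32b" },
  "reasoning": { best: "deepseek-r1:32b", fallback: "phi4:14b" },
  "vision": { best: "llava:34b", fallback: "qwen2.5vl:7b" },
  "embeddings": { best: "nomic-embed-text", fallback: "mxbai-embed-large" },
};

// Returns the best model if installed, otherwise the fallback, otherwise undefined.
function pickModel(useCase: string, installed: ReadonlySet<string>): string | undefined {
  const route = routes[useCase];
  if (route === undefined) return undefined; // unknown use case
  if (installed.has(route.best)) return route.best;
  if (installed.has(route.fallback)) return route.fallback;
  return undefined; // neither option is available locally
}

// Example: only two models are installed.
const installed = new Set<string>(["llama3.1:8b", "qwen2.5-coder:32b"]);
const forEvals = pickModel("extraction-evals", installed);
const forVision = pickModel("vision", installed);
const forReasoning = pickModel("reasoning", installed);
```

Returning `undefined` rather than throwing leaves the caller free to decide whether to pull the model or route to a cloud option.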

---

## Comprehensive Model Comparison Table

Detailed capability reference for the main local models on M4 Pro 48 GB, with two cloud models included as a baseline:

| Model                   | Disk    | Params | Quant  | RAM    | Tok/s     | JSON           | Reasoning  | Code       | Instruction Following | Context | `<think>` | Status on this machine     |
| ----------------------- | ------- | ------ | ------ | ------ | --------- | -------------- | ---------- | ---------- | --------------------- | ------- | --------- | -------------------------- |
| `llama3.1:8b`           | 4.9 GB  | 8B     | Q4_K_M | ~6 GB  | 40–60     | ✅ Good        | ⭐⭐       | ⭐⭐       | ⭐⭐⭐                | 128k    | ❌        | ✅ Installed               |
| `qwen2.5-coder:32b`     | 18.5 GB | 32.8B  | Q4_K_M | ~22 GB | 15–25     | ✅ Excellent   | ⭐⭐⭐     | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐            | 128k    | ❌        | ✅ Installed               |
| `deepseek-r1:32b`       | 20 GB   | 32B    | Q4_K_M | ~22 GB | 12–20     | ⚠️ Needs strip | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐   | ⭐⭐⭐⭐              | 128k    | ✅ Yes    | 🔲 Not installed           |
| `llama3.3:70b` (Q4)     | 40 GB   | 70B    | Q4_K_M | ~42 GB | 5–10      | ✅ Excellent   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐            | 128k    | ❌        | ⚠️ Tight (6 GB left for OS) |
| `qwen2.5:7b`            | 5 GB    | 7B     | Q4_K_M | ~6 GB  | 40–60     | ✅ Excellent   | ⭐⭐       | ⭐⭐⭐     | ⭐⭐⭐⭐              | 128k    | ❌        | 🔲 Not installed           |
| `deepseek-r1:7b`        | 5 GB    | 7B     | Q4_K_M | ~6 GB  | 35–50     | ⚠️ Needs strip | ⭐⭐⭐⭐   | ⭐⭐⭐     | ⭐⭐⭐                | 128k    | ✅ Yes    | 🔲 Not installed           |
| `phi4:14b`              | 9 GB    | 14B    | Q4_K_M | ~11 GB | 25–35     | ✅ Good        | ⭐⭐⭐⭐   | ⭐⭐⭐     | ⭐⭐⭐⭐              | 16k     | ❌        | 🔲 Not installed           |
| `deepseek-coder-v2:16b` | 10 GB   | 16B    | Q4_K_M | ~12 GB | 25–35     | ✅ Good        | ⭐⭐⭐     | ⭐⭐⭐⭐   | ⭐⭐⭐⭐              | 128k    | ❌        | 🔲 Not installed           |
| `codestral:22b`         | 13 GB   | 22B    | Q4_K_M | ~15 GB | 20–30     | ✅ Good        | ⭐⭐⭐     | ⭐⭐⭐⭐   | ⭐⭐⭐⭐              | 32k     | ❌        | 🔲 Not installed           |
| `gemini-2.5-flash`      | —       | —      | Cloud  | —      | ~1s/req   | ✅ Excellent   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐            | 1M      | ❌        | ☁️ Cloud ($0.003/run)      |
| `gpt-4o`                | —       | —      | Cloud  | —      | ~1–2s/req | ✅ Excellent   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐            | 128k    | ❌        | ☁️ Cloud ($0.05–0.15/run)  |

### Column Key

| Column                    | Meaning                                                              |
| ------------------------- | -------------------------------------------------------------------- |
| **Tok/s**                 | Tokens per second on M4 Pro 48 GB (Metal backend); cloud rows show approximate latency per request instead |
| **JSON**                  | Reliability of structured JSON output compliance                     |
| **Reasoning**             | Multi-step / chain-of-thought quality (⭐ = weak, ⭐⭐⭐⭐⭐ = best) |
| **Code**                  | Code generation quality across TS/Python/Swift                       |
| **Instruction Following** | Adherence to output format constraints                               |
| **`<think>`**             | Emits reasoning traces before output (needs stripping for JSON)      |
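
To reproduce a Tok/s figure locally, the timing fields in Ollama's final `/api/generate` response can be used: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. The sketch below covers only the arithmetic; the `tokensPerSecond` helper and the sample numbers are illustrative, not measurements from this machine.

```typescript
// Subset of the timing fields in Ollama's final /api/generate response.
interface OllamaTimings {
  eval_count: number; // tokens generated
  eval_duration: number; // generation time in nanoseconds
}

// Hypothetical helper: converts the timing fields to tokens per second.
function tokensPerSecond(t: OllamaTimings): number {
  if (t.eval_duration <= 0) return 0; // avoid dividing by zero on empty responses
  const seconds = t.eval_duration / 1e9;
  return t.eval_count / seconds;
}

// Made-up sample: 300 tokens generated in 15 seconds.
const sampleTimings: OllamaTimings = { eval_count: 300, eval_duration: 15e9 };
const rate = tokensPerSecond(sampleTimings);
```

Prompt processing is reported separately (`prompt_eval_count`, `prompt_eval_duration`), so this figure reflects generation speed only.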

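### Gap Analysis vs llama3.3:70b (cloud-quality ceiling locally)

Values are the approximate share of `llama3.3:70b` quality each model reaches (higher is better), not the size of the shortfall.

| Capability             | `llama3.1:8b` | `qwen2.5-coder:32b` | `deepseek-r1:32b` |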
| ---------------------- | :-----------: | :-----------------: | :---------------: |
| Multi-step reasoning   |     ~40%      |        ~65%         |       ~80%        |
| Strict JSON compliance |     ~75%      |        ~95%         |      ~70%\*       |
| Brain signal routing   |     ~60%      |        ~80%         |       ~90%        |
| Code generation        |     ~55%      |        ~95%         |       ~80%        |
| Instruction following  |     ~70%      |        ~90%         |       ~85%        |
| **Overall vs 70B**     |   **~55%**    |      **~85%**       |    **~75–80%**    |

\*With `<think>` strip transform applied

---

## Hardware Guide (General)

For reference if running on different hardware:

| RAM    | Max Model Size | Recommendation                        |
| ------ | -------------- | ------------------------------------- |
| 8 GB   | 7B             | `qwen2.5-coder:7b`                    |
| 16 GB  | 13–16B         | `deepseek-coder-v2:16b`               |
| 24 GB  | 32B            | `qwen2.5-coder:32b`                   |
| 32 GB  | 32B + headroom | `qwen2.5-coder:32b` (comfortable)     |
| 48 GB  | 70B (Q4)       | `llama3.1:70b` or any 32B comfortably |
| 64 GB+ | 70B (Q4–Q5)    | 70B with headroom; Q8 is 8-bit, not full precision, and needs roughly 75 GB or more |
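
The sizing logic behind this table can be sketched as a simple budget check. The "RAM Used" values are copied from the tier tables above; the `modelsThatFit` function and the 2 GB default reserve are assumptions for illustration (the comparison table already flags 6 GB of headroom as tight for the 70B model, so a larger reserve may be wiser in practice).

```typescript
interface ModelFootprint {
  model: string;
  ramGb: number; // approximate "RAM Used" from the tier tables
}

const footprints: ModelFootprint[] = [
  { model: "qwen2.5-coder:7b", ramGb: 6 },
  { model: "deepseek-coder-v2:16b", ramGb: 12 },
  { model: "qwen2.5-coder:32b", ramGb: 22 },
  { model: "llama3.1:70b", ramGb: 42 },
];

// Hypothetical rule of thumb: subtract a reserve for the OS and other apps,
// then keep every model whose footprint fits in what is left.
function modelsThatFit(totalRamGb: number, reservedGb: number = 2): ModelFootprint[] {
  const budget = totalRamGb - reservedGb;
  return footprints.filter((m) => m.ramGb <= budget);
}

const on16 = modelsThatFit(16).map((m) => m.model);
const on48 = modelsThatFit(48).map((m) => m.model);
// A stricter reserve excludes the 70B model even on 48 GB.
const on48Strict = modelsThatFit(48, 8).map((m) => m.model);
```

Long contexts grow the KV cache beyond these figures, so treat the result as an upper bound on what fits rather than a guarantee.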