diff --git a/__LOCAL_LLMs/docs/07-model-recommendations.md b/__LOCAL_LLMs/docs/07-model-recommendations.md index c64ea1c7..f31e66d6 100644 --- a/__LOCAL_LLMs/docs/07-model-recommendations.md +++ b/__LOCAL_LLMs/docs/07-model-recommendations.md @@ -31,10 +31,12 @@ ## Tier 4 — Reasoning & Deep Thinking -| Model | Size | RAM Used | Pull Command | Notes | -| --------------------- | ----- | -------- | ----------------------------- | ------------------------------------------------ | -| **`deepseek-r1:32b`** | 20 GB | ~22 GB | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning, closest to Kimi k1.5 | -| `deepseek-r1:7b` | 5 GB | ~6 GB | `ollama pull deepseek-r1:7b` | Lightweight reasoning | +| Model | Size | Parameters | Quant | RAM Used | Pull Command | Notes | +| --------------------- | ----- | ---------- | ------ | -------- | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | +| **`deepseek-r1:32b`** | 20 GB | 32B | Q4_K_M | ~22 GB | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning — emits `` blocks before JSON output; ~75–80% of llama3.3:70b reasoning quality at half the RAM | +| `deepseek-r1:7b` | 5 GB | 7B | Q4_K_M | ~6 GB | `ollama pull deepseek-r1:7b` | Lightweight reasoning, good for quick triage | + +> **⚠️ JSON output note:** DeepSeek R1 models emit `...` reasoning traces before the JSON response. Strip these before `JSON.parse()` — see [06-extraction-service-evals.md](06-extraction-service-evals.md) for the transform pattern. ## Tier 5 — Vision (Multimodal) @@ -78,17 +80,62 @@ Only one loads into RAM at a time. You can have all 10 on disk simultaneously. ## By Use Case (Quick Reference) -| Use Case | Best Model | Fallback | -| -------------------------- | ------------------- | ----------------------- | -| **TypeScript/ESM coding** | `qwen2.5-coder:32b` | `qwen2.5-coder:7b` | -| **Python coding** | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` | -| **Swift/iOS coding** | `qwen2.5-coder:32b` | `codestral:22b` | -| **Extraction evals** | `llama3.1:8b` | `qwen2.5:7b` | -| **JSON structured output** | `qwen2.5:7b` | `qwen2.5-coder:7b` | -| **Complex reasoning** | `deepseek-r1:32b` | `phi4:14b` | -| **Image understanding** | `llava:34b` | `qwen2.5vl:7b` | -| **Embeddings** | `nomic-embed-text` | `mxbai-embed-large` | -| **Fast iteration** | `qwen2.5-coder:7b` | `llama3.1:8b` | +| Use Case | Best Model | Fallback | +| ------------------------------ | ------------------- | ----------------------- | +| **TypeScript/ESM coding** | `qwen2.5-coder:32b` | `qwen2.5-coder:7b` | +| **Python coding** | `qwen2.5-coder:32b` | `deepseek-coder-v2:16b` | +| **Swift/iOS coding** | `qwen2.5-coder:32b` | `codestral:22b` | +| **Extraction evals** | `llama3.1:8b` | `qwen2.5-coder:32b` | +| **JSON structured output** | `qwen2.5-coder:32b` | `qwen2.5:7b` | +| **Complex reasoning / triage** | `deepseek-r1:32b` | `phi4:14b` | +| **Brain signal routing** | `deepseek-r1:32b` | `qwen2.5-coder:32b` | +| **Image understanding** | `llava:34b` | `qwen2.5vl:7b` | +| **Embeddings** | `nomic-embed-text` | `mxbai-embed-large` | +| **Fast iteration / dev evals** | `llama3.1:8b` | `qwen2.5-coder:7b` | + +--- + +## Comprehensive Model Comparison Table + +All models discussed — detailed capability reference for M4 Pro 48 GB: + +| Model | Disk | Params | Quant | RAM | Tok/s | JSON | Reasoning | Code | Instruction Following | Context | `` | Status on this machine | +| ----------------------- | ------- | ------ | ------ | ------ | --------- | -------------- | ---------- | ---------- | --------------------- | ------- | --------- | -------------------------- | +| `llama3.1:8b` | 4.9 GB | 8B | Q4_K_M | ~6 GB | 40–60 | ✅ Good | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | 128k | ❌ | ✅ Installed | +| `qwen2.5-coder:32b` | 18.5 GB | 32.8B | Q4_K_M | ~22 GB | 15–25 | ✅ Excellent | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ✅ Installed | +| `deepseek-r1:32b` | 20 GB | 32B | Q4_K_M | ~22 GB | 12–20 | ⚠️ Needs strip | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ✅ Yes | 🔲 Not installed | +| `llama3.3:70b` (Q4) | 40 GB | 70B | Q4_K_M | ~42 GB | 5–10 | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ⚠️ Tight (6GB left for OS) | +| `qwen2.5:7b` | 5 GB | 7B | Q4_K_M | ~6 GB | 40–60 | ✅ Excellent | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ❌ | 🔲 Not installed | +| `deepseek-r1:7b` | 5 GB | 7B | Q4_K_M | ~6 GB | 35–50 | ⚠️ Needs strip | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 128k | ✅ Yes | 🔲 Not installed | +| `phi4:14b` | 9 GB | 14B | Q4_K_M | ~11 GB | 25–35 | ✅ Good | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 16k | ❌ | 🔲 Not installed | +| `deepseek-coder-v2:16b` | 10 GB | 16B | Q4_K_M | ~12 GB | 25–35 | ✅ Good | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ❌ | 🔲 Not installed | +| `codestral:22b` | 13 GB | 22B | Q4_K_M | ~15 GB | 20–30 | ✅ Good | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 32k | ❌ | 🔲 Not installed | +| `gemini-2.5-flash` | — | — | Cloud | — | ~1s/req | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 1M | ❌ | ☁️ Cloud ($0.003/run) | +| `gpt-4o` | — | — | Cloud | — | ~1–2s/req | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ☁️ Cloud ($0.05–0.15/run) | + +### Column Key + +| Column | Meaning | +| ------------------------- | -------------------------------------------------------------------- | +| **Tok/s** | Tokens per second on M4 Pro 48 GB (Metal backend) | +| **JSON** | Reliability of structured JSON output compliance | +| **Reasoning** | Multi-step / chain-of-thought quality (⭐ = weak, ⭐⭐⭐⭐⭐ = best) | +| **Code** | Code generation quality across TS/Python/Swift | +| **Instruction Following** | Adherence to output format constraints | +| **``** | Emits reasoning traces before output (needs stripping for JSON) | + +### Gap Analysis vs llama3.3:70b (cloud-quality ceiling locally) + +| Gap | `llama3.1:8b` | `qwen2.5-coder:32b` | `deepseek-r1:32b` | +| ---------------------- | :-----------: | :-----------------: | :---------------: | +| Multi-step reasoning | ~40% | ~65% | ~80% | +| Strict JSON compliance | ~75% | ~95% | ~70%\* | +| Brain signal routing | ~60% | ~80% | ~90% | +| Code generation | ~55% | ~95% | ~80% | +| Instruction following | ~70% | ~90% | ~85% | +| **Overall vs 70B** | **~55%** | **~85%** | **~75–80%** | + +\*With `` strip transform applied ---