learning_ai_common_plat/__LOCAL_LLMs/docs/07-model-recommendations.md
saravanakumardb1 5deb5efdcf docs(local-llms): add comprehensive model comparison table and deepseek-r1:32b details
- Add Comprehensive Model Comparison Table: 11 models (local + cloud) with
  Disk, Params, Quant, RAM, Tok/s, JSON quality, Reasoning, Code, Instruction
  Following, Context window, <think> flag, and install status columns
- Add Gap Analysis table: llama3.1:8b (~55%), qwen2.5-coder:32b (~85%),
  deepseek-r1:32b (~75-80%) vs llama3.3:70b across 5 capability dimensions
- Update Tier 4 Reasoning table: add Parameters, Quant columns; add <think>
  warning note with link to eval doc transform pattern
- Update By Use Case table: add brain signal routing row, update extraction
  evals fallback to qwen2.5-coder:32b
2026-02-19 16:06:02 -08:00


07 — Model Recommendations

Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.


Tier 1 — Best Overall Coding Models

| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| qwen2.5-coder:32b | 19 GB | ~22 GB | `ollama pull qwen2.5-coder:32b` | Top pick: rivals GPT-4o on code, 128k context |
| deepseek-coder-v2:16b | 10 GB | ~12 GB | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B |
| codestral:22b | 13 GB | ~15 GB | `ollama pull codestral` | Mistral's coding model, very fast completions |

Tier 2 — Fast & Capable (Speed/Quality Balance)

| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| qwen2.5-coder:7b | 5 GB | ~6 GB | `ollama pull qwen2.5-coder:7b` | Fast, surprisingly good for TS/Python/Swift |
| deepseek-coder:6.7b | 4 GB | ~5 GB | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding |
| codegemma:7b | 5 GB | ~6 GB | `ollama pull codegemma:7b` | Google's model, decent but outclassed by Qwen |

Tier 3 — General Purpose (Also Good at Code)

| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| llama3.1:70b (Q4) | 40 GB | ~42 GB | `ollama pull llama3.1:70b` | Best general model, but tight on 48 GB |
| llama3.1:8b | 4.9 GB | ~6 GB | `ollama pull llama3.1:8b` | Fast, good for evals |
| mistral-nemo:12b | 7 GB | ~9 GB | `ollama pull mistral-nemo` | Fast reasoning |
| phi4:14b | 9 GB | ~11 GB | `ollama pull phi4` | Strong reasoning, fits comfortably |

Tier 4 — Reasoning & Deep Thinking

| Model | Size | Parameters | Quant | RAM Used | Pull Command | Notes |
|---|---|---|---|---|---|---|
| deepseek-r1:32b | 20 GB | 32B | Q4_K_M | ~22 GB | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning; emits `<think>` blocks before JSON output. ~75-80% of llama3.3:70b reasoning quality at about half the RAM |
| deepseek-r1:7b | 5 GB | 7B | Q4_K_M | ~6 GB | `ollama pull deepseek-r1:7b` | Lightweight reasoning, good for quick triage |

⚠️ JSON output note: DeepSeek R1 models emit <think>...</think> reasoning traces before the JSON response. Strip these before JSON.parse() — see 06-extraction-service-evals.md for the transform pattern.
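The strip step described in the note can be sketched as follows. This is a minimal Python analogue of the strip-then-`JSON.parse()` flow, not the transform from 06-extraction-service-evals.md. The helper names `strip_think` and `parse_model_json` and the sample response are hypothetical, and it assumes the traces arrive as well-formed `<think>...</think>` pairs.

```python
import json
import re

# Matches one <think>...</think> block, including any newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(raw: str) -> str:
    """Remove every <think>...</think> block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", raw).strip()


def parse_model_json(raw: str):
    """Strip reasoning traces first, then parse what remains as JSON."""
    return json.loads(strip_think(raw))


# Hypothetical model response, made up for illustration.
sample = '<think>\nThe user wants a priority.\n</think>\n{"priority": "high"}'
result = parse_model_json(sample)
print(result)
```

The non-greedy `.*?` keeps the match from swallowing everything between the first opening tag and the last closing tag if a response contains more than one trace.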

Tier 5 — Vision (Multimodal)

| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| llava:34b | 22 GB | ~22 GB | `ollama pull llava:34b` | Image understanding, OCR |
| qwen2.5vl:7b | 6 GB | ~6 GB | `ollama pull qwen2.5vl:7b` | Qwen vision, fast |
| minicpm-v:8b | 6 GB | ~6 GB | `ollama pull minicpm-v:8b` | Strong OCR |
| moondream2 | 2 GB | ~2 GB | `ollama pull moondream2` | Tiny, basic vision |

Tier 6 — Embeddings

| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| nomic-embed-text | 0.3 GB | ~0.5 GB | `ollama pull nomic-embed-text` | Good for semantic search |
| mxbai-embed-large | 0.7 GB | ~1 GB | `ollama pull mxbai-embed-large` | Higher quality embeddings |

For maximum coverage across all use cases:

| # | Model | Disk | Use Case |
|---|---|---|---|
| 1 | qwen2.5-coder:32b | 19 GB | Primary coding model (TS, Python, Swift) |
| 2 | qwen2.5-coder:7b | 5 GB | Fast coding completions |
| 3 | deepseek-coder-v2:16b | 10 GB | Alternative coding model |
| 4 | llama3.1:8b | 4.9 GB | Eval default, general tasks |
| 5 | deepseek-r1:32b | 20 GB | Deep reasoning, complex triage |
| 6 | codestral:22b | 13 GB | Fast code completions (Mistral) |
| 7 | phi4:14b | 9 GB | Reasoning, structured output |
| 8 | llava:34b | 22 GB | Vision / image understanding |
| 9 | mistral-nemo:12b | 7 GB | Fast general purpose |
| 10 | nomic-embed-text | 0.3 GB | Embeddings / semantic search |
| | Total | ~110 GB | |

All 10 can sit on disk simultaneously. RAM is consumed only by whichever model is currently loaded, so disk space, not memory, limits how many you keep installed.
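To check the disk budget before pulling everything, the listed sizes can simply be summed. The sketch below copies the rounded sizes from the table; they add up to roughly 110 GB, and actual usage depends on the exact tags pulled.

```python
# Disk sizes in GB, copied from the table above (rounded figures).
DISK_GB = {
    "qwen2.5-coder:32b": 19,
    "qwen2.5-coder:7b": 5,
    "deepseek-coder-v2:16b": 10,
    "llama3.1:8b": 4.9,
    "deepseek-r1:32b": 20,
    "codestral:22b": 13,
    "phi4:14b": 9,
    "llava:34b": 22,
    "mistral-nemo:12b": 7,
    "nomic-embed-text": 0.3,
}

total_gb = sum(DISK_GB.values())
largest = max(DISK_GB, key=DISK_GB.get)  # model with the biggest footprint

print(f"{len(DISK_GB)} models, {total_gb:.1f} GB on disk; largest is {largest}")
```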


By Use Case (Quick Reference)

| Use Case | Best Model | Fallback |
|---|---|---|
| TypeScript/ESM coding | qwen2.5-coder:32b | qwen2.5-coder:7b |
| Python coding | qwen2.5-coder:32b | deepseek-coder-v2:16b |
| Swift/iOS coding | qwen2.5-coder:32b | codestral:22b |
| Extraction evals | llama3.1:8b | qwen2.5-coder:32b |
| JSON structured output | qwen2.5-coder:32b | qwen2.5:7b |
| Complex reasoning / triage | deepseek-r1:32b | phi4:14b |
| Brain signal routing | deepseek-r1:32b | qwen2.5-coder:32b |
| Image understanding | llava:34b | qwen2.5vl:7b |
| Embeddings | nomic-embed-text | mxbai-embed-large |
| Fast iteration / dev evals | llama3.1:8b | qwen2.5-coder:7b |
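If the quick reference is consumed by a script, the best/fallback pairs map naturally onto a lookup. The sketch below is one possible shape, not part of the existing tooling: the dictionary keys (`"typescript"`, `"extraction_evals"`, and so on) and the `pick_model` helper are hypothetical names, while the model pairs are copied from the table.

```python
from typing import Optional, Set

# (best, fallback) pairs copied from the quick-reference table.
# The use-case keys are hypothetical identifiers chosen for this sketch.
USE_CASE_MODELS = {
    "typescript": ("qwen2.5-coder:32b", "qwen2.5-coder:7b"),
    "python": ("qwen2.5-coder:32b", "deepseek-coder-v2:16b"),
    "swift": ("qwen2.5-coder:32b", "codestral:22b"),
    "extraction_evals": ("llama3.1:8b", "qwen2.5-coder:32b"),
    "json_output": ("qwen2.5-coder:32b", "qwen2.5:7b"),
    "complex_reasoning": ("deepseek-r1:32b", "phi4:14b"),
    "brain_signal_routing": ("deepseek-r1:32b", "qwen2.5-coder:32b"),
    "image_understanding": ("llava:34b", "qwen2.5vl:7b"),
    "embeddings": ("nomic-embed-text", "mxbai-embed-large"),
    "dev_evals": ("llama3.1:8b", "qwen2.5-coder:7b"),
}


def pick_model(use_case: str, installed: Set[str]) -> Optional[str]:
    """Return the best installed model for a use case, else the fallback, else None."""
    best, fallback = USE_CASE_MODELS[use_case]
    if best in installed:
        return best
    if fallback in installed:
        return fallback
    return None


# Example: only the two models marked as installed in this guide.
installed_now = {"llama3.1:8b", "qwen2.5-coder:32b"}
print(pick_model("brain_signal_routing", installed_now))
```

With only those two models installed, brain signal routing resolves to its fallback, because deepseek-r1:32b is not available.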

Comprehensive Model Comparison Table

All models discussed — detailed capability reference for M4 Pro 48 GB:

| Model | Disk | Params | Quant | RAM | Tok/s | JSON | Reasoning | Code | Instruction Following | Context | `<think>` | Status on this machine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama3.1:8b | 4.9 GB | 8B | Q4_K_M | ~6 GB | 40-60 | Good | | | | 128k | | Installed |
| qwen2.5-coder:32b | 18.5 GB | 32.8B | Q4_K_M | ~22 GB | 15-25 | Excellent | | | | 128k | | Installed |
| deepseek-r1:32b | 20 GB | 32B | Q4_K_M | ~22 GB | 12-20 | ⚠️ Needs strip | | | | 128k | Yes | 🔲 Not installed |
| llama3.3:70b (Q4) | 40 GB | 70B | Q4_K_M | ~42 GB | 5-10 | Excellent | | | | 128k | | ⚠️ Tight (6 GB left for OS) |
| qwen2.5:7b | 5 GB | 7B | Q4_K_M | ~6 GB | 40-60 | Excellent | | | | 128k | | 🔲 Not installed |
| deepseek-r1:7b | 5 GB | 7B | Q4_K_M | ~6 GB | 35-50 | ⚠️ Needs strip | | | | 128k | Yes | 🔲 Not installed |
| phi4:14b | 9 GB | 14B | Q4_K_M | ~11 GB | 25-35 | Good | | | | 16k | | 🔲 Not installed |
| deepseek-coder-v2:16b | 10 GB | 16B | Q4_K_M | ~12 GB | 25-35 | Good | | | | 128k | | 🔲 Not installed |
| codestral:22b | 13 GB | 22B | Q4_K_M | ~15 GB | 20-30 | Good | | | | 32k | | 🔲 Not installed |
| gemini-2.5-flash | Cloud | | | | ~1s/req | Excellent | | | | 1M | | ☁️ Cloud ($0.003/run) |
| gpt-4o | Cloud | | | | ~12s/req | Excellent | | | | 128k | | ☁️ Cloud ($0.05-0.15/run) |
Column Key

| Column | Meaning |
|---|---|
| Tok/s | Tokens per second on M4 Pro 48 GB (Metal backend) |
| JSON | Reliability of structured JSON output compliance |
| Reasoning | Multi-step / chain-of-thought quality (rated from weak to best) |
| Code | Code generation quality across TS/Python/Swift |
| Instruction Following | Adherence to output format constraints |
| `<think>` | Emits reasoning traces before output (needs stripping for JSON) |
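The Tok/s column translates into wall-clock time by dividing the response length by the throughput. The sketch below reads the column as ranges (for example 15-25 tok/s for qwen2.5-coder:32b). It covers generation only and ignores prompt processing and model load time. The 500-token response length is an arbitrary illustration.

```python
from typing import Tuple


def generation_seconds(num_tokens: int, tok_s_range: Tuple[float, float]) -> Tuple[float, float]:
    """Return (best_case, worst_case) seconds to generate num_tokens."""
    slow, fast = tok_s_range
    return (num_tokens / fast, num_tokens / slow)


# Tok/s ranges as read from the comparison table.
TOK_S = {
    "llama3.1:8b": (40, 60),
    "qwen2.5-coder:32b": (15, 25),
    "llama3.3:70b": (5, 10),
}

for model, tok_s in TOK_S.items():
    best, worst = generation_seconds(500, tok_s)
    print(f"{model}: {best:.0f}-{worst:.0f} s for 500 tokens")
```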

Gap Analysis vs llama3.3:70b (cloud-quality ceiling locally)

| Capability | llama3.1:8b | qwen2.5-coder:32b | deepseek-r1:32b |
|---|---|---|---|
| Multi-step reasoning | ~40% | ~65% | ~80% |
| Strict JSON compliance | ~75% | ~95% | ~70%* |
| Brain signal routing | ~60% | ~80% | ~90% |
| Code generation | ~55% | ~95% | ~80% |
| Instruction following | ~70% | ~90% | ~85% |
| Overall vs 70B | ~55% | ~85% | ~75-80% |

*With <think> strip transform applied


Hardware Guide (General)

For reference if running on different hardware:

| RAM | Max Model Size | Recommendation |
|---|---|---|
| 8 GB | 7B | qwen2.5-coder:7b |
| 16 GB | 13-16B | deepseek-coder-v2:16b |
| 24 GB | 32B | qwen2.5-coder:32b |
| 32 GB | 32B + headroom | qwen2.5-coder:32b (comfortable) |
| 48 GB | 70B (Q4) | llama3.1:70b or any 32B comfortably |
| 64 GB+ | 70B (Q8) | Higher-precision 70B models (8-bit quantization) |
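The guide reads as a threshold lookup: take the largest row whose RAM figure does not exceed what the machine has. A minimal sketch follows. The `recommend` helper is a hypothetical name, and treating each row's RAM value as a minimum is an assumption about how the table is meant to be used.

```python
from typing import Optional

# (minimum RAM in GB, recommendation) rows from the hardware guide, smallest first.
HARDWARE_GUIDE = [
    (8, "qwen2.5-coder:7b"),
    (16, "deepseek-coder-v2:16b"),
    (24, "qwen2.5-coder:32b"),
    (32, "qwen2.5-coder:32b"),
    (48, "llama3.1:70b"),
    (64, "70B (Q8)"),
]


def recommend(ram_gb: float) -> Optional[str]:
    """Return the recommendation for the largest tier that fits, or None below 8 GB."""
    choice = None
    for min_ram_gb, model in HARDWARE_GUIDE:
        if ram_gb >= min_ram_gb:
            choice = model  # keep upgrading while the machine meets the tier
    return choice


for ram in (8, 36, 48, 128):
    print(f"{ram} GB -> {recommend(ram)}")
```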