saravanakumardb1 5deb5efdcf docs(local-llms): add comprehensive model comparison table and deepseek-r1:32b details

- Add Comprehensive Model Comparison Table: 11 models (local + cloud) with
  Disk, Params, Quant, RAM, Tok/s, JSON quality, Reasoning, Code, Instruction
  Following, Context window, <think> flag, and install status columns
- Add Gap Analysis table: llama3.1:8b (~55%), qwen2.5-coder:32b (~85%),
  deepseek-r1:32b (~75-80%) vs llama3.3:70b across 5 capability dimensions
- Update Tier 4 Reasoning table: add Parameters, Quant columns; add <think>
  warning note with link to eval doc transform pattern
- Update By Use Case table: add brain signal routing row, update extraction
  evals fallback to qwen2.5-coder:32b

2026-02-19 16:06:02 -08:00


07 — Model Recommendations

Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.

Tier 1 — Best Overall Coding Models

Model	Size	RAM Used	Pull Command	Notes
`qwen2.5-coder:32b`	19 GB	~22 GB	`ollama pull qwen2.5-coder:32b`	Top pick — rivals GPT-4o on code, 128k context
`deepseek-coder-v2:16b`	10 GB	~12 GB	`ollama pull deepseek-coder-v2`	Best open-source coding model at 16B
`codestral:22b`	13 GB	~15 GB	`ollama pull codestral`	Mistral's coding model, very fast completions

Tier 2 — Fast & Capable (Speed/Quality Balance)

Model	Size	RAM Used	Pull Command	Notes
`qwen2.5-coder:7b`	5 GB	~6 GB	`ollama pull qwen2.5-coder:7b`	Fast, surprisingly good for TS/Python/Swift
`deepseek-coder:6.7b`	4 GB	~5 GB	`ollama pull deepseek-coder:6.7b`	Lightweight, solid everyday coding
`codegemma:7b`	5 GB	~6 GB	`ollama pull codegemma:7b`	Google's model, decent but outclassed by Qwen

Tier 3 — General Purpose (Also Good at Code)

Model	Size	RAM Used	Pull Command	Notes
`llama3.1:70b` (Q4)	40 GB	~42 GB	`ollama pull llama3.1:70b`	Best general model — tight on 48 GB
`llama3.1:8b`	4.9 GB	~6 GB	`ollama pull llama3.1:8b`	Fast, good for evals
`mistral-nemo:12b`	7 GB	~9 GB	`ollama pull mistral-nemo`	Fast reasoning
`phi4:14b`	9 GB	~11 GB	`ollama pull phi4`	Strong reasoning, fits comfortably

Tier 4 — Reasoning & Deep Thinking

Model	Size	Parameters	Quant	RAM Used	Pull Command	Notes
`deepseek-r1:32b`	20 GB	32B	Q4_K_M	~22 GB	`ollama pull deepseek-r1:32b`	Chain-of-thought reasoning — emits `<think>` blocks before JSON output; ~75–80% of llama3.3:70b reasoning quality at half the RAM
`deepseek-r1:7b`	5 GB	7B	Q4_K_M	~6 GB	`ollama pull deepseek-r1:7b`	Lightweight reasoning, good for quick triage

⚠️ JSON output note: DeepSeek R1 models emit `<think>...</think>` reasoning traces before the JSON response. Strip these before calling `JSON.parse()`; see `06-extraction-service-evals.md` for the transform pattern.
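The stripping step described in the note can be sketched as follows. This is a minimal Python illustration, not the transform from `06-extraction-service-evals.md` (which is not reproduced here); the function name `parse_r1_json` and the sample response are hypothetical. The same regular expression should carry over to a TypeScript call site ahead of `JSON.parse()`.

```python
import json
import re

# Matches one complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str):
    """Remove <think> traces from a model response, then parse the rest as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


# Hypothetical response shaped like the output described in the note.
raw_response = '<think>\nThe user wants a priority.\n</think>\n{"priority": "high"}'
result = parse_r1_json(raw_response)
```

A response with an unclosed `<think>` tag (for example, truncated output) is left untouched by the pattern, so `json.loads` raises and the caller can retry or fall back.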

Tier 5 — Vision (Multimodal)

Model	Size	RAM Used	Pull Command	Notes
`llava:34b`	22 GB	~22 GB	`ollama pull llava:34b`	Image understanding, OCR
`qwen2.5vl:7b`	6 GB	~6 GB	`ollama pull qwen2.5vl:7b`	Qwen vision, fast
`minicpm-v:8b`	6 GB	~6 GB	`ollama pull minicpm-v:8b`	Strong OCR
`moondream`	2 GB	~2 GB	`ollama pull moondream`	Tiny, basic vision (Moondream 2)

Tier 6 — Embeddings

Model	Size	RAM Used	Pull Command	Notes
`nomic-embed-text`	0.3 GB	~0.5 GB	`ollama pull nomic-embed-text`	Good for semantic search
`mxbai-embed-large`	0.7 GB	~1 GB	`ollama pull mxbai-embed-large`	Higher quality embeddings

Recommended 10-Model Stack for M4 Pro 48 GB

For maximum coverage across all use cases:

#	Model	Disk	Use Case
1	`qwen2.5-coder:32b`	19 GB	Primary — coding (TS, Python, Swift)
2	`qwen2.5-coder:7b`	5 GB	Fast coding completions
3	`deepseek-coder-v2:16b`	10 GB	Alternative coding model
4	`llama3.1:8b`	4.9 GB	Eval default, general tasks
5	`deepseek-r1:32b`	20 GB	Deep reasoning, complex triage
6	`codestral:22b`	13 GB	Fast code completions (Mistral)
7	`phi4:14b`	9 GB	Reasoning, structured output
8	`llava:34b`	22 GB	Vision / image understanding
9	`mistral-nemo:12b`	7 GB	Fast general purpose
10	`nomic-embed-text`	0.3 GB	Embeddings / semantic search
	Total	~110 GB

Models take up RAM only while loaded, so all 10 can sit on disk at once. On 48 GB, plan on keeping one large model in memory at a time.
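To check the disk budget before pulling the whole stack, the Disk column can be summed and compared with free space. A minimal sketch: the sizes are copied from the table above, while the helper names and the default path `/` are assumptions (the document does not say which volume holds the models).

```python
import shutil

# Disk sizes in GB, copied from the 10-model stack table.
STACK_DISK_GB = {
    "qwen2.5-coder:32b": 19,
    "qwen2.5-coder:7b": 5,
    "deepseek-coder-v2:16b": 10,
    "llama3.1:8b": 4.9,
    "deepseek-r1:32b": 20,
    "codestral:22b": 13,
    "phi4:14b": 9,
    "llava:34b": 22,
    "mistral-nemo:12b": 7,
    "nomic-embed-text": 0.3,
}


def total_disk_gb(stack: dict[str, float]) -> float:
    """Sum the on-disk sizes, rounded to one decimal place."""
    return round(sum(stack.values()), 1)


def free_disk_gb(path: str = "/") -> float:
    """Free space on the volume holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    needed = total_disk_gb(STACK_DISK_GB)
    print(f"stack needs {needed} GB, {free_disk_gb():.0f} GB free")
```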

By Use Case (Quick Reference)

Use Case	Best Model	Fallback
TypeScript/ESM coding	`qwen2.5-coder:32b`	`qwen2.5-coder:7b`
Python coding	`qwen2.5-coder:32b`	`deepseek-coder-v2:16b`
Swift/iOS coding	`qwen2.5-coder:32b`	`codestral:22b`
Extraction evals	`llama3.1:8b`	`qwen2.5-coder:32b`
JSON structured output	`qwen2.5-coder:32b`	`qwen2.5:7b`
Complex reasoning / triage	`deepseek-r1:32b`	`phi4:14b`
Brain signal routing	`deepseek-r1:32b`	`qwen2.5-coder:32b`
Image understanding	`llava:34b`	`qwen2.5vl:7b`
Embeddings	`nomic-embed-text`	`mxbai-embed-large`
Fast iteration / dev evals	`llama3.1:8b`	`qwen2.5-coder:7b`
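The table above can be encoded as a small routing map that prefers the best model and drops to the fallback when the best one is not installed. This is a sketch only: the use-case keys and the `pick_model` helper are hypothetical names, not part of any existing service.

```python
# (best, fallback) pairs copied from the quick-reference table.
# The dictionary keys are illustrative labels, not an established API.
ROUTES = {
    "typescript": ("qwen2.5-coder:32b", "qwen2.5-coder:7b"),
    "python": ("qwen2.5-coder:32b", "deepseek-coder-v2:16b"),
    "swift": ("qwen2.5-coder:32b", "codestral:22b"),
    "extraction_evals": ("llama3.1:8b", "qwen2.5-coder:32b"),
    "json_output": ("qwen2.5-coder:32b", "qwen2.5:7b"),
    "complex_reasoning": ("deepseek-r1:32b", "phi4:14b"),
    "brain_signal_routing": ("deepseek-r1:32b", "qwen2.5-coder:32b"),
    "image_understanding": ("llava:34b", "qwen2.5vl:7b"),
    "embeddings": ("nomic-embed-text", "mxbai-embed-large"),
    "dev_evals": ("llama3.1:8b", "qwen2.5-coder:7b"),
}


def pick_model(use_case: str, installed: set[str]) -> str:
    """Return the best installed model for a use case, else its fallback."""
    best, fallback = ROUTES[use_case]
    if best in installed:
        return best
    if fallback in installed:
        return fallback
    raise LookupError(f"no installed model for {use_case!r}")


installed_now = {"llama3.1:8b", "qwen2.5-coder:32b"}
choice = pick_model("brain_signal_routing", installed_now)
```

With only `llama3.1:8b` and `qwen2.5-coder:32b` installed, brain signal routing resolves to the fallback because `deepseek-r1:32b` is missing.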

Comprehensive Model Comparison Table

Detailed capability reference for 11 models (9 local, 2 cloud) on M4 Pro 48 GB:

Model	Disk	Params	Quant	RAM	Tok/s	JSON	Reasoning	Code	Instruction Following	Context	`<think>`	Status on this machine
`llama3.1:8b`	4.9 GB	8B	Q4_K_M	~6 GB	40–60	✅ Good	⭐⭐	⭐⭐	⭐⭐⭐	128k	❌	✅ Installed
`qwen2.5-coder:32b`	18.5 GB	32.8B	Q4_K_M	~22 GB	15–25	✅ Excellent	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	128k	❌	✅ Installed
`deepseek-r1:32b`	20 GB	32B	Q4_K_M	~22 GB	12–20	⚠️ Needs strip	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	128k	✅ Yes	🔲 Not installed
`llama3.3:70b` (Q4)	40 GB	70B	Q4_K_M	~42 GB	5–10	✅ Excellent	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	128k	❌	⚠️ Tight (6 GB left for OS)
`qwen2.5:7b`	5 GB	7B	Q4_K_M	~6 GB	40–60	✅ Excellent	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	128k	❌	🔲 Not installed
`deepseek-r1:7b`	5 GB	7B	Q4_K_M	~6 GB	35–50	⚠️ Needs strip	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	128k	✅ Yes	🔲 Not installed
`phi4:14b`	9 GB	14B	Q4_K_M	~11 GB	25–35	✅ Good	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	16k	❌	🔲 Not installed
`deepseek-coder-v2:16b`	10 GB	16B	Q4_K_M	~12 GB	25–35	✅ Good	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	128k	❌	🔲 Not installed
`codestral:22b`	13 GB	22B	Q4_K_M	~15 GB	20–30	✅ Good	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	32k	❌	🔲 Not installed
`gemini-2.5-flash`	—	—	Cloud	—	~1s/req	✅ Excellent	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	1M	❌	☁️ Cloud ($0.003/run)
`gpt-4o`	—	—	Cloud	—	~1–2s/req	✅ Excellent	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	128k	❌	☁️ Cloud ($0.05–0.15/run)

Column Key

Column	Meaning
Tok/s	Generated tokens per second on M4 Pro 48 GB (Metal backend); cloud rows list approximate latency per request instead
JSON	Reliability of structured JSON output compliance
Reasoning	Multi-step / chain-of-thought quality (⭐ = weak, ⭐⭐⭐⭐⭐ = best)
Code	Code generation quality across TS/Python/Swift
Instruction Following	Adherence to output format constraints
`<think>`	Emits reasoning traces before output (needs stripping for JSON)
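The document does not state how the Tok/s figures were measured. One way to reproduce them, assuming the final response object from Ollama's `/api/generate` endpoint (which reports `eval_count` in tokens and `eval_duration` in nanoseconds), is sketched below. The sample values are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's eval_count / eval_duration fields."""
    eval_count = response["eval_count"]                 # tokens generated
    eval_seconds = response["eval_duration"] / 1e9      # nanoseconds -> seconds
    return eval_count / eval_seconds


# Illustrative numbers only: 300 tokens generated in 6 seconds.
sample = {"eval_count": 300, "eval_duration": 6_000_000_000}
speed = tokens_per_second(sample)
```

Prompt processing is reported separately (`prompt_eval_count`, `prompt_eval_duration`), so this figure covers generation only.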

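Gap Analysis vs llama3.3:70b (the cloud-quality ceiling for local models)

Each value is an approximate share of llama3.3:70b quality on that dimension, so higher is better.

Capability	`llama3.1:8b`	`qwen2.5-coder:32b`	`deepseek-r1:32b`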
Multi-step reasoning	~40%	~65%	~80%
Strict JSON compliance	~75%	~95%	~70%*
Brain signal routing	~60%	~80%	~90%
Code generation	~55%	~95%	~80%
Instruction following	~70%	~90%	~85%
Overall vs 70B	~55%	~85%	~75–80%

*With the `<think>` strip transform applied

Hardware Guide (General)

For reference if running on different hardware:

RAM	Max Model Size	Recommendation
8 GB	7B	`qwen2.5-coder:7b`
16 GB	13-16B	`deepseek-coder-v2:16b`
24 GB	32B	`qwen2.5-coder:32b`
32 GB	32B + headroom	`qwen2.5-coder:32b` (comfortable)
48 GB	70B (Q4)	`llama3.1:70b` or any 32B comfortably
64 GB+	70B (Q4, with headroom)	70B models comfortably; Q8 is 8-bit rather than full precision and needs roughly 70 GB for weights alone
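The sizes in this guide follow from a rough rule of thumb: weight memory is parameters times bits per weight, divided by 8. A minimal sketch, with hypothetical helper names; the 4.8 bits-per-weight figure used for Q4_K_M is an assumed average, not an exact value.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters (billions) x bits / 8."""
    return params_billion * bits_per_weight / 8


def headroom_gb(ram_gb: float, params_billion: float, bits_per_weight: float) -> float:
    """RAM left after loading weights only; negative means the weights do not fit."""
    return ram_gb - estimate_weights_gb(params_billion, bits_per_weight)


# Assumed average of ~4.8 bits per weight for Q4_K_M (approximation).
Q4_K_M_BITS = 4.8

qwen_32b_gb = estimate_weights_gb(32.8, Q4_K_M_BITS)   # compare with the ~19 GB in Tier 1
llama_70b_gb = estimate_weights_gb(70, Q4_K_M_BITS)    # compare with the 40 GB in Tier 3
```

These are weights-only estimates. The KV cache and runtime buffers come on top, which is why the RAM Used columns sit a few GB above Size.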
