07 — Model Recommendations
Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.
Tier 1 — Best Overall Coding Models
| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| qwen2.5-coder:32b | 19 GB | ~22 GB | `ollama pull qwen2.5-coder:32b` | Top pick: rivals GPT-4o on code, 128k context |
| deepseek-coder-v2:16b | 10 GB | ~12 GB | `ollama pull deepseek-coder-v2` | Best open-source coding model at 16B |
| codestral:22b | 13 GB | ~15 GB | `ollama pull codestral` | Mistral's coding model, very fast completions |
Tier 2 — Fast & Capable (Speed/Quality Balance)
| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| qwen2.5-coder:7b | 5 GB | ~6 GB | `ollama pull qwen2.5-coder:7b` | Fast, surprisingly good for TS/Python/Swift |
| deepseek-coder:6.7b | 4 GB | ~5 GB | `ollama pull deepseek-coder:6.7b` | Lightweight, solid everyday coding |
| codegemma:7b | 5 GB | ~6 GB | `ollama pull codegemma:7b` | Google's model, decent but outclassed by Qwen |
Tier 3 — General Purpose (Also Good at Code)
| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| llama3.1:70b (Q4) | 40 GB | ~42 GB | `ollama pull llama3.1:70b` | Best general model, but tight on 48 GB |
| llama3.1:8b | 4.9 GB | ~6 GB | `ollama pull llama3.1:8b` | Fast, good for evals |
| mistral-nemo:12b | 7 GB | ~9 GB | `ollama pull mistral-nemo` | Fast reasoning |
| phi4:14b | 9 GB | ~11 GB | `ollama pull phi4` | Strong reasoning, fits comfortably |
Tier 4 — Reasoning & Deep Thinking
| Model | Size | Parameters | Quant | RAM Used | Pull Command | Notes |
|---|---|---|---|---|---|---|
| deepseek-r1:32b | 20 GB | 32B | Q4_K_M | ~22 GB | `ollama pull deepseek-r1:32b` | Chain-of-thought reasoning; emits `<think>` blocks before JSON output; ~75–80% of llama3.3:70b reasoning quality at about half the RAM |
| deepseek-r1:7b | 5 GB | 7B | Q4_K_M | ~6 GB | `ollama pull deepseek-r1:7b` | Lightweight reasoning, good for quick triage |
⚠️ JSON output note: DeepSeek R1 models emit `<think>...</think>` reasoning traces before the JSON response. Strip these before calling `JSON.parse()`; see 06-extraction-service-evals.md for the transform pattern.
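As a rough illustration of the stripping step, here is a minimal TypeScript sketch. It is not necessarily the transform used in 06-extraction-service-evals.md; the helper names `stripThinkBlocks` and `parseModelJson` are hypothetical, and the sketch assumes every `<think>` tag in the response is properly closed.

```typescript
// Hypothetical helper (not taken from 06-extraction-service-evals.md):
// removes every closed <think>...</think> span from a raw model response.
function stripThinkBlocks(raw: string): string {
  // [\s\S] matches newlines too; the lazy *? stops at the first closing tag.
  return raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Hypothetical wrapper: strip the reasoning trace, then parse what remains.
function parseModelJson(raw: string): unknown {
  return JSON.parse(stripThinkBlocks(raw));
}

// Illustrative response text, invented for this example.
const sampleResponse =
  '<think>\nThe user wants a label.\n</think>\n{"label": "bug", "confidence": 0.9}';

const parsedResponse = parseModelJson(sampleResponse) as { label: string; confidence: number };
console.log(parsedResponse.label); // bug
```

A response that is cut off mid-trace (an opening `<think>` with no closing tag) passes through unchanged and will still fail in `JSON.parse()`, so callers may want to treat that case as a retry.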
Tier 5 — Vision (Multimodal)
| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| llava:34b | 22 GB | ~22 GB | `ollama pull llava:34b` | Image understanding, OCR |
| qwen2.5vl:7b | 6 GB | ~6 GB | `ollama pull qwen2.5vl:7b` | Qwen vision, fast |
| minicpm-v:8b | 6 GB | ~6 GB | `ollama pull minicpm-v:8b` | Strong OCR |
| moondream (Moondream 2) | 2 GB | ~2 GB | `ollama pull moondream` | Tiny, basic vision |
Tier 6 — Embeddings
| Model | Size | RAM Used | Pull Command | Notes |
|---|---|---|---|---|
| nomic-embed-text | 0.3 GB | ~0.5 GB | `ollama pull nomic-embed-text` | Good for semantic search |
| mxbai-embed-large | 0.7 GB | ~1 GB | `ollama pull mxbai-embed-large` | Higher quality embeddings |
Recommended 10-Model Stack for M4 Pro 48 GB
For maximum coverage across all use cases:
| # | Model | Disk | Use Case |
|---|---|---|---|
| 1 | qwen2.5-coder:32b | 19 GB | Primary: coding (TS, Python, Swift) |
| 2 | qwen2.5-coder:7b | 5 GB | Fast coding completions |
| 3 | deepseek-coder-v2:16b | 10 GB | Alternative coding model |
| 4 | llama3.1:8b | 4.9 GB | Eval default, general tasks |
| 5 | deepseek-r1:32b | 20 GB | Deep reasoning, complex triage |
| 6 | codestral:22b | 13 GB | Fast code completions (Mistral) |
| 7 | phi4:14b | 9 GB | Reasoning, structured output |
| 8 | llava:34b | 22 GB | Vision / image understanding |
| 9 | mistral-nemo:12b | 7 GB | Fast general purpose |
| 10 | nomic-embed-text | 0.3 GB | Embeddings / semantic search |
| | Total | ~110 GB | |
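Disk and RAM are separate budgets: all 10 models can sit on disk simultaneously, and only a model that is currently loaded occupies RAM. With the 32B-class models at ~22 GB each, plan on one large model in RAM at a time.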
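The disk budget for the stack can be checked with a few lines of TypeScript. This is a sketch only: the `StackEntry` type and the function names are hypothetical, and the sizes are copied from the Disk column of the stack table.

```typescript
// Hypothetical record type for one row of the 10-model stack.
interface StackEntry {
  model: string;
  diskGb: number;
}

// Values copied from the Disk column of the stack table.
const stack: StackEntry[] = [
  { model: "qwen2.5-coder:32b", diskGb: 19 },
  { model: "qwen2.5-coder:7b", diskGb: 5 },
  { model: "deepseek-coder-v2:16b", diskGb: 10 },
  { model: "llama3.1:8b", diskGb: 4.9 },
  { model: "deepseek-r1:32b", diskGb: 20 },
  { model: "codestral:22b", diskGb: 13 },
  { model: "phi4:14b", diskGb: 9 },
  { model: "llava:34b", diskGb: 22 },
  { model: "mistral-nemo:12b", diskGb: 7 },
  { model: "nomic-embed-text", diskGb: 0.3 },
];

// Sum of on-disk sizes in GB.
function totalDiskGb(entries: StackEntry[]): number {
  return entries.reduce((sum, entry) => sum + entry.diskGb, 0);
}

// One `ollama pull` line per model, ready to paste into a shell.
function pullCommands(entries: StackEntry[]): string[] {
  return entries.map((entry) => `ollama pull ${entry.model}`);
}

console.log(totalDiskGb(stack).toFixed(1)); // 110.2
console.log(pullCommands(stack).join("\n"));
```

Keeping the list in one place makes it easy to re-check free disk space before pulling when a model is swapped in or out of the stack.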
By Use Case (Quick Reference)
| Use Case | Best Model | Fallback |
|---|---|---|
| TypeScript/ESM coding | qwen2.5-coder:32b | qwen2.5-coder:7b |
| Python coding | qwen2.5-coder:32b | deepseek-coder-v2:16b |
| Swift/iOS coding | qwen2.5-coder:32b | codestral:22b |
| Extraction evals | llama3.1:8b | qwen2.5-coder:32b |
| JSON structured output | qwen2.5-coder:32b | qwen2.5:7b |
| Complex reasoning / triage | deepseek-r1:32b | phi4:14b |
| Brain signal routing | deepseek-r1:32b | qwen2.5-coder:32b |
| Image understanding | llava:34b | qwen2.5vl:7b |
| Embeddings | nomic-embed-text | mxbai-embed-large |
| Fast iteration / dev evals | llama3.1:8b | qwen2.5-coder:7b |
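The quick-reference table maps naturally onto a lookup with a fallback. The sketch below is one possible encoding, not code from this project: `Route`, `routes`, and `pickModel` are hypothetical names, and the set of installed models is supplied by the caller.

```typescript
// Hypothetical shape for one row of the quick-reference table.
interface Route {
  best: string;
  fallback: string;
}

// Keys are the table's "Use Case" labels; values are its two model columns.
const routes = new Map<string, Route>([
  ["TypeScript/ESM coding", { best: "qwen2.5-coder:32b", fallback: "qwen2.5-coder:7b" }],
  ["Python coding", { best: "qwen2.5-coder:32b", fallback: "deepseek-coder-v2:16b" }],
  ["Swift/iOS coding", { best: "qwen2.5-coder:32b", fallback: "codestral:22b" }],
  ["Extraction evals", { best: "llama3.1:8b", fallback: "qwen2.5-coder:32b" }],
  ["JSON structured output", { best: "qwen2.5-coder:32b", fallback: "qwen2.5:7b" }],
  ["Complex reasoning / triage", { best: "deepseek-r1:32b", fallback: "phi4:14b" }],
  ["Brain signal routing", { best: "deepseek-r1:32b", fallback: "qwen2.5-coder:32b" }],
  ["Image understanding", { best: "llava:34b", fallback: "qwen2.5vl:7b" }],
  ["Embeddings", { best: "nomic-embed-text", fallback: "mxbai-embed-large" }],
  ["Fast iteration / dev evals", { best: "llama3.1:8b", fallback: "qwen2.5-coder:7b" }],
]);

// Returns the best installed model for a use case, then the fallback,
// or undefined when neither is installed or the use case is unknown.
function pickModel(useCase: string, installed: ReadonlySet<string>): string | undefined {
  const route = routes.get(useCase);
  if (route === undefined) return undefined;
  if (installed.has(route.best)) return route.best;
  if (installed.has(route.fallback)) return route.fallback;
  return undefined;
}

// Example: only the two models marked "Installed" in the comparison table.
const installedModels = new Set(["llama3.1:8b", "qwen2.5-coder:32b"]);
console.log(pickModel("Brain signal routing", installedModels)); // qwen2.5-coder:32b
```

Returning `undefined` rather than a silent default keeps the caller in charge of what happens when neither model is available, for example prompting for a pull.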
Comprehensive Model Comparison Table
All models discussed — detailed capability reference for M4 Pro 48 GB:
| Model | Disk | Params | Quant | RAM | Tok/s | JSON | Reasoning | Code | Instruction Following | Context | `<think>` | Status on this machine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama3.1:8b | 4.9 GB | 8B | Q4_K_M | ~6 GB | 40–60 | ✅ Good | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | 128k | ❌ | ✅ Installed |
| qwen2.5-coder:32b | 18.5 GB | 32.8B | Q4_K_M | ~22 GB | 15–25 | ✅ Excellent | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ✅ Installed |
| deepseek-r1:32b | 20 GB | 32B | Q4_K_M | ~22 GB | 12–20 | ⚠️ Needs strip | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ✅ Yes | 🔲 Not installed |
| llama3.3:70b (Q4) | 40 GB | 70B | Q4_K_M | ~42 GB | 5–10 | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ⚠️ Tight (~6 GB left for OS) |
| qwen2.5:7b | 5 GB | 7B | Q4_K_M | ~6 GB | 40–60 | ✅ Excellent | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ❌ | 🔲 Not installed |
| deepseek-r1:7b | 5 GB | 7B | Q4_K_M | ~6 GB | 35–50 | ⚠️ Needs strip | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 128k | ✅ Yes | 🔲 Not installed |
| phi4:14b | 9 GB | 14B | Q4_K_M | ~11 GB | 25–35 | ✅ Good | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 16k | ❌ | 🔲 Not installed |
| deepseek-coder-v2:16b | 10 GB | 16B | Q4_K_M | ~12 GB | 25–35 | ✅ Good | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 128k | ❌ | 🔲 Not installed |
| codestral:22b | 13 GB | 22B | Q4_K_M | ~15 GB | 20–30 | ✅ Good | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 32k | ❌ | 🔲 Not installed |
| gemini-2.5-flash | n/a | n/a | Cloud | n/a | ~1 s/req | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 1M | ❌ | ☁️ Cloud ($0.003/run) |
| gpt-4o | n/a | n/a | Cloud | n/a | ~1–2 s/req | ✅ Excellent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 128k | ❌ | ☁️ Cloud ($0.05–0.15/run) |
Column Key
| Column | Meaning |
|---|---|
| Tok/s | Tokens per second on M4 Pro 48 GB (Metal backend) |
| JSON | Reliability of structured JSON output compliance |
| Reasoning | Multi-step / chain-of-thought quality (⭐ = weak, ⭐⭐⭐⭐⭐ = best) |
| Code | Code generation quality across TS/Python/Swift |
| Instruction Following | Adherence to output format constraints |
| `<think>` | Emits reasoning traces before output (needs stripping for JSON) |
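To turn a Tok/s range into a rough wait time, divide the expected output length by the rate. The sketch below assumes a hypothetical 600-token response and ignores model load time and prompt processing, so it is a lower bound on the real wait; `generationSeconds` is an illustrative name.

```typescript
// Estimated generation time in seconds as [best case, worst case] for a
// given output length and a [slow, fast] tokens-per-second range.
function generationSeconds(
  outputTokens: number,
  tokPerSec: [number, number],
): [number, number] {
  const [slow, fast] = tokPerSec;
  if (slow <= 0 || fast < slow) {
    throw new Error("tokPerSec must be a positive [slow, fast] range");
  }
  return [outputTokens / fast, outputTokens / slow];
}

// qwen2.5-coder:32b is listed at 15–25 tok/s; 600 tokens is an assumed length.
console.log(generationSeconds(600, [15, 25])); // [ 24, 40 ]

// llama3.1:8b is listed at 40–60 tok/s.
console.log(generationSeconds(600, [40, 60])); // [ 10, 15 ]
```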
Gap Analysis vs llama3.3:70b (cloud-quality ceiling locally)
| Capability (% of llama3.3:70b quality) | llama3.1:8b | qwen2.5-coder:32b | deepseek-r1:32b |
|---|---|---|---|
| Multi-step reasoning | ~40% | ~65% | ~80% |
| Strict JSON compliance | ~75% | ~95% | ~70%* |
| Brain signal routing | ~60% | ~80% | ~90% |
| Code generation | ~55% | ~95% | ~80% |
| Instruction following | ~70% | ~90% | ~85% |
| Overall vs 70B | ~55% | ~85% | ~75–80% |

\* With the `<think>` strip transform applied
Hardware Guide (General)
For reference if running on different hardware:
| RAM | Max Model Size | Recommendation |
|---|---|---|
| 8 GB | 7B | qwen2.5-coder:7b |
| 16 GB | 13–16B | deepseek-coder-v2:16b |
| 24 GB | 32B | qwen2.5-coder:32b |
| 32 GB | 32B + headroom | qwen2.5-coder:32b (comfortable) |
| 48 GB | 70B (Q4) | llama3.1:70b (tight) or any 32B comfortably |
| 64 GB+ | 70B (above Q4) | 70B models at higher-bit quantization; Q8 is 8-bit, not full precision, and at roughly 75 GB for 70B it needs well over 64 GB |
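The hardware table amounts to "pick the largest model whose RAM footprint fits". A minimal sketch of that rule follows; the `LocalModel` type and `largestFit` function are hypothetical, the RAM figures are the "RAM Used" values from the tier tables, and the 2 GB headroom is an assumption chosen because it reproduces the table's rows. A larger headroom is safer when other apps are running.

```typescript
// Hypothetical record: a model and its approximate RAM footprint when loaded.
interface LocalModel {
  model: string;
  ramGb: number;
}

// "RAM Used" values from the tier tables.
const candidates: LocalModel[] = [
  { model: "qwen2.5-coder:7b", ramGb: 6 },
  { model: "deepseek-coder-v2:16b", ramGb: 12 },
  { model: "qwen2.5-coder:32b", ramGb: 22 },
  { model: "llama3.1:70b", ramGb: 42 },
];

// Largest candidate that fits in totalRamGb minus headroomGb, or undefined.
function largestFit(
  models: LocalModel[],
  totalRamGb: number,
  headroomGb: number,
): LocalModel | undefined {
  const budget = totalRamGb - headroomGb;
  return models
    .filter((m) => m.ramGb <= budget)
    .sort((a, b) => b.ramGb - a.ramGb)[0];
}

// Assumed headroom of 2 GB for the OS; adjust to taste.
for (const ram of [8, 16, 24, 32, 48]) {
  console.log(`${ram} GB -> ${largestFit(candidates, ram, 2)?.model ?? "none"}`);
}
```

With a 6 GB headroom instead, the 24 GB row drops to deepseek-coder-v2:16b, which shows how sensitive the recommendation is to that one assumption.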