saravanakumardb1 3561deee52 docs(local-llm): add multimodal stack, model recommendations, and troubleshooting

- docs/04-multimodal-local-stack.md: vision models (llava, qwen2.5vl, moondream2),
  audio pipeline architecture, video understanding status, Kimi alternatives,
  complete local AI stack diagram
- docs/07-model-recommendations.md: 6-tier model guide (coding, fast, general,
  reasoning, vision, embeddings), recommended 10-model stack for M4 Pro 48GB,
  use-case quick reference, hardware scaling guide
- docs/08-troubleshooting.md: corporate Forcepoint proxy workarounds, MLX warning,
  JSON parse errors, slow inference, whisper-cli vs whisper-cpp naming, audio
  format conversion, proxy-corrupted downloads detection

2026-02-19 13:01:22 -08:00

6.6 KiB

Raw Blame History

07 — Model Recommendations

Tiered model guide by use case, size, and quality for Apple M4 Pro with 48 GB unified memory.

Tier 1 — Best Overall Coding Models

Model	Size	RAM Used	Pull Command	Notes
`qwen2.5-coder:32b`	19 GB	~22 GB	`ollama pull qwen2.5-coder:32b`	Top pick — rivals GPT-4o on code, 128k context
`deepseek-coder-v2:16b`	10 GB	~12 GB	`ollama pull deepseek-coder-v2`	Best open-source coding model at 16B
`codestral:22b`	13 GB	~15 GB	`ollama pull codestral`	Mistral's coding model, very fast completions

Tier 2 — Fast & Capable (Speed/Quality Balance)

Model	Size	RAM Used	Pull Command	Notes
`qwen2.5-coder:7b`	5 GB	~6 GB	`ollama pull qwen2.5-coder:7b`	Fast, surprisingly good for TS/Python/Swift
`deepseek-coder:6.7b`	4 GB	~5 GB	`ollama pull deepseek-coder:6.7b`	Lightweight, solid everyday coding
`codegemma:7b`	5 GB	~6 GB	`ollama pull codegemma:7b`	Google's model, decent but outclassed by Qwen

Tier 3 — General Purpose (Also Good at Code)

Model	Size	RAM Used	Pull Command	Notes
`llama3.1:70b` (Q4)	40 GB	~42 GB	`ollama pull llama3.1:70b`	Best general model — tight on 48 GB
`llama3.1:8b`	4.9 GB	~6 GB	`ollama pull llama3.1:8b`	Fast, good for evals
`mistral-nemo:12b`	7 GB	~9 GB	`ollama pull mistral-nemo`	Fast reasoning
`phi4:14b`	9 GB	~11 GB	`ollama pull phi4`	Strong reasoning, fits comfortably

Tier 4 — Reasoning & Deep Thinking

Model	Size	RAM Used	Pull Command	Notes
`deepseek-r1:32b`	20 GB	~22 GB	`ollama pull deepseek-r1:32b`	Chain-of-thought reasoning, closest to Kimi k1.5
`deepseek-r1:7b`	5 GB	~6 GB	`ollama pull deepseek-r1:7b`	Lightweight reasoning

Tier 5 — Vision (Multimodal)

Model	Size	RAM Used	Pull Command	Notes
`llava:34b`	22 GB	~22 GB	`ollama pull llava:34b`	Image understanding, OCR
`qwen2.5vl:7b`	6 GB	~6 GB	`ollama pull qwen2.5vl:7b`	Qwen vision, fast
`minicpm-v:8b`	6 GB	~6 GB	`ollama pull minicpm-v:8b`	Strong OCR
`moondream2`	2 GB	~2 GB	`ollama pull moondream2`	Tiny, basic vision

Tier 6 — Embeddings

Model	Size	RAM Used	Pull Command	Notes
`nomic-embed-text`	0.3 GB	~0.5 GB	`ollama pull nomic-embed-text`	Good for semantic search
`mxbai-embed-large`	0.7 GB	~1 GB	`ollama pull mxbai-embed-large`	Higher quality embeddings

Recommended 10-Model Stack for M4 Pro 48 GB

For maximum coverage across all use cases:

#	Model	Disk	Use Case
1	`qwen2.5-coder:32b`	19 GB	Primary — coding (TS, Python, Swift)
2	`qwen2.5-coder:7b`	5 GB	Fast coding completions
3	`deepseek-coder-v2:16b`	10 GB	Alternative coding model
4	`llama3.1:8b`	4.9 GB	Eval default, general tasks
5	`deepseek-r1:32b`	20 GB	Deep reasoning, complex triage
6	`codestral:22b`	13 GB	Fast code completions (Mistral)
7	`phi4:14b`	9 GB	Reasoning, structured output
8	`llava:34b`	22 GB	Vision / image understanding
9	`mistral-nemo:12b`	7 GB	Fast general purpose
10	`nomic-embed-text`	0.3 GB	Embeddings / semantic search
	Total	~115 GB

Only one loads into RAM at a time. You can have all 10 on disk simultaneously.

By Use Case (Quick Reference)

Use Case	Best Model	Fallback
TypeScript/ESM coding	`qwen2.5-coder:32b`	`qwen2.5-coder:7b`
Python coding	`qwen2.5-coder:32b`	`deepseek-coder-v2:16b`
Swift/iOS coding	`qwen2.5-coder:32b`	`codestral:22b`
Extraction evals	`llama3.1:8b`	`qwen2.5:7b`
JSON structured output	`qwen2.5:7b`	`qwen2.5-coder:7b`
Complex reasoning	`deepseek-r1:32b`	`phi4:14b`
Image understanding	`llava:34b`	`qwen2.5vl:7b`
Embeddings	`nomic-embed-text`	`mxbai-embed-large`
Fast iteration	`qwen2.5-coder:7b`	`llama3.1:8b`

Hardware Guide (General)

For reference if running on different hardware:

RAM	Max Model Size	Recommendation
8 GB	7B	`qwen2.5-coder:7b`
16 GB	13-16B	`deepseek-coder-v2:16b`
24 GB	32B	`qwen2.5-coder:32b`
32 GB	32B + headroom	`qwen2.5-coder:32b` (comfortable)
48 GB	70B (Q4)	`llama3.1:70b` or any 32B comfortably
64 GB+	70B (Q8)	Full precision 70B models

6.6 KiB Raw Blame History

07 — Model Recommendations

Tier 1 — Best Overall Coding Models

Tier 2 — Fast & Capable (Speed/Quality Balance)

Tier 3 — General Purpose (Also Good at Code)

Tier 4 — Reasoning & Deep Thinking

Tier 5 — Vision (Multimodal)

Tier 6 — Embeddings

Recommended 10-Model Stack for M4 Pro 48 GB

By Use Case (Quick Reference)

Hardware Guide (General)

6.6 KiB

Raw Blame History