- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00
00 — Developer Guide: Local LLM with Ollama
How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.
What Is This?
This machine runs a local LLM server via Ollama, which exposes an OpenAI-compatible API at http://localhost:11434/v1. Code written against the OpenAI SDK can target it by changing only the base URL, API key, and model name, so it can stand in for OpenAI/Gemini/Azure during development.
Models installed:
| Model | Size | Best For |
|---|---|---|
| qwen2.5-coder:32b | 18.5 GB | Code (TS, Python, Swift), structured JSON |
| llama3.1:8b | 4.7 GB | Fast evals, general tasks |
Quick Start
1. Check Ollama is running
curl http://localhost:11434/api/tags
If it returns a JSON list of models — you're good. If it fails:
ollama serve # start in foreground
# or
brew services start ollama # start as background service
2. List available models
ollama list
3. Chat with a model (interactive)
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
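The check from step 1 can also be scripted, for example at the top of an eval runner. The following is a minimal sketch using only the Python standard library; the function names `model_names` and `installed_models` are hypothetical, and it assumes the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint from this guide


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/tags response body."""
    # Assumption: each entry in "models" has a "name" such as "llama3.1:8b".
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(url: str = OLLAMA_TAGS_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None
```

Calling `installed_models()` returns `None` when the server is down, which is the cue to run `ollama serve` or `brew services start ollama`.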
API Endpoint Reference
| Property | Value |
|---|---|
| Base URL | http://localhost:11434/v1 |
| API Key | ollama (any non-empty string works) |
| Protocol | OpenAI-compatible REST |
| Models endpoint | http://localhost:11434/api/tags |
| Loaded models | http://localhost:11434/api/ps |
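As a rough illustration of the "Loaded models" endpoint, the sketch below turns a decoded `/api/ps` response into one line per loaded model. The helper name `summarize_loaded` is hypothetical, and the code assumes each entry in `models` has a `name` and a `size` in bytes; adjust if your Ollama version reports different fields.

```python
import json
import urllib.request
from typing import List

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # endpoint from the table above


def summarize_loaded(ps_payload: dict) -> List[str]:
    """Format each loaded model as '<name>: <size> GiB'."""
    lines = []
    for m in ps_payload.get("models", []):
        # Assumption: "size" is the reported size of the loaded model in bytes.
        gib = m.get("size", 0) / 1024**3
        lines.append(f"{m['name']}: {gib:.1f} GiB")
    return lines


def loaded_models(url: str = OLLAMA_PS_URL, timeout: float = 2.0) -> List[str]:
    """Fetch /api/ps and summarize it; raises if the server is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_loaded(json.load(resp))
```

An empty list from `loaded_models()` means nothing is resident in memory, so the next request will load its model cold from disk.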
Using in Code
curl
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
"response_format": {"type": "json_object"}
}'
TypeScript / Node.js (OpenAI SDK)
import OpenAI from 'openai';
const ollama = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama',
});
const res = await ollama.chat.completions.create({
model: 'qwen2.5-coder:32b',
messages: [{ role: 'user', content: 'Extract action items from: ...' }],
response_format: { type: 'json_object' },
});
console.log(res.choices[0].message.content);
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama",
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Extract action items from: ..."}],
response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
Environment variable pattern (recommended)
Instead of hardcoding the URL, use env vars so code works with both local and cloud:
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b
# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
const client = new OpenAI({
baseURL: process.env.OPENAI_BASE_URL,
apiKey: process.env.OPENAI_API_KEY,
});
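The same pattern in Python might look like the sketch below, which also reads `LLM_MODEL` so the model name switches along with the endpoint. `LLMConfig` and `load_llm_config` are hypothetical names, and falling back to the local Ollama values when a variable is unset is an assumption of this example, not something the services in this repo are known to do.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read endpoint settings from env vars, defaulting to local Ollama."""
    return LLMConfig(
        # Defaults mirror .env.local above (an assumption of this sketch).
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )
```

With the OpenAI SDK this would be used as `cfg = load_llm_config()`, then `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)` and `model=cfg.model` in the request. Accepting the environment as a parameter keeps the loader easy to test without touching the real process environment.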
Running Extraction Service Evals Locally
The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.
cd services/extraction-service
# Run evals with default model (llama3.1:8b)
pnpm eval:ollama
# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama
# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
Logs are written to evals/logs/. See 06-extraction-service-evals.md for full details.
Pointing the Extraction Service Python Sidecar at Ollama
By default the sidecar uses Gemini. Override with:
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
Model Management
# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b
# See what's loaded in RAM right now
ollama ps
# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
-d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
# Remove a model from disk
ollama rm <model>
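For scripts that need to free memory between runs, the unload call above can be built in Python as well. This is a sketch with a hypothetical helper name, `unload_request`; it sends the same body as the curl command and only constructs the request, leaving the actual call to the caller.

```python
import json
import urllib.request


def unload_request(model: str, base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the POST request that asks Ollama to unload `model` from memory."""
    # Same body as the curl example: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(unload_request("qwen2.5-coder:32b"))`, then confirm with `ollama ps` that the model is gone.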
Choosing the Right Model
| Task | Recommended Model | Why |
|---|---|---|
| TypeScript / Python / Swift code | qwen2.5-coder:32b | Best code quality locally |
| Fast evals / iteration | llama3.1:8b | 40–60 tok/s, low RAM |
| Structured JSON extraction | qwen2.5-coder:32b | Excellent format compliance |
| Complex reasoning / triage | deepseek-r1:32b | Chain-of-thought, ~80% of 70B |
| Quick one-off questions | llama3.1:8b | Fastest response |
See 07-model-recommendations.md for the full comparison table.
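If several scripts need to make this choice, the table can be captured once in code. The sketch below is purely illustrative: the task keys and the `pick_model` helper are hypothetical, and defaulting to the fastest model for unknown tasks is an assumption of this example.

```python
# Task keys are made up for this sketch; the model names come from the table above.
MODEL_BY_TASK = {
    "code": "qwen2.5-coder:32b",
    "fast_eval": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}

DEFAULT_MODEL = "llama3.1:8b"  # assumption: fall back to the fastest model


def pick_model(task: str) -> str:
    """Return the recommended model for a task, or the default if unknown."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

Note that `deepseek-r1:32b` has to be pulled first (see Model Management) before requests for it will succeed.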
Important: JSON Output
Always request JSON mode explicitly. Models follow the format more reliably when it is set:
response_format: {
  type: 'json_object',
}
When parsing in promptfoo assertions, output is a raw string — parse it first:
// ✅ Correct
JSON.parse(output).extractions.map(function(e){ return e.extraction_class })
// ❌ Wrong: output is a string, not a parsed object
output.extractions.map(...)
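The same parse-first rule applies when checking outputs from Python, for instance in a custom eval script. A minimal sketch, assuming the model output has the `extractions[].extraction_class` shape used in the assertion above; the helper name `extraction_classes` is hypothetical.

```python
import json
from typing import List


def extraction_classes(output: str) -> List[str]:
    """Parse the raw model output string and list its extraction classes."""
    data = json.loads(output)  # output arrives as a string, so parse it first
    return [e["extraction_class"] for e in data["extractions"]]
```

A `json.JSONDecodeError` here means the model did not return valid JSON, which is worth reporting as a failed case rather than swallowing.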
DeepSeek R1 — strip <think> blocks
R1 models emit reasoning traces before JSON. Strip them:
const raw = res.choices[0].message.content;
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);
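For the Python sidecar or Python-based evals, an equivalent of the snippet above could look like this. It is a sketch with a hypothetical helper name, `parse_r1_json`; like the TypeScript version, it assumes the reasoning trace is wrapped in `<think>...</think>` and that the rest of the content is plain JSON.

```python
import json
import re
from typing import Any

# Non-greedy match across newlines, mirroring /<think>[\s\S]*?<\/think>/g.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str) -> Any:
    """Remove <think> blocks from an R1 response, then parse the remaining JSON."""
    cleaned = _THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)
```

Output without any `<think>` block passes through unchanged, so the helper can be applied to every model's response, not only R1's.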
Troubleshooting
| Problem | Fix |
|---|---|
| connection refused on port 11434 | Run ollama serve or brew services start ollama |
| Model pull fails / hangs | Use NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model> |
| MLX dynamic library not available warning | Harmless; the Metal backend is used automatically |
| Slow responses | Check ollama ps — model may be loading cold from disk (first request takes 5–15s) |
| Out of memory | Run ollama ps and unload unused models with keep_alive: "0" |
| JSON parse error with R1 models | Strip <think>...</think> block before parsing |
See 08-troubleshooting.md for more.
Further Reading
| Doc | Contents |
|---|---|
| 01-hardware-and-prerequisites.md | M4 Pro specs, disk/RAM budget |
| 02-ollama-setup-and-models.md | Installation, server config, memory management |
| 06-extraction-service-evals.md | promptfoo eval suite, assertion patterns, latency comparison |
| 07-model-recommendations.md | Full model comparison table, gap analysis vs 70B |
| 08-troubleshooting.md | Common issues and fixes |
| 09-environment-variables.md | All config env vars |