learning_ai_common_plat/__LOCAL_LLMs/docs/00-developer-guide.md
2026-02-21 14:13:07 -08:00


00 — Developer Guide: Local LLM with Ollama

How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.


What Is This?

This machine runs a local LLM server via Ollama, exposing an OpenAI-compatible API at http://localhost:11434/v1. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.

Models installed:

| Model | Size | Best For |
| --- | --- | --- |
| qwen2.5-coder:32b | 19 GB | Code (TS, Python, Swift), structured JSON |
| qwen2.5-coder:7b | 4.7 GB | Fast code tasks, fits alongside other models |
| deepseek-r1:32b | 19 GB | Complex reasoning, chain-of-thought |
| llama3.1:8b | 4.9 GB | Fast evals, general tasks |
| sematre/orpheus:en | 4 GB | Text-to-speech (8 voices, emotion tags) |

Quick Start

1. Check Ollama is running

curl http://localhost:11434/api/tags

If it returns a JSON list of models, you're good. If it fails:

ollama serve          # start in foreground
# or
brew services start ollama   # start as background service

2. List available models

ollama list

3. Chat with a model (interactive)

ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
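The check in step 1 can also be scripted, for example as a pre-flight step before an eval run. The following is a minimal sketch using only the Python standard library. The helper names `model_names` and `fetch_models` are illustrative, not part of any existing tooling here, and the sketch assumes the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    models = fetch_models()
    if models is None:
        print("Ollama is not reachable; try `ollama serve`.")
    else:
        print("Installed models:", ", ".join(models))
```

Returning `None` instead of raising keeps the caller's logic simple: a script can decide whether to start the server, skip local evals, or fall back to a cloud endpoint.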

API Endpoint Reference

| Property | Value |
| --- | --- |
| Base URL | http://localhost:11434/v1 |
| API Key | ollama (any non-empty string works) |
| Protocol | OpenAI-compatible REST |
| Models endpoint | http://localhost:11434/api/tags |
| Loaded models | http://localhost:11434/api/ps |

Using in Code

curl

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'

TypeScript / Node.js (OpenAI SDK)

import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

Instead of hardcoding the URL, use env vars so code works with both local and cloud:

# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o

import OpenAI from 'openai';

// Same client code for local and cloud; only the env vars change.
const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
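For Python callers, the same environment-variable pattern might look like the sketch below. `LLMConfig` and `load_llm_config` are hypothetical names introduced for illustration. Falling back to the local Ollama values when a variable is unset is a design choice of this sketch, not something the guide prescribes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three variables from the .env examples above.

    Assumption of this sketch: unset variables fall back to local Ollama.
    """
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )


if __name__ == "__main__":
    cfg = load_llm_config()
    print(f"Using {cfg.model} at {cfg.base_url}")
```

The resulting values can be passed straight to the SDK, for example `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)` with `model=cfg.model` on each request. Taking `env` as a parameter makes the loader easy to test with a plain dict.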

Running Extraction Service Evals Locally

The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.

cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh

Logs are written to evals/logs/. See 06-extraction-service-evals.md for full details.


Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. Override with:

export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b

Model Management

# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
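When several large models are resident at once, the list-then-unload steps above can be combined into one script. This is a rough sketch using the standard library. It mirrors the request body from the curl command above and assumes `/api/ps` returns a JSON object with a `models` array of entries that have a `name` field. The function names are illustrative.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def unload_request(model: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build the POST that asks Ollama to evict `model` from memory."""
    # Same body as the curl example: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def loaded_model_names(ps_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/ps response body."""
    return [m["name"] for m in ps_payload.get("models", [])]


def unload_all(base_url: str = OLLAMA_URL) -> List[str]:
    """Unload every model currently in memory; returns the names it unloaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        names = loaded_model_names(json.load(resp))
    for name in names:
        urllib.request.urlopen(unload_request(name, base_url), timeout=30).close()
    return names
```

Calling `unload_all()` before loading a 32B model is one way to make sure the smaller models are not still holding RAM.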

Choosing the Right Model

| Task | Recommended Model | Why |
| --- | --- | --- |
| TypeScript / Python / Swift code | qwen2.5-coder:32b | Best code quality locally |
| Fast evals / iteration | llama3.1:8b | 40-60 tok/s, low RAM |
| Structured JSON extraction | qwen2.5-coder:32b | Excellent format compliance |
| Complex reasoning / triage | deepseek-r1:32b | Chain-of-thought, ~80% of 70B |
| Quick one-off questions | llama3.1:8b | Fastest response |

See 07-model-recommendations.md for the full comparison table.


Important: JSON Output

Always request JSON mode explicitly — models are more reliable with it:

response_format: { type: 'json_object' }

When parsing in promptfoo assertions, output is a raw string — parse it first:

// ✅ Correct
JSON.parse(output).extractions.map(function(e){ return e.extraction_class })

// ❌ Wrong — output is not already an object
output.extractions.map(...)
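The same parse-first rule applies if an assertion is written in Python rather than JavaScript. Below is a sketch of a promptfoo Python assertion file, which promptfoo calls through a `get_assert(output, context)` function. The `extractions` and `extraction_class` field names come from the example above. The expected class `action_item` is a placeholder invented for illustration, not a value taken from the eval suite.

```python
import json
from typing import Any, Dict, List

EXPECTED_CLASS = "action_item"  # hypothetical class name, for illustration only


def extraction_classes(output: str) -> List[str]:
    """Parse the raw model output and collect each extraction's class."""
    data = json.loads(output)  # output arrives as a string, not an object
    return [e["extraction_class"] for e in data["extractions"]]


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when at least one extraction has the expected class."""
    try:
        classes = extraction_classes(output)
    except (ValueError, KeyError, TypeError) as exc:
        # Invalid JSON or an unexpected shape counts as a failed case.
        return {"pass": False, "score": 0.0, "reason": f"unparseable output: {exc}"}
    ok = EXPECTED_CLASS in classes
    return {"pass": ok, "score": 1.0 if ok else 0.0, "reason": f"classes found: {classes}"}
```

Returning a dict with `pass`, `score`, and `reason` makes failures easier to read in the eval report than a bare boolean.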

DeepSeek R1 — strip <think> blocks

R1 models emit reasoning traces before JSON. Strip them:

const raw = res.choices[0].message.content;
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);
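A Python equivalent of the snippet above, for callers such as the sidecar, could be sketched as follows. The function name `parse_r1_json` is illustrative. Like the TypeScript version, it assumes that whatever remains after the `<think>` block is removed is plain JSON.

```python
import json
import re
from typing import Any

# DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first closing tag.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str) -> Any:
    """Remove <think>...</think> traces, then parse the remainder as JSON."""
    cleaned = _THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)
```

Input without a `<think>` block passes through unchanged, so the same helper can be used for every model rather than only for R1.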

Troubleshooting

| Problem | Fix |
| --- | --- |
| connection refused on port 11434 | Run ollama serve or brew services start ollama |
| Model pull fails / hangs | Use NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model> |
| MLX dynamic library not available warning | Harmless; the Metal backend is used automatically |
| Slow responses | Check ollama ps; the model may be loading cold from disk (first request takes 5-15 s) |
| Out of memory | Run ollama ps and unload unused models with keep_alive: "0" |
| JSON parse error with R1 models | Strip the <think>...</think> block before parsing |

See 08-troubleshooting.md for more.


Further Reading

| Doc | Contents |
| --- | --- |
| 01-hardware-and-prerequisites.md | M4 Pro specs, disk/RAM budget |
| 02-ollama-setup-and-models.md | Installation, server config, memory management |
| 06-extraction-service-evals.md | promptfoo eval suite, assertion patterns, latency comparison |
| 07-model-recommendations.md | Full model comparison table, gap analysis vs 70B |
| 08-troubleshooting.md | Common issues and fixes |
| 09-environment-variables.md | All config env vars |