saravanakumardb1 4090c8aa13 docs(local-llms): add developer guide — API endpoint, code examples, model selection

- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00

2026-02-19 18:43:06 -08:00


00 — Developer Guide: Local LLM with Ollama

How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.

What Is This?

This machine runs a local LLM server via Ollama, which exposes an OpenAI-compatible API at http://localhost:11434/v1. Code that calls chat completions through the OpenAI SDK can target it instead of OpenAI/Gemini/Azure by changing only the base URL, API key, and model name.

Models installed:

| Model | Size | Best For |
| --- | --- | --- |
| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
| `llama3.1:8b` | 4.7 GB | Fast evals, general tasks |

Quick Start

1. Check Ollama is running

curl http://localhost:11434/api/tags

If it returns a JSON list of models, the server is running. If the request fails, start Ollama:

ollama serve          # start in foreground
# or
brew services start ollama   # start as background service

2. List available models

ollama list

3. Chat with a model (interactive)

ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
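
The health check in step 1 can also be scripted, for example as a guard before an eval run. The following is a minimal sketch using only the Python standard library. The helper names `parse_model_names` and `list_local_models` are hypothetical, and the sketch assumes `/api/tags` returns an object with a `models` array whose entries carry a `name` field:

```python
import json
import urllib.request
from typing import Optional


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(
    base_url: str = "http://localhost:11434", timeout: float = 3.0
) -> Optional[list[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection refused and timeouts (URLError subclasses it);
        # ValueError covers a non-JSON reply from something else on that port.
        return None


# Usage:
#   models = list_local_models()
#   if models is None: start the server with `ollama serve`, then retry.
```

Returning `None` rather than raising keeps the "server down" case distinct from "server up, no models installed" (an empty list).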

API Endpoint Reference

| Property | Value |
| --- | --- |
| Base URL | `http://localhost:11434/v1` |
| API Key | `ollama` (any non-empty string works) |
| Protocol | OpenAI-compatible REST |
| Models endpoint | `http://localhost:11434/api/tags` |
| Loaded models | `http://localhost:11434/api/ps` |

Using in Code

curl

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'

TypeScript / Node.js (OpenAI SDK)

import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

Environment variable pattern (recommended)

Instead of hardcoding the URL, use env vars so code works with both local and cloud:

# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o

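import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
// Pass as `model` in chat.completions.create(); falls back to the fast local model.
const model = process.env.LLM_MODEL ?? 'llama3.1:8b';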
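
The same pattern in Python might look like the sketch below. It reads the three variables from the `.env` examples above. The `LLMConfig` dataclass and `load_llm_config` helper are hypothetical names, and defaulting to the local Ollama endpoint when a variable is unset is an assumption made for this example, not project behavior:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read endpoint settings from the environment.

    Assumption: unset variables fall back to the local Ollama values.
    """
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )


# Usage with the OpenAI SDK (requires `pip install openai`):
#   from openai import OpenAI
#   cfg = load_llm_config()
#   client = OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
#   client.chat.completions.create(model=cfg.model, messages=[...])
```

Taking the environment as a parameter makes the function easy to exercise in tests without touching the real process environment.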

Running Extraction Service Evals Locally

The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.

cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh

Logs are written to evals/logs/. See 06-extraction-service-evals.md for full details.

Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. Override with:

export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b

Model Management

# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
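
The unload call above can be wrapped for scripts that switch between large models. This is a minimal sketch that sends the same payload as the curl command; `build_unload_request` and `unload_model` are hypothetical helper names:

```python
import json
import urllib.request


def build_unload_request(
    model: str, base_url: str = "http://localhost:11434"
) -> urllib.request.Request:
    """Build the POST request that asks Ollama to evict `model` from RAM."""
    # Same body as the curl example: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str) -> bool:
    """Send the unload request; False if the server is unreachable or returns an HTTP error."""
    try:
        with urllib.request.urlopen(build_unload_request(model), timeout=10) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError both subclass OSError
        return False
```

Running `ollama ps` afterwards confirms whether the model has left memory.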

Choosing the Right Model

| Task | Recommended Model | Why |
| --- | --- | --- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally |
| Fast evals / iteration | `llama3.1:8b` | 40–60 tok/s, low RAM |
| Structured JSON extraction | `qwen2.5-coder:32b` | Excellent format compliance |
| Complex reasoning / triage | `deepseek-r1:32b` | Chain-of-thought, ~80% of 70B |
| Quick one-off questions | `llama3.1:8b` | Fastest response |

See 07-model-recommendations.md for the full comparison table.
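
If model choice needs to live in code rather than in a developer's head, the table above can be encoded as a lookup. The sketch below is illustrative only: the `Task` enum, `MODEL_FOR_TASK`, and `pick_model` are hypothetical names, and falling back to `llama3.1:8b` when the recommended model is not installed is an assumption of this example:

```python
from enum import Enum
from typing import Optional


class Task(str, Enum):
    CODE = "code"
    FAST_EVAL = "fast_eval"
    JSON_EXTRACTION = "json_extraction"
    REASONING = "reasoning"
    QUICK_QUESTION = "quick_question"


# Mirrors the recommendation table in this section.
MODEL_FOR_TASK = {
    Task.CODE: "qwen2.5-coder:32b",
    Task.FAST_EVAL: "llama3.1:8b",
    Task.JSON_EXTRACTION: "qwen2.5-coder:32b",
    Task.REASONING: "deepseek-r1:32b",
    Task.QUICK_QUESTION: "llama3.1:8b",
}

FALLBACK_MODEL = "llama3.1:8b"  # assumption: the small model is always installed


def pick_model(task: Task, installed: Optional[set[str]] = None) -> str:
    """Return the recommended model, or the fallback if it is not installed.

    `installed` is the set of names from `ollama list`; None skips the check.
    """
    model = MODEL_FOR_TASK[task]
    if installed is not None and model not in installed:
        return FALLBACK_MODEL
    return model
```

The fallback matters for `deepseek-r1:32b`, which has to be pulled separately before it can be used.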

Important: JSON Output

Always request JSON mode explicitly; models comply with the format more reliably when it is set:

response_format: { type: 'json_object' },

When parsing in promptfoo assertions, output is a raw string — parse it first:

// ✅ Correct
JSON.parse(output).extractions.map(function(e){ return e.extraction_class })

// ❌ Wrong — output is not already an object
output.extractions.map(...)

DeepSeek R1 — strip `<think>` blocks

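R1 models emit a `<think>...</think>` reasoning trace before the JSON payload, so parsing the raw content fails. Strip the trace first:

// content is typed string | null in the OpenAI SDK, so default to ''.
const raw = res.choices[0].message.content ?? '';
// Drop every <think>...</think> block, then parse what remains.
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);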
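
For Python callers, the same cleanup could be sketched as follows. `parse_model_json` is a hypothetical helper name. It assumes the reasoning trace is wrapped in literal `<think>` and `</think>` tags and that everything outside them is a single JSON document:

```python
import json
import re
from typing import Any, Optional

# Non-greedy match so several <think> blocks are removed independently;
# DOTALL lets "." span the newlines inside a reasoning trace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_model_json(raw: Optional[str]) -> Any:
    """Strip <think> blocks from model output, then parse the rest as JSON."""
    if not raw:
        raise ValueError("model returned empty content")
    cleaned = _THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


# Usage:
#   content = response.choices[0].message.content
#   result = parse_model_json(content)
```

Output without a `<think>` block passes through unchanged, so the helper is safe to apply to every model, not only R1.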

Troubleshooting

| Problem | Fix |
| --- | --- |
| `connection refused` on port 11434 | Run `ollama serve` or `brew services start ollama` |
| Model pull fails / hangs | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>` |
| `MLX dynamic library not available` warning | Harmless; the Metal backend is used automatically |
| Slow responses | Check `ollama ps`; the model may be loading cold from disk (first request takes 5–15s) |
| Out of memory | Run `ollama ps` and unload unused models with `keep_alive: "0"` |
| JSON parse error with R1 models | Strip `<think>...</think>` block before parsing |

See 08-troubleshooting.md for more.

| Doc | Contents |
| --- | --- |
| 01-hardware-and-prerequisites.md | M4 Pro specs, disk/RAM budget |
| 02-ollama-setup-and-models.md | Installation, server config, memory management |
| 06-extraction-service-evals.md | promptfoo eval suite, assertion patterns, latency comparison |
| 07-model-recommendations.md | Full model comparison table, gap analysis vs 70B |
| 08-troubleshooting.md | Common issues and fixes |
| 09-environment-variables.md | All config env vars |
