# 00 - Developer Guide: Local LLM with Ollama

> How to use the local LLM stack for development, evals, and AI-powered features without cloud API costs or proxy issues.

---

## What Is This?

This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.

**Models installed:**

| Model                | Size   | Best For                                     |
| -------------------- | ------ | -------------------------------------------- |
| `qwen2.5-coder:32b`  | 19 GB  | Code (TS, Python, Swift), structured JSON    |
| `qwen2.5-coder:7b`   | 4.7 GB | Fast code tasks, fits alongside other models |
| `deepseek-r1:32b`    | 19 GB  | Complex reasoning, chain-of-thought          |
| `llama3.1:8b`        | 4.9 GB | Fast evals, general tasks                    |
| `sematre/orpheus:en` | 4 GB   | Text-to-speech (8 voices, emotion tags)      |

---

## Quick Start

### 1. Check Ollama is running

```bash
curl http://localhost:11434/api/tags
```

If it returns a JSON list of models, you're good. If it fails:

```bash
ollama serve                  # start in foreground
# or
brew services start ollama    # start as background service
```

### 2. List available models

```bash
ollama list
```

### 3. Chat with a model (interactive)

```bash
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
```

---

## API Endpoint Reference

| Property            | Value                                 |
| ------------------- | ------------------------------------- |
| **Base URL**        | `http://localhost:11434/v1`           |
| **API Key**         | `ollama` (any non-empty string works) |
| **Protocol**        | OpenAI-compatible REST                |
| **Models endpoint** | `http://localhost:11434/api/tags`     |
| **Loaded models**   | `http://localhost:11434/api/ps`       |

---

## Using in Code

### curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### TypeScript / Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);
```

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```

### Environment variable pattern (recommended)

Instead of hardcoding the URL, use env vars so the same code works against both local and cloud endpoints:

```bash
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
```

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
```

---

## Running Extraction Service Evals Locally

The extraction service has a 19-case promptfoo eval suite that runs directly against Ollama, so no cloud API is needed.

```bash
cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
```

Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.

---

## Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. Override with:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

---

## Model Management

```bash
# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
```

---

## Choosing the Right Model

| Task                             | Recommended Model   | Why                           |
| -------------------------------- | ------------------- | ----------------------------- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally     |
| Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM          |
| Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance   |
| Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B |
| Quick one-off questions          | `llama3.1:8b`       | Fastest response              |

See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.

---

## Important: JSON Output

Always request JSON mode explicitly; models follow the format more reliably with it:

```typescript
response_format: { type: 'json_object' }
```

In promptfoo assertions, `output` is a **raw string**, so parse it first:

```javascript
// ✅ Correct
JSON.parse(output).extractions.map(function (e) { return e.extraction_class; })

// ❌ Wrong: output is not already an object
output.extractions.map(...)
```

### DeepSeek R1: strip `<think>` blocks

R1 models emit reasoning traces in `<think>` blocks before the JSON. Strip them:

```typescript
// content can be null in the SDK types, so fall back to an empty string
const raw = res.choices[0].message.content ?? '';
// Remove every <think>...</think> block, including newlines inside it
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);
```

---

## Troubleshooting

| Problem                                     | Fix                                                                                          |
| ------------------------------------------- | -------------------------------------------------------------------------------------------- |
| `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                           |
| Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                           |
| `MLX dynamic library not available` warning | Harmless; the Metal backend is used automatically                                            |
| Slow responses                              | Check `ollama ps`; the model may be loading cold from disk (the first request takes 5–15 s)  |
| Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"`                              |
| JSON parse error with R1 models             | Strip the `<think>...</think>` block before parsing                                          |

See [08-troubleshooting.md](08-troubleshooting.md) for more.

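
For Python callers, the R1 row in the troubleshooting table could be handled with a small helper like the one sketched here. This is only an illustration: `parse_model_json` is a hypothetical name, not part of the extraction service or the OpenAI SDK, and it assumes the reasoning trace is wrapped in `<think>...</think>` and that whatever remains is a single JSON document. Output from models that emit no trace passes through unchanged, so the same helper can sit in front of every model.

```python
import json
import re
from typing import Any

# Matches one <think>...</think> block; DOTALL lets "." span newlines,
# and the non-greedy ".*?" stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_model_json(raw: str) -> Any:
    """Strip R1-style reasoning traces, then parse the remainder as JSON.

    Hypothetical helper for illustration; raises json.JSONDecodeError
    if the cleaned text is not valid JSON.
    """
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


if __name__ == "__main__":
    # Example input shaped like an R1 response: trace first, JSON after.
    sample = '<think>\nThe user wants a status object.\n</think>\n{"status": "ok"}'
    print(parse_model_json(sample))
```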

---

## Further Reading

| Doc                                                                  | Contents                                                     |
| -------------------------------------------------------------------- | ------------------------------------------------------------ |
| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
| [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
| [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
| [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
| [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |