# 00 — Developer Guide: Local LLM with Ollama

> How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.

---

## What Is This?

This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.

**Models installed:**

| Model                | Size   | Best For                                     |
| -------------------- | ------ | -------------------------------------------- |
| `qwen2.5-coder:32b`  | 19 GB  | Code (TS, Python, Swift), structured JSON    |
| `qwen2.5-coder:7b`   | 4.7 GB | Fast code tasks, fits alongside other models |
| `deepseek-r1:32b`    | 19 GB  | Complex reasoning, chain-of-thought          |
| `llama3.1:8b`        | 4.9 GB | Fast evals, general tasks                    |
| `sematre/orpheus:en` | 4 GB   | Text-to-speech (8 voices, emotion tags)      |

---

## Quick Start

### 1. Check Ollama is running

```bash
curl http://localhost:11434/api/tags
```

If it returns a JSON list of models, the server is up. If the request fails, start Ollama:

```bash
ollama serve          # start in foreground
# or
brew services start ollama   # start as background service
```
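
For scripts that need the same check, here is a minimal Python sketch using only the standard library. It assumes the `/api/tags` response has the shape `{"models": [{"name": ...}]}`. The helper names `parse_model_names` and `ollama_is_up` are illustrative and not part of any Ollama tooling.

```python
import json
import urllib.error
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"


def parse_model_names(body: str) -> list[str]:
    """Extract model names from an /api/tags response body (assumed shape)."""
    payload = json.loads(body)
    return [m["name"] for m in payload.get("models", [])]


def ollama_is_up(url: str = TAGS_URL, timeout: float = 2.0) -> bool:
    """Return True if the server answers with a parseable model list."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            parse_model_names(resp.read().decode("utf-8"))
        return True
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Connection refused, timeout, or an unexpected body all count as "not up".
        return False
```

Calling `ollama_is_up()` before a test run lets a script fail fast with a clear message instead of a connection error deep inside an SDK call.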

### 2. List available models

```bash
ollama list
```

### 3. Chat with a model (interactive)

```bash
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
```

---

## API Endpoint Reference

| Property            | Value                                 |
| ------------------- | ------------------------------------- |
| **Base URL**        | `http://localhost:11434/v1`           |
| **API Key**         | `ollama` (any non-empty string works) |
| **Protocol**        | OpenAI-compatible REST                |
| **Models endpoint** | `http://localhost:11434/api/tags`     |
| **Loaded models**   | `http://localhost:11434/api/ps`       |

---

## Using in Code

### curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### TypeScript / Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);
```

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```

### Environment variable pattern (recommended)

Instead of hardcoding the URL, key, and model name, read them from env vars so the same code runs against both the local server and a cloud provider:

```bash
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
```

```typescript
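import OpenAI from 'openai';

// Both values come from the active .env file, so this code is identical locally and in production.
const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
const model = process.env.LLM_MODEL ?? 'llama3.1:8b'; // pass as `model` in each request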
```
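
The same pattern can be sketched in Python. The `LLMConfig` dataclass and `load_llm_config` helper are hypothetical names, and falling back to the local Ollama values when a variable is unset is an assumption of this sketch rather than existing project behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three variables from the env files above, defaulting to local Ollama."""
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )
```

The resulting fields can be passed to `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)` and to the `model` argument of each request.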

---

## Running Extraction Service Evals Locally

The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.

```bash
cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
```

Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
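
To compare several models in one go, the commands above could be wrapped in a small driver. This is a sketch only: `eval_env` and `run_evals` are hypothetical helpers, and it assumes `pnpm eval:ollama` reads `OLLAMA_MODEL` exactly as shown above. It records each run's exit code without interpreting it.

```python
import os
import subprocess


def eval_env(model: str, base_env: dict[str, str]) -> dict[str, str]:
    """Copy the environment and set OLLAMA_MODEL for one eval run."""
    env = dict(base_env)
    env["OLLAMA_MODEL"] = model
    return env


def run_evals(models: list[str], cwd: str = "services/extraction-service") -> dict[str, int]:
    """Run `pnpm eval:ollama` once per model and collect the exit codes."""
    results: dict[str, int] = {}
    for model in models:
        proc = subprocess.run(
            ["pnpm", "eval:ollama"],
            cwd=cwd,
            env=eval_env(model, dict(os.environ)),
        )
        results[model] = proc.returncode
    return results
```

For example, `run_evals(["llama3.1:8b", "qwen2.5-coder:32b"])` runs the suite twice in sequence, so only one large model needs to be resident at a time.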

---

## Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. Override with:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
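
Since a single missing variable may leave the sidecar on its Gemini default, a pre-flight check can help. The sketch below is a hypothetical helper, not part of the sidecar, and it assumes all four variables above are needed for the override to take effect.

```python
from typing import Mapping

REQUIRED_VARS = (
    "LANGEXTRACT_PROVIDER",
    "LANGEXTRACT_BASE_URL",
    "LANGEXTRACT_API_KEY",
    "LANGEXTRACT_MODEL",
)


def missing_sidecar_vars(env: Mapping[str, str]) -> list[str]:
    """Return the override variables that are unset or empty, in the order listed above."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Passing `os.environ` to `missing_sidecar_vars` before starting the sidecar gives a list of anything still to export.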

---

## Model Management

```bash
# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
```
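
The unload call can also be issued from Python, for instance between eval runs. This is a minimal standard-library sketch that mirrors the curl command above; `build_unload_request` and `unload` are illustrative names.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def build_unload_request(model: str, url: str = GENERATE_URL) -> urllib.request.Request:
    """Build the POST that asks Ollama to drop `model` from RAM (keep_alive of "0")."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str) -> int:
    """Send the unload request and return the HTTP status code."""
    with urllib.request.urlopen(build_unload_request(model), timeout=30) as resp:
        return resp.status
```

Keeping the request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.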

---

## Choosing the Right Model

| Task                             | Recommended Model   | Why                                   |
| -------------------------------- | ------------------- | ------------------------------------- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally             |
| Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM                  |
| Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance           |
| Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B quality |
| Quick one-off questions          | `llama3.1:8b`       | Fastest response                      |

See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.
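
If model choice needs to live in code rather than in a developer's head, the table can be captured as a lookup. The task keys below are hypothetical labels invented for this sketch; only the model names come from the table.

```python
# Task labels are illustrative; model names mirror the recommendation table.
MODEL_FOR_TASK = {
    "code": "qwen2.5-coder:32b",
    "fast_eval": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}


def pick_model(task: str, default: str = "llama3.1:8b") -> str:
    """Return the recommended model for a task, or the small default for unknown tasks."""
    return MODEL_FOR_TASK.get(task, default)
```

Defaulting to the 8B model for unknown tasks is a design choice of this sketch: it keeps an unrecognised label from silently loading a 19 GB model.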

---

## Important: JSON Output

Always request JSON mode explicitly by adding `response_format` to the request options; output is more reliably valid JSON with it:

```typescript
response_format: { type: 'json_object' },
```

When parsing in promptfoo assertions, output is a **raw string** — parse it first:

```javascript
// ✅ Correct
JSON.parse(output).extractions.map(function(e){ return e.extraction_class })

// ❌ Wrong — output is not already an object
output.extractions.map(...)
```
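
The same parse-first rule applies when post-processing model output in Python. The sketch below assumes the output shape implied by the assertion above (an `extractions` array whose items carry `extraction_class`); `extraction_classes` is a hypothetical helper.

```python
import json


def extraction_classes(output: str) -> list[str]:
    """Parse the raw model output and return each extraction's class.

    Returns an empty list when the output is not valid JSON or lacks an
    `extractions` array, so a calling check can fail without raising.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("extractions", [])
    if not isinstance(items, list):
        return []
    return [
        item["extraction_class"]
        for item in items
        if isinstance(item, dict) and "extraction_class" in item
    ]
```

Returning an empty list instead of raising is a deliberate choice here: a malformed response then shows up as a failed expectation rather than a crashed run.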

### DeepSeek R1 — strip `<think>` blocks

R1 models emit a `<think>...</think>` reasoning trace before the JSON, which makes `JSON.parse` fail on the raw output. Strip the trace first:

```typescript
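// `content` is typed `string | null` in the OpenAI SDK, so default to an empty string.
const raw = res.choices[0].message.content ?? '';
// Drop every <think>...</think> block (non-greedy, spans newlines), then parse what remains.
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);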
```
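
For Python callers, an equivalent sketch using the standard `re` and `json` modules follows. `parse_r1_json` is an illustrative name, and the sketch assumes everything outside the `<think>` blocks is a single JSON document.

```python
import json
import re

# Non-greedy match; DOTALL lets "." span the newlines inside a reasoning trace.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str):
    """Remove <think>...</think> traces, then parse what remains as JSON."""
    return json.loads(THINK_BLOCK.sub("", raw).strip())
```

Output without a `<think>` block passes through unchanged, so the same function can be used for every model.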

---

## Troubleshooting

| Problem                                     | Fix                                                                                 |
| ------------------------------------------- | ----------------------------------------------------------------------------------- |
| `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                  |
| Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                  |
| `MLX dynamic library not available` warning | Harmless — Metal backend is used automatically                                      |
| Slow responses                              | Check `ollama ps` — model may be loading cold from disk (first request takes 5–15s) |
| Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"`                     |
| JSON parse error with R1 models             | Strip `<think>...</think>` block before parsing                                     |

See [08-troubleshooting.md](08-troubleshooting.md) for more.

---

## Further Reading

| Doc                                                                  | Contents                                                     |
| -------------------------------------------------------------------- | ------------------------------------------------------------ |
| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
| [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
| [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
| [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
| [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |