# 06 — Extraction Service Evals

> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

---

## Overview

The extraction-service ships a 19-case promptfoo eval suite that can run against local Ollama models instead of, or alongside, cloud-hosted Gemini. This enables:

- **Zero-cost iteration** on extraction prompts
- **Side-by-side comparison** of local vs cloud model quality
- **Offline development** when cloud APIs are unavailable

---

## Files

| File                                                      | Purpose                                            |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
| `services/extraction-service/evals/README.md`             | Full usage docs                                    |

---

## Running Evals

```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Override the default model (llama3.1:8b)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```

### Prerequisites

- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
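
The first two prerequisites can be checked before a run. The following is a minimal sketch, not part of the repo: it assumes Node 18+ (global `fetch`), Ollama's default port 11434, and Ollama's `/api/tags` endpoint for listing local models. The `hasModel` and `checkOllama` names are hypothetical.

```javascript
// Preflight sketch: is Ollama reachable, and has the model been pulled?
// Assumes Node 18+ and Ollama on its default port; function names are illustrative.

// Pure helper: does a /api/tags response list the requested model?
function hasModel(tagsResponse, model) {
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  // Ollama lists untagged pulls as "<name>:latest", so normalize before comparing.
  const wanted = model.includes(':') ? model : `${model}:latest`;
  return models.some((m) => m.name === wanted);
}

// Returns { ok, reason }. fetchImpl is injectable so the check can be exercised offline.
async function checkOllama(model, baseUrl = 'http://localhost:11434', fetchImpl = fetch) {
  let res;
  try {
    res = await fetchImpl(`${baseUrl}/api/tags`);
  } catch {
    return { ok: false, reason: 'Ollama is not reachable; start it with "ollama serve"' };
  }
  if (!res.ok) {
    return { ok: false, reason: `Ollama responded with HTTP ${res.status}` };
  }
  if (!hasModel(await res.json(), model)) {
    return { ok: false, reason: `Model not found; run "ollama pull ${model}"` };
  }
  return { ok: true, reason: 'ready' };
}

// Usage (uncomment to run against a live Ollama instance):
// checkOllama(process.env.OLLAMA_MODEL ?? 'llama3.1:8b').then((r) => console.log(r.reason));
```

Keeping `hasModel` separate from the network call makes the name-matching logic easy to verify without a running server.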

---

## Eval Coverage

| Task                    | Cases | Key Assertions                                                  |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |

**Total: 19 cases, 50+ assertions**

---

## Important: Assertion Pattern

Ollama returns its output as a raw JSON **string**, not a parsed object, so every assertion must call `JSON.parse` on it before reading any fields:

```yaml
# ✅ Correct: parse the string first, then inspect the extractions
- type: javascript
  value: "JSON.parse(output).extractions.some(e => e.extraction_class === 'action')"

# ❌ Wrong: output is a string, so output.classes is undefined and this throws
- type: javascript
  value: "output.classes.includes('action')"
```
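
When many assertions repeat the same parse step, the logic can be factored into a small helper. This is a sketch under assumptions: it relies on the `extractions[].extraction_class` shape from the example above, and `extractionClasses` and `hasClass` are hypothetical names, not functions in the repo. promptfoo can load assertion functions from a file, but check its docs for the exact wiring before adopting this.

```javascript
// Hypothetical helper: parse the raw model output once and list its extraction classes.
// Assumes the { extractions: [{ extraction_class }] } shape used in the eval assertions.
function extractionClasses(output) {
  let parsed;
  try {
    // Ollama output arrives as a string; tolerate an already-parsed object as well.
    parsed = typeof output === 'string' ? JSON.parse(output) : output;
  } catch {
    return []; // malformed JSON: report no classes so the assertion fails cleanly
  }
  const items = Array.isArray(parsed?.extractions) ? parsed.extractions : [];
  return items.map((e) => e.extraction_class);
}

// True when the output contains at least one extraction of the wanted class.
function hasClass(output, wanted) {
  return extractionClasses(output).includes(wanted);
}
```

Returning an empty list on malformed JSON turns a parse error into an ordinary failed assertion, which keeps the pass-rate comparison readable when a local model emits invalid output.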

---

## Pointing the Python Sidecar at Ollama

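The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev, point it at Ollama's OpenAI-compatible endpoint (`/v1` on port 11434). Ollama ignores the API key, but OpenAI-style clients typically require a non-empty value: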

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

> Verify the exact env var names in `services/extraction-service/python/` before relying on these; the sidecar config may use different keys depending on the LangExtract version.
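
Before debugging the sidecar itself, it can help to confirm that the endpoint answers at all. The sketch below sends one request to Ollama's OpenAI-compatible `/v1/chat/completions` route. It reads the `LANGEXTRACT_*` names from the snippet above purely for convenience (they are unverified for the sidecar), assumes Node 18+, and uses the hypothetical names `buildChatRequest` and `smokeTest`.

```javascript
// Smoke-test sketch for Ollama's OpenAI-compatible chat endpoint.
// Env var names mirror the exports above and may not match the real sidecar config.
function buildChatRequest(env) {
  // Strip trailing slashes so the joined URL never contains "//chat".
  const baseUrl = (env.LANGEXTRACT_BASE_URL ?? 'http://localhost:11434/v1').replace(/\/+$/, '');
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Placeholder key: Ollama does not validate it.
        Authorization: `Bearer ${env.LANGEXTRACT_API_KEY ?? 'ollama'}`,
      },
      body: JSON.stringify({
        model: env.LANGEXTRACT_MODEL ?? 'llama3.1:8b',
        messages: [{ role: 'user', content: 'Reply with the single word: ok' }],
        stream: false,
      }),
    },
  };
}

// Sends the request and returns the model's reply text.
async function smokeTest(env = process.env, fetchImpl = fetch) {
  const { url, init } = buildChatRequest(env);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
  const data = await res.json();
  return data.choices[0].message.content; // OpenAI chat-completions response shape
}

// Usage (uncomment with Ollama running):
// smokeTest().then(console.log).catch((err) => console.error(err.message));
```

If this call succeeds but the sidecar still fails, the likely culprit is the env var naming rather than Ollama.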

---

## Cost Comparison

| Provider                            | Cost per full run (19 cases) | Notes                                              |
| ----------------------------------- | ---------------------------- | -------------------------------------------------- |
| **Gemini** (via extraction-service) | ~$0.003–0.005                | `gemini-2.5-flash`                                 |
| **Ollama** (local)                  | $0.00                        | No API charges; fully offline after model download |
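
The per-run difference is small, but it compounds during prompt iteration. As a rough illustration, the sketch below multiplies the Gemini range from the table by a run count; `projectCost` is a hypothetical helper, and the 1,000-run figure is an arbitrary example rather than a measured workload.

```javascript
// Projected Gemini API spend for repeated full runs, using the per-run range from the table.
function projectCost(runs, perRunLow = 0.003, perRunHigh = 0.005) {
  return { low: runs * perRunLow, high: runs * perRunHigh };
}

// Example: 1,000 full runs while iterating on prompts.
const { low, high } = projectCost(1000);
console.log(`Gemini: ~$${low.toFixed(2)} to $${high.toFixed(2)}; Ollama: $0.00 in API charges`);
```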

---

## Recommended Models for Evals

| Model               | JSON Quality | Speed    | Notes                           |
| ------------------- | ------------ | -------- | ------------------------------- |
| `llama3.1:8b`       | Good         | Fast     | Default, reliable JSON output   |
| `qwen2.5:7b`        | Excellent    | Fast     | Best JSON structure compliance  |
| `qwen2.5-coder:32b` | Excellent    | Moderate | Best quality, slower            |
| `phi4`              | Good         | Fast     | Good reasoning for triage tasks |