# 06 — Extraction Service Evals

> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

---

## Overview

The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:

- **Zero-cost iteration** on extraction prompts
- **Side-by-side comparison** of local vs cloud model quality
- **Offline development** when cloud APIs are unavailable

---

## Files

| File                                                      | Purpose                                            |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
| `services/extraction-service/evals/README.md`             | Full usage docs                                    |

---

## Running Evals

```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```

### Prerequisites

- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
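
The first two prerequisites can also be checked programmatically before a run. The following is a minimal sketch, assuming Node 18+ (for the global `fetch`) and Ollama's default address `http://localhost:11434`. `hasModel` and `checkOllama` are hypothetical helper names and are not part of the eval suite.

```javascript
// Sketch: verify that Ollama is reachable and that a model has been pulled.
// Assumes Ollama's default address; both helper names are illustrative only.
const OLLAMA_BASE_URL = 'http://localhost:11434';

// Pure helper: does the /api/tags payload list the given model name?
function hasModel(tags, model) {
  const models = (tags && tags.models) || [];
  return models.some(function (m) { return m.name === model; });
}

// Queries Ollama's list of local models; rejects if the server is not running.
async function checkOllama(model) {
  const res = await fetch(OLLAMA_BASE_URL + '/api/tags');
  if (!res.ok) throw new Error('Ollama responded with HTTP ' + res.status);
  return hasModel(await res.json(), model);
}
```

Calling `checkOllama('llama3.1:8b')` resolves to `true` when the model is available locally, resolves to `false` when it still needs to be pulled, and rejects when `ollama serve` is not running.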

---

## Eval Coverage

| Task                    | Cases | Key Assertions                                                  |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |

**Total: 19 cases, 50+ assertions**

---

## Important: Assertion Pattern

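Ollama returns the model output as a raw JSON **string**, so every assertion has to call `JSON.parse(output)` itself. Inline assertions must also be single **expressions** (no `const`/`return` statement blocks):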

```yaml
# ✅ Correct — single expression with function(e){return ...}
- type: javascript
  value: "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')"

# ❌ Wrong — statement block causes SyntaxError: Unexpected token 'return'
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
```
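
To see what the correct assertion checks, it can help to evaluate the same expression by hand. This is a minimal sketch, assuming the response shape implied by the assertion above (an `extractions` array whose items carry an `extraction_class` field). The sample payload is invented for illustration, and `evaluateAsExpression` is a hypothetical helper that only approximates how a runner might wrap a one-line expression; it is not promptfoo's actual implementation.

```javascript
// Invented sample of what the model might return: a JSON *string*, not an object.
const output = JSON.stringify({
  extractions: [
    { extraction_class: 'action' },
    { extraction_class: 'deadline' },
  ],
});

// The single-expression assertion from the YAML above.
const expression =
  "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')";

// Approximation of a runner that treats the value as an expression to return.
function evaluateAsExpression(code, output) {
  return new Function('output', 'return ' + code)(output);
}

const passed = evaluateAsExpression(expression, output);

// A statement block is not a valid expression, so the wrapped code fails to parse.
let statementError = null;
try {
  evaluateAsExpression('const r=JSON.parse(output); return r.extractions.length > 0;', output);
} catch (err) {
  statementError = err;
}
```

Under these assumptions `passed` is `true` and `statementError` is a `SyntaxError`, which mirrors the failure mode described above.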

### DeepSeek R1 — Strip `<think>` blocks

R1 models emit reasoning traces before JSON. Add a provider-level transform:

```yaml
providers:
  - id: openai:chat:deepseek-r1:32b
    config:
      apiBaseUrl: http://localhost:11434/v1
      apiKey: ollama
      transform: "output.replace(/<think>[\\s\\S]*?<\\/think>/g, '').trim()"
```
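
The effect of the transform can be checked against a sample response. This is a minimal sketch: the R1-style response text is invented for illustration, and `stripThink` is a hypothetical helper that wraps the same regex.

```javascript
// Same regex as the transform: remove every <think>...</think> block
// (non-greedy, spanning newlines), then trim the surrounding whitespace.
function stripThink(output) {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}

// Invented example of an R1-style response: reasoning trace first, then JSON.
const raw = '<think>\nThe user mentions a deadline...\n</think>\n\n{"extractions":[]}';

const cleaned = stripThink(raw);
const parsed = JSON.parse(cleaned); // parses only once the trace is removed
```

Without the strip step, `JSON.parse(raw)` would throw on the leading `<think>` tag, so every assertion that parses `output` would fail regardless of extraction quality.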

---

## Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

> Check `services/extraction-service/python/` for exact env var names — the sidecar config may use different keys depending on LangExtract version.

---

## Latency & Cost Comparison

Measured from an M4 Pro (48 GB) across the 19 eval cases at concurrency=4; rows marked "est." are estimates rather than measurements:

| Provider         | Model               | Duration (19 cases) | Per case avg | Tokens | Cost per run  |
| ---------------- | ------------------- | ------------------- | ------------ | ------ | ------------- |
| **Ollama local** | `llama3.1:8b`       | ~1m 27s             | ~4.6s        | ~7,300 | **$0.00**     |
| **Ollama local** | `qwen2.5-coder:32b` | ~5–8m (est.)        | ~15–25s      | ~7,300 | **$0.00**     |
| **Ollama local** | `deepseek-r1:32b`   | ~5–8m (est.)        | ~15–25s      | ~7,300 | **$0.00**     |
| **Google Cloud** | `gemini-2.5-flash`  | ~15–25s             | ~1s          | ~7,300 | ~$0.003–0.005 |
| **Azure OpenAI** | `gpt-4o`            | ~20–40s             | ~1–2s        | ~7,300 | ~$0.05–0.15   |

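**Key takeaway:** Against `llama3.1:8b`, `gemini-2.5-flash` finishes the suite roughly 3.5–6x faster (~15–25s vs ~1m 27s) thanks to large-scale parallel GPU infrastructure, and the gap is wider still for the 32B local models (an estimated ~15–25s per case vs ~1s). Local wins on cost, privacy, and freedom from proxy/quota issues.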
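
The per-case averages and the cloud speedup can be rederived from the durations in the table. This is a small sketch of that arithmetic, using only the figures above (87 s for `llama3.1:8b`, 15–25 s for `gemini-2.5-flash`, 19 cases). Because the suite runs at concurrency=4, these appear to be wall-clock averages per case rather than single-request latencies.

```javascript
const CASES = 19;
const llamaSeconds = 87;         // ~1m 27s for llama3.1:8b
const geminiSeconds = [15, 25];  // ~15–25s for gemini-2.5-flash

// Wall-clock seconds per eval case for a given total run duration.
function perCase(totalSeconds) {
  return totalSeconds / CASES;
}

const llamaPerCase = perCase(llamaSeconds);                          // ≈ 4.58
const geminiPerCase = geminiSeconds.map(function (s) { return perCase(s); }); // ≈ [0.79, 1.32]

// Speedup of Gemini over llama3.1:8b; the fastest Gemini run gives the largest ratio.
const speedup = geminiSeconds.map(function (s) { return llamaSeconds / s; }); // ≈ [5.8, 3.48]
```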

---

## Recommended Models for Evals

| Model               | Disk  | JSON Quality | Reasoning | Speed    | Pass Rate | Notes                                            |
| ------------------- | ----- | ------------ | --------- | -------- | --------- | ------------------------------------------------ |
| `llama3.1:8b`       | 4.9GB | Good         | Basic     | Fast     | 19/19 ✅  | Default — tuned assertions for 8B behavior gaps  |
| `qwen2.5-coder:32b` | 19GB  | Excellent    | Good      | Moderate | TBD       | Best JSON compliance, strong structured output   |
| `deepseek-r1:32b`   | 20GB  | Good\*       | Excellent | Moderate | TBD       | \*Requires `<think>` strip transform — see above |
| `qwen2.5:7b`        | 5GB   | Excellent    | Basic     | Fast     | TBD       | Best JSON structure at 7B size                   |
| `phi4:14b`          | 9GB   | Good         | Good      | Fast     | TBD       | Strong reasoning for triage tasks                |

Run any model with:

```bash
OLLAMA_MODEL=<model> ./evals/run-ollama-evals-logged.sh
```