06 — Extraction Service Evals
Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
Overview
The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
- Zero-cost iteration on extraction prompts
- Side-by-side comparison of local vs cloud model quality
- Offline development when cloud APIs are unavailable
Files
| File | Purpose |
|---|---|
| services/extraction-service/evals/promptfoo.yaml | Gemini evals (via extraction-service HTTP API) |
| services/extraction-service/evals/promptfoo.ollama.yaml | Same 19 cases, hits Ollama directly |
| services/extraction-service/evals/compare-evals.sh | Side-by-side Gemini vs Ollama pass-rate comparison |
| services/extraction-service/evals/fixtures/golden.json | Machine-readable golden fixtures |
| services/extraction-service/evals/README.md | Full usage docs |
Running Evals
cd services/extraction-service
# Ollama only (no extraction-service needed)
pnpm eval:ollama
# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
Prerequisites
- Ollama must be running (ollama serve)
- A model must be available (ollama pull llama3.1:8b)
- For comparison: extraction-service must be running with EXTRACTION_EVAL_TOKEN set
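The first two prerequisites can be checked programmatically before starting a run. The sketch below is illustrative only and not part of the eval suite: `hasModel` and `ollamaReady` are hypothetical helper names, and it assumes Ollama is listening on its default port (11434) and that Node 18+ is available for the global `fetch`. It queries Ollama's `/api/tags` endpoint, which lists locally pulled models.

```javascript
// Hypothetical helpers, not part of the repo.

// Pure check: does an /api/tags payload list the given model name?
function hasModel(tagsPayload, model) {
  const models = (tagsPayload && tagsPayload.models) || [];
  return models.some(function (m) { return m.name === model; });
}

// Network check: resolves to true only if Ollama answers and the model is pulled.
// Assumes the default base URL unless one is passed in.
async function ollamaReady(model, baseUrl) {
  try {
    const res = await fetch((baseUrl || 'http://localhost:11434') + '/api/tags');
    if (!res.ok) return false;
    return hasModel(await res.json(), model);
  } catch (err) {
    // Connection refused usually means `ollama serve` is not running.
    return false;
  }
}
```

Calling `ollamaReady('llama3.1:8b')` before `pnpm eval:ollama` gives a quick yes/no instead of a failed eval run.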
Eval Coverage
| Task | Cases | Key Assertions |
|---|---|---|
| transcript-extraction | 4 | action_item, deadline, person, decision, question |
| triage | 5 | brain_signal routing (health/work/money), emotion valence |
| memory-insight | 4 | pattern frequency, relationship, milestone, recurring_theme |
| reflection-enrichment | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| bug-report-extraction | 2 | all 5 fields, severity level attribute |
Total: 19 cases, 50+ assertions
Important: Assertion Pattern
Ollama returns a raw JSON string — assertions must be single expressions (no const/return blocks):
# ✅ Correct: single expression with function(e){return ...}
- type: javascript
  value: "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')"
# ❌ Wrong: statement block causes SyntaxError: Unexpected token 'return'
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
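To try an assertion string locally before committing it to the YAML, a rough stand-in for the evaluator can help. The sketch below is an assumption about the mechanism, not promptfoo's actual implementation: it treats the `value` string as the operand of a `return`, which is enough to show why an expression works and a statement block does not. `evalAssertion` and `sampleOutput` are hypothetical names.

```javascript
// Hypothetical local check; assumes the value is evaluated roughly as
// `return <value>` inside a function that receives `output`.
function evalAssertion(expr, output) {
  const fn = new Function('output', 'return ' + expr + ';');
  return fn(output);
}

// Made-up model output shaped like the extraction JSON used in the assertions.
const sampleOutput = JSON.stringify({
  extractions: [{ extraction_class: 'action' }, { extraction_class: 'deadline' }],
});

const goodExpr =
  "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')";
const badExpr =
  "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');";

const goodResult = evalAssertion(goodExpr, sampleOutput);

// The statement block cannot follow `return`, so building the function throws.
let badError = null;
try {
  evalAssertion(badExpr, sampleOutput);
} catch (err) {
  badError = err;
}

console.log(goodResult, badError instanceof SyntaxError);
```

The exact error text depends on how the real evaluator wraps the string, so only the error type is checked here.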
DeepSeek R1 — Strip <think> blocks
R1 models emit reasoning traces before JSON. Add a provider-level transform:
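providers:
  - id: openai:chat:deepseek-r1:32b
    config:
      apiBaseUrl: http://localhost:11434/v1
      apiKey: ollama
    # Strip <think>...</think> reasoning traces before assertions run.
    # The slash is written as \\/ so the JS regex still sees \/ after YAML unescaping.
    transform: "output.replace(/<think>[\\s\\S]*?<\\/think>/g, '').trim()"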
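The transform expression can be sanity-checked in isolation. This is a minimal sketch that applies the same regex to a made-up R1-style response; `stripThink` and the sample string are illustrative and not taken from real model output.

```javascript
// Same regex as the provider transform: remove every <think>...</think> block
// (non-greedy, spanning newlines), then trim surrounding whitespace.
function stripThink(output) {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}

// Made-up response: a reasoning trace followed by the JSON payload.
const raw = '<think>\nThe user wants structured JSON.\n</think>\n\n{"extractions": []}';

const cleaned = stripThink(raw);
const parsed = JSON.parse(cleaned); // would throw if the trace were still present

console.log(cleaned);
```

The non-greedy `*?` matters when a response contains more than one `<think>` block: each block is removed separately instead of everything between the first opening tag and the last closing tag.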
Pointing the Python Sidecar at Ollama
The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
Check services/extraction-service/python/ for the exact env var names; the sidecar config may use different keys depending on the LangExtract version.
Latency & Cost Comparison
Measured on M4 Pro 48 GB, 19 eval cases, concurrency=4:
| Provider | Model | Duration (19 cases) | Per case avg | Tokens | Cost per run |
|---|---|---|---|---|---|
| Ollama local | llama3.1:8b | ~1m 27s | ~4.6s | ~7,300 | $0.00 |
| Ollama local | qwen2.5-coder:32b | ~5–8m (est.) | ~15–25s | ~7,300 | $0.00 |
| Ollama local | deepseek-r1:32b | ~5–8m (est.) | ~15–25s | ~7,300 | $0.00 |
| Google Cloud | gemini-2.5-flash | ~15–25s | ~1s | ~7,300 | ~$0.003–0.005 |
| Azure OpenAI | gpt-4o | ~20–40s | ~1–2s | ~7,300 | ~$0.05–0.15 |
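Key takeaway: on this suite, gemini-2.5-flash finishes roughly 3.5–6x faster than llama3.1:8b (~15–25s vs ~1m 27s), and the 32B local models are estimated to be slower still, largely because cloud providers serve requests on massively parallel GPU infrastructure. Local wins on cost, privacy, and freedom from proxy/quota issues.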
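The per-case averages and relative speed can be reproduced from the table with simple arithmetic. The sketch below uses only the llama3.1:8b and gemini-2.5-flash figures from the table; the function names are illustrative.

```javascript
// Figures copied from the table; durations in seconds.
const CASES = 19;
const llamaSeconds = 87;         // ~1m 27s
const geminiSeconds = [15, 25];  // ~15–25s range

// Wall-clock time divided by case count.
function perCase(totalSeconds, cases) {
  return totalSeconds / cases;
}

// Slowest cloud run gives the smallest speedup, fastest gives the largest.
function speedupRange(localSeconds, cloudRange) {
  return [localSeconds / cloudRange[1], localSeconds / cloudRange[0]];
}

const llamaPerCase = perCase(llamaSeconds, CASES);
const geminiVsLlama = speedupRange(llamaSeconds, geminiSeconds);

console.log(llamaPerCase.toFixed(1) + 's per case');
console.log(geminiVsLlama.map(function (x) { return x.toFixed(1) + 'x'; }).join(' to '));
```

Because the run uses concurrency=4, the per-case figure is wall-clock time divided by the number of cases, not the latency of a single request.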
Recommended Models for Evals
| Model | Disk | JSON Quality | Reasoning | Speed | Pass Rate | Notes |
|---|---|---|---|---|---|---|
| llama3.1:8b | 4.9GB | Good | Basic | Fast | 19/19 ✅ | Default; assertions tuned for 8B behavior gaps |
| qwen2.5-coder:32b | 19GB | Excellent | Good | Moderate | TBD | Best JSON compliance, strong structured output |
| deepseek-r1:32b | 20GB | Good* | Excellent | Moderate | TBD | *Requires <think> strip transform; see above |
| qwen2.5:7b | 5GB | Excellent | Basic | Fast | TBD | Best JSON structure at 7B size |
| phi4:14b | 9GB | Good | Good | Fast | TBD | Strong reasoning for triage tasks |
Run any model with:
OLLAMA_MODEL=<model> ./evals/run-ollama-evals-logged.sh