06 — Extraction Service Evals
Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
Overview
The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
- Zero-cost iteration on extraction prompts
- Side-by-side comparison of local vs cloud model quality
- Offline development when cloud APIs are unavailable
Files
| File | Purpose |
|---|---|
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |
Running Evals
cd services/extraction-service
# Ollama only (no extraction-service needed)
pnpm eval:ollama
# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
Prerequisites
- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
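The first two prerequisites can be checked before a run. The following is a minimal sketch, not part of the eval suite: `isModelAvailable` is a hypothetical helper, and it assumes Ollama is on its default port and that its native `GET /api/tags` endpoint lists local models under a `models` array with a `name` field.

```javascript
// Hypothetical preflight helper; the name is illustrative, not from the repo.
// tagsResponse is the parsed JSON body of GET http://localhost:11434/api/tags,
// assumed to look like { models: [{ name: "llama3.1:8b", ... }] }.
function isModelAvailable(tagsResponse, model) {
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  // Ollama lists untagged pulls as "<name>:latest", so normalize before comparing.
  const wanted = model.includes(":") ? model : `${model}:latest`;
  return models.some((m) => m && m.name === wanted);
}

// Usage (Node 18+ for global fetch, Ollama running locally):
// const res = await fetch("http://localhost:11434/api/tags");
// const model = process.env.OLLAMA_MODEL || "llama3.1:8b";
// if (!isModelAvailable(await res.json(), model)) {
//   console.error(`Model not found; run: ollama pull ${model}`);
// }
```

If the fetch itself fails, the server is most likely not running, which covers the first prerequisite.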
Eval Coverage
| Task | Cases | Key Assertions |
|---|---|---|
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
Total: 19 cases, 50+ assertions
Important: Assertion Pattern
Ollama returns a raw JSON string — assertions must parse it inline:
# ✅ Correct: parse the string first (multi-line body, so `return` is explicit)
- type: javascript
  value: |
    const r = JSON.parse(output);
    return r.extractions.map(e => e.extraction_class).includes('action');
# ❌ Wrong: output is a string, not an object
- type: javascript
  value: output.classes.includes('action')
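When many assertions repeat the same parse step, it can be factored into a small function. This is a sketch under assumptions: `hasExtractionClass` is a hypothetical name, and only the `extractions` and `extraction_class` fields are taken from the assertion above. It treats unparseable output as a failed assertion instead of letting `JSON.parse` throw.

```javascript
// Hypothetical helper; the function name is illustrative, not part of the eval suite.
// Field names (extractions, extraction_class) follow the assertion shown above.
function hasExtractionClass(output, expectedClass) {
  let parsed;
  try {
    // Ollama output arrives as a string, so parse before touching any fields.
    parsed = typeof output === "string" ? JSON.parse(output) : output;
  } catch (err) {
    // Invalid JSON counts as a failed assertion rather than a crashed eval.
    return false;
  }
  const extractions = Array.isArray(parsed?.extractions) ? parsed.extractions : [];
  return extractions.some((e) => e && e.extraction_class === expectedClass);
}
```

A call such as `hasExtractionClass(output, 'action')` then returns a plain boolean, which is the shape a `javascript` assertion expects.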
Pointing the Python Sidecar at Ollama
The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
Check `services/extraction-service/python/` for the exact env var names; the sidecar config may use different keys depending on the LangExtract version.
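Before starting the sidecar, it can help to confirm that the OpenAI-compatible endpoint answers with these values. The sketch below is illustrative only: `buildSmokeTestRequest` is a hypothetical helper, the env var names are the ones listed above (which may differ in your sidecar version), and the fallback values are assumptions matching this page's examples.

```javascript
// Hypothetical helper; builds a one-message chat request against the
// OpenAI-compatible endpoint described by the LANGEXTRACT_* variables above.
function buildSmokeTestRequest(env) {
  // Strip trailing slashes so the path joins cleanly.
  const baseUrl = (env.LANGEXTRACT_BASE_URL || "http://localhost:11434/v1").replace(/\/+$/, "");
  return {
    url: `${baseUrl}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Placeholder key, as in the export above; assumed to satisfy clients that require one.
        Authorization: `Bearer ${env.LANGEXTRACT_API_KEY || "ollama"}`,
      },
      body: JSON.stringify({
        model: env.LANGEXTRACT_MODEL || "llama3.1:8b",
        messages: [{ role: "user", content: "Reply with the word ok." }],
      }),
    },
  };
}

// Usage (Node 18+ for global fetch, Ollama running locally):
// const { url, options } = buildSmokeTestRequest(process.env);
// const res = await fetch(url, options);
// console.log(res.status, (await res.json()).choices?.[0]?.message?.content);
```

A 200 response with a non-empty message suggests the base URL and model name are usable; it does not confirm that the sidecar reads the same variable names.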
Cost Comparison
| Provider | Cost per full run | Notes |
|---|---|---|
| Gemini (via extraction-service) | ~$0.003–0.005 | gemini-2.5-flash |
| Ollama (local) | $0.00 | Fully offline after model download |
Recommended Models for Evals
| Model | JSON Quality | Speed | Notes |
|---|---|---|---|
| `llama3.1:8b` | Good | Fast | Default, reliable JSON output |
| `qwen2.5:7b` | Excellent | Fast | Best JSON structure compliance |
| `qwen2.5-coder:32b` | Excellent | Moderate | Best quality, slower |
| `phi4` | Good | Fast | Good reasoning for triage tasks |