# 06 - Extraction Service Evals

> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

---

## Overview

The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:

- **Zero-cost iteration** on extraction prompts
- **Side-by-side comparison** of local vs cloud model quality
- **Offline development** when cloud APIs are unavailable

---

## Files

| File                                                      | Purpose                                            |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
| `services/extraction-service/evals/README.md`             | Full usage docs                                    |

---

## Running Evals

```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```

### Prerequisites

- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set

---

## Eval Coverage

| Task                    | Cases | Key Assertions                                                  |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |

**Total: 19 cases, 50+ assertions**

---

## Important: Assertion Pattern

Ollama returns a raw JSON **string**, so each assertion must parse it with `JSON.parse(output)` and must be a single **expression** (no `const`/`return` blocks):

```yaml
# ✅ Correct: single expression with function(e){return ...}
- type: javascript
  value: "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')"

# ❌ Wrong: statement block causes SyntaxError: Unexpected token 'return'
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
```

### DeepSeek R1: Strip `<think>` blocks

R1 models emit reasoning traces wrapped in `<think>...</think>` before the JSON. Add a provider-level transform to remove them:

```yaml
providers:
  - id: openai:chat:deepseek-r1:32b
    config:
      apiBaseUrl: http://localhost:11434/v1
      apiKey: ollama
    # Backslashes are doubled so the YAML double-quoted string passes \s, \S and \/ through to the regex
    transform: "output.replace(/<think>[\\s\\S]*?<\\/think>/g, '').trim()"
```

---

## Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

> Check `services/extraction-service/python/` for exact env var names; the sidecar config may use different keys depending on LangExtract version.

---

## Latency & Cost Comparison

Measured on M4 Pro 48 GB, 19 eval cases, concurrency=4:

| Provider         | Model               | Duration (19 cases) | Per case avg | Tokens | Cost per run  |
| ---------------- | ------------------- | ------------------- | ------------ | ------ | ------------- |
| **Ollama local** | `llama3.1:8b`       | ~1m 27s             | ~4.6s        | ~7,300 | **$0.00**     |
| **Ollama local** | `qwen2.5-coder:32b` | ~5–8m (est.)        | ~15–25s      | ~7,300 | **$0.00**     |
| **Ollama local** | `deepseek-r1:32b`   | ~5–8m (est.)        | ~15–25s      | ~7,300 | **$0.00**     |
| **Google Cloud** | `gemini-2.5-flash`  | ~15–25s             | ~1s          | ~7,300 | ~$0.003–0.005 |
| **Azure OpenAI** | `gpt-4o`            | ~20–40s             | ~1–2s        | ~7,300 | ~$0.05–0.15   |

**Key takeaway:** Cloud models are 5–6x faster per request due to massive parallel GPU infrastructure. Local wins on cost, privacy, and no proxy/quota issues.

---

## Recommended Models for Evals

| Model               | Disk  | JSON Quality | Reasoning | Speed    | Pass Rate | Notes                                           |
| ------------------- | ----- | ------------ | --------- | -------- | --------- | ----------------------------------------------- |
| `llama3.1:8b`       | 4.9GB | Good         | Basic     | Fast     | 19/19 ✅  | Default; assertions tuned for 8B behavior gaps  |
| `qwen2.5-coder:32b` | 19GB  | Excellent    | Good      | Moderate | TBD       | Best JSON compliance, strong structured output  |
| `deepseek-r1:32b`   | 20GB  | Good\*       | Excellent | Moderate | TBD       | \*Requires `<think>` strip transform (see above) |
| `qwen2.5:7b`        | 5GB   | Excellent    | Basic     | Fast     | TBD       | Best JSON structure at 7B size                  |
| `phi4:14b`          | 9GB   | Good         | Good      | Fast     | TBD       | Strong reasoning for triage tasks               |

Run any model with:

```bash
OLLAMA_MODEL=<model> ./evals/run-ollama-evals-logged.sh
```
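
When a new model fails assertions, it can help to check the DeepSeek R1 strip transform and the single-expression assertion pattern in plain Node before touching the promptfoo config. The sketch below is only an illustration: `stripThink` and `hasClass` are hypothetical helper names that wrap the same expressions used in the YAML strings, and the sample output is made up rather than captured from a real model run.

```javascript
// Hypothetical helpers (not part of the eval suite) that mirror the inline
// expressions used in the promptfoo config strings.

// Same regex as the provider-level transform: drop <think>...</think> traces.
function stripThink(output) {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}

// Same single-expression shape as the "Correct" assertion example.
function hasClass(output, cls) {
  return JSON.parse(output).extractions
    .map(function (e) { return e.extraction_class; })
    .includes(cls);
}

// Made-up sample: a reasoning trace followed by the JSON payload.
const raw =
  '<think>\nThe user mentions a task.\n</think>\n' +
  '{"extractions":[{"extraction_class":"action","extraction_text":"send report"}]}';

const cleaned = stripThink(raw);

console.log(cleaned.startsWith('{'));        // true
console.log(hasClass(cleaned, 'action'));    // true
console.log(hasClass(cleaned, 'deadline'));  // false
```

If `JSON.parse` throws here, the model output still contains non-JSON text, which would surface in promptfoo as an assertion error rather than a plain failed check.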