saravanakumardb1 cfc1194079 docs(local-llms): add latency/cost comparison and deepseek-r1 transform pattern to evals doc

- Add Latency & Cost Comparison table: llama3.1:8b (~1m27s), qwen2.5-coder:32b
  (~5-8m est.), deepseek-r1:32b (~5-8m est.) vs gemini-2.5-flash (~15-25s, $0.003)
  and gpt-4o (~20-40s, $0.05-0.15) — all measured at 19 cases, concurrency=4
- Fix assertion pattern docs: single expressions required, not const/return blocks
- Add deepseek-r1 <think> strip transform pattern for promptfoo provider config
- Expand recommended models table with Disk, Reasoning, Pass Rate, and Notes columns

2026-02-19 16:05:52 -08:00


06 — Extraction Service Evals

Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

Overview

The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:

Zero-cost iteration on extraction prompts
Side-by-side comparison of local vs cloud model quality
Offline development when cloud APIs are unavailable

Files

File	Purpose
`services/extraction-service/evals/promptfoo.yaml`	Gemini evals (via extraction-service HTTP API)
`services/extraction-service/evals/promptfoo.ollama.yaml`	Same 19 cases, hits Ollama directly
`services/extraction-service/evals/compare-evals.sh`	Side-by-side Gemini vs Ollama pass-rate comparison
`services/extraction-service/evals/fixtures/golden.json`	Machine-readable golden fixtures
`services/extraction-service/evals/README.md`	Full usage docs

Running Evals

cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare

Prerequisites

Ollama must be running (ollama serve)
A model must be available (ollama pull llama3.1:8b)
For comparison: extraction-service must be running with EXTRACTION_EVAL_TOKEN set
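
A pre-flight check for the first two prerequisites could look like the sketch below. It assumes Ollama's default address (`http://localhost:11434`) and its `GET /api/tags` endpoint, which lists locally pulled models. The helper names `hasModel` and `checkOllama` are illustrative and not part of the eval suite.

```javascript
// Sketch: verify that Ollama is reachable and the target model has been pulled.
// Assumes the default Ollama address; adjust OLLAMA_URL if yours differs.
const OLLAMA_URL = 'http://localhost:11434';

// Pure helper: does the /api/tags payload list the requested model?
function hasModel(tagsPayload, model) {
  const models = (tagsPayload && tagsPayload.models) || [];
  return models.some(function (m) { return m.name === model; });
}

// Network check (requires Node 18+ for the global fetch).
async function checkOllama(model) {
  const res = await fetch(OLLAMA_URL + '/api/tags');
  if (!res.ok) throw new Error('Ollama responded with HTTP ' + res.status);
  return hasModel(await res.json(), model);
}
```

Calling `checkOllama('llama3.1:8b')` before `pnpm eval:ollama` should surface a missing server or model earlier than a failed eval run would.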

Eval Coverage

Task	Cases	Key Assertions
`transcript-extraction`	4	action_item, deadline, person, decision, question
`triage`	5	brain_signal routing (health/work/money), emotion valence
`memory-insight`	4	pattern frequency, relationship, milestone, recurring_theme
`reflection-enrichment`	4	emotional_state valence, accomplishment, concern, goal_progress
`bug-report-extraction`	2	all 5 fields, severity level attribute

Total: 19 cases, 50+ assertions
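
As a sanity check on the total, the per-task counts from the table can be summed. This is only a sketch; the object below is copied by hand from the table rather than read from the promptfoo config.

```javascript
// Case counts per task, copied from the Eval Coverage table.
const casesPerTask = {
  'transcript-extraction': 4,
  'triage': 5,
  'memory-insight': 4,
  'reflection-enrichment': 4,
  'bug-report-extraction': 2,
};

// Sum the counts to confirm the stated total.
const totalCases = Object.values(casesPerTask).reduce((sum, n) => sum + n, 0);
console.log(totalCases); // 19
```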

Important: Assertion Pattern

Ollama returns a raw JSON string, so each assertion must call JSON.parse(output) itself and must be written as a single expression (no const/return statement blocks):

# ✅ Correct — single expression with function(e){return ...}
- type: javascript
  value: "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')"

# ❌ Wrong — statement block causes SyntaxError: Unexpected token 'return'
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
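
To try an assertion string outside promptfoo, the sketch below evaluates it as the operand of a `return`. That wrapping is an assumption about how a single-line value is treated, and `evaluateAssertion` and the sample payload are hypothetical.

```javascript
// Sketch: evaluate a promptfoo-style assertion string against a sample output.
// Assumption: a single-line value is treated as an expression, i.e. `return <value>`.
function evaluateAssertion(value, output) {
  try {
    const fn = new Function('output', 'return ' + value);
    return { ok: true, result: fn(output) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

// Hypothetical raw JSON string, shaped like the output the assertions expect.
const sampleOutput = JSON.stringify({
  extractions: [{ extraction_class: 'action' }, { extraction_class: 'deadline' }],
});

const good = "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')";
const bad = "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');";

console.log(evaluateAssertion(good, sampleOutput)); // { ok: true, result: true }
console.log(evaluateAssertion(bad, sampleOutput));  // { ok: false, error: 'SyntaxError' }
```

Under that assumption the statement-block form fails before any model output is inspected, which is why it has to be rewritten as one expression.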

DeepSeek R1 — Strip `<think>` blocks

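R1 models emit a <think>…</think> reasoning trace before the JSON payload, which would otherwise break JSON.parse(output) in the assertions. Strip it with a provider-level transform: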

providers:
  - id: openai:chat:deepseek-r1:32b
    config:
      apiBaseUrl: http://localhost:11434/v1
      apiKey: ollama
      transform: "output.replace(/<think>[\\s\\S]*?<\\/think>/g, '').trim()"
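
The effect of the transform can be checked in isolation. The sketch below applies the same regular expression to a made-up R1-style response; `stripThink` and the sample string are hypothetical.

```javascript
// Sketch: the transform's regex applied to a sample R1-style response.
function stripThink(output) {
  // [\s\S]*? matches across newlines, non-greedily, so each block is removed separately.
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}

// Hypothetical model output: a reasoning trace followed by the JSON payload.
const raw = '<think>\nThe user mentions a deadline...\n</think>\n\n{"extractions":[]}';
const cleaned = stripThink(raw);

console.log(cleaned);             // {"extractions":[]}
console.log(JSON.parse(cleaned)); // { extractions: [] }
```

The non-greedy quantifier matters when a response contains more than one `<think>` block: a greedy match would also delete any text between the first opening tag and the last closing tag.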

Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:

export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b

Check services/extraction-service/python/ for exact env var names — the sidecar config may use different keys depending on LangExtract version.
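
Before pointing the sidecar at Ollama, it can help to confirm that the OpenAI-compatible endpoint answers with the same settings. The sketch below builds such a request from the variables listed above. The LANGEXTRACT_* names are taken from this doc and may not match your sidecar version, and `buildChatRequest` is a hypothetical helper.

```javascript
// Sketch: build an OpenAI-compatible chat request from the env vars listed above.
// The LANGEXTRACT_* names come from this doc and may differ in your sidecar version.
function buildChatRequest(env) {
  // Drop trailing slashes so the path joins cleanly.
  const baseUrl = (env.LANGEXTRACT_BASE_URL || 'http://localhost:11434/v1').replace(/\/+$/, '');
  return {
    url: baseUrl + '/chat/completions',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: 'Bearer ' + (env.LANGEXTRACT_API_KEY || 'ollama'),
      },
      body: JSON.stringify({
        model: env.LANGEXTRACT_MODEL || 'llama3.1:8b',
        messages: [{ role: 'user', content: 'Reply with the word ok.' }],
      }),
    },
  };
}

// Usage (Node 18+):
//   const { url, options } = buildChatRequest(process.env);
//   fetch(url, options).then((r) => r.json()).then(console.log);
```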

Latency & Cost Comparison

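Measured on an M4 Pro with 48 GB, 19 eval cases, concurrency=4. Rows marked (est.) are estimates rather than measured runs; the per-case average is the total duration divided by 19: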

Provider	Model	Duration (19 cases)	Per case avg	Tokens	Cost per run
Ollama local	`llama3.1:8b`	~1m 27s	~4.6s	~7,300	$0.00
Ollama local	`qwen2.5-coder:32b`	~5–8m (est.)	~15–25s	~7,300	$0.00
Ollama local	`deepseek-r1:32b`	~5–8m (est.)	~15–25s	~7,300	$0.00
Google Cloud	`gemini-2.5-flash`	~15–25s	~1s	~7,300	~$0.003–0.005
Azure OpenAI	`gpt-4o`	~20–40s	~1–2s	~7,300	~$0.05–0.15

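Key takeaway: In these runs gemini-2.5-flash finished the 19-case suite roughly 3.5–6x faster than llama3.1:8b (~87s vs ~15–25s), and the gap is larger against the estimated 32B figures. Cloud providers benefit from massively parallel GPU infrastructure. Local wins on cost, privacy, and freedom from proxy/quota issues.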
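
The per-case average and the speed ratio can be recomputed from the table. The sketch below uses only the llama3.1:8b and gemini-2.5-flash durations listed there; the variable names are illustrative.

```javascript
// Figures copied from the table (durations in seconds, 19 cases per run).
const CASES = 19;
const llamaSeconds = 87;        // ~1m 27s for llama3.1:8b
const geminiSeconds = [15, 25]; // ~15–25s for gemini-2.5-flash

const perCase = (seconds) => seconds / CASES;
const round1 = (x) => Math.round(x * 10) / 10;

// Average seconds per case for the local run.
console.log(round1(perCase(llamaSeconds))); // 4.6

// How many times faster the cloud run is, at each end of its range.
console.log(geminiSeconds.map((s) => round1(llamaSeconds / s))); // [ 5.8, 3.5 ]
```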

Recommended Models for Evals

Model	Disk	JSON Quality	Reasoning	Speed	Pass Rate	Notes
`llama3.1:8b`	4.9GB	Good	Basic	Fast	19/19 ✅	Default — tuned assertions for 8B behavior gaps
`qwen2.5-coder:32b`	19GB	Excellent	Good	Moderate	TBD	Best JSON compliance, strong structured output
`deepseek-r1:32b`	20GB	Good*	Excellent	Moderate	TBD	*Requires `<think>` strip transform — see above
`qwen2.5:7b`	5GB	Excellent	Basic	Fast	TBD	Best JSON structure at 7B size
`phi4:14b`	9GB	Good	Good	Fast	TBD	Strong reasoning for triage tasks

Run any model with:

OLLAMA_MODEL=<model> ./evals/run-ollama-evals-logged.sh
