- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API covering all 5 built-in tasks (transcript, triage, memory-insight, reflection-enrichment, bug-report-extraction) - Add evals/fixtures/golden.json: machine-readable golden input/output fixtures - Add evals/run-evals.sh: shell runner with health checks, auth token handling, task filtering, and CI mode - Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration
6.3 KiB
Extraction Service — LLM Evals
Quality evals for all 5 built-in extraction tasks using promptfoo.
Structure
evals/
├── promptfoo.yaml # Main eval config — all test cases + assertions
├── fixtures/
│ └── golden.json # Golden input/output fixtures (machine-readable)
├── run-evals.sh # Shell runner with health check + auth guard
└── README.md # This file
Prerequisites
-
extraction-service running on port 4005:
pnpm dev -
Auth token from platform-service:
# POST /api/auth/login → copy the token export EXTRACTION_EVAL_TOKEN=<your-jwt> -
promptfoo (installed automatically via
npx, or globally):npm install -g promptfoo
Running Evals
# All tasks
pnpm eval
# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction
# CI mode (exits non-zero on any failure)
pnpm eval:ci
# JSON output (for dashboards / reporting)
pnpm eval:json
What's Tested
| Task | Cases | Key Assertions |
|---|---|---|
transcript-extraction |
4 | action_item, deadline, person, decision, question classes present |
triage |
5 | brain_signal routing (health/work/money), emotion valence, date_reference |
memory-insight |
4 | pattern frequency, relationship, milestone, recurring_theme |
reflection-enrichment |
4 | emotional_state valence, accomplishment, concern, goal_progress |
bug-report-extraction |
2 | all 5 fields extracted, severity level attribute |
Total: 19 eval cases, 60+ assertions
Adding New Cases
-
Add a test case to
promptfoo.yamlundertests::- description: 'triage: home brain signal for household content' vars: taskId: triage text: 'Need to fix the leaking faucet in the kitchen this weekend.' assert: - type: javascript value: output.classes.includes('action') - type: javascript value: | output.extractions.some(e => e.extraction_class === 'brain_signal' && e.attributes?.brain === 'home' ) -
Optionally add the fixture to
fixtures/golden.jsonfor machine-readable tracking.
Local OSS Models (Ollama)
Run the same 19 eval cases against a local model — zero API cost, fully private.
Install Ollama
# 1. Install
brew install ollama
# 2. Start server (keep this terminal open)
ollama serve
# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b
# If corp proxy blocks it, try bypassing:
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b
# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
Run Ollama Evals
# Default model: llama3.1:8b
pnpm eval:ollama
# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
Note: Ollama evals hit the model directly (no extraction-service needed). Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.
Compare Gemini vs Ollama
Runs both suites and prints a side-by-side pass-rate table:
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare
# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
Example output:
Provider Passed Total Rate Progress
───────────────────────────────────────────────────────
Gemini 57 60 95% █████████░
Ollama 48 60 80% ████████░░
Per-task breakdown:
Task Gemini Ollama
──────────────────────────────────────────────────
triage 20/20 (100%) 16/20 (80%)
transcript-extraction 15/16 (94%) 12/16 (75%)
reflection-enrichment 12/12 (100%) 10/12 (83%)
memory-insight 8/8 (100%) 7/8 (88%)
bug-report-extraction 2/4 (50%) 3/4 (75%)
Supported Local Models
| Model | Pull command | RAM needed | Notes |
|---|---|---|---|
llama3.1:8b |
ollama pull llama3.1:8b |
~6GB | Best default |
qwen2.5:7b |
ollama pull qwen2.5:7b |
~5GB | Strong JSON output |
phi4:14b |
ollama pull phi4 |
~9GB | Good reasoning |
llama3.3:70b |
ollama pull llama3.3:70b |
~40GB | Best quality, needs M2 Max+ |
CI Integration
Add to your GitHub Actions workflow:
- name: Run extraction evals
env:
EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
EXTRACTION_SERVICE_URL: http://localhost:4005
run: pnpm --filter @lysnrai/extraction-service eval:ci
The service must be started before running evals in CI.
Interpreting Results
promptfoo outputs a table showing pass/fail per assertion. A case fails if any assertion fails.
Common failure patterns:
- Missing class — model didn't extract that entity type; consider adding more examples to the task seed
- Wrong attribute —
brain_signal.brainoremotion.valenceincorrect; refine the task prompt inseed.ts - Latency > 15s — sidecar overloaded or model cold-starting; check Python sidecar logs
Thresholds
- Latency: 15s max per extraction (default; adjust
thresholdinpromptfoo.yaml) - Pass rate target: 100% for golden cases (these are deterministic enough inputs)
- LLM-as-judge: Not yet implemented — add when you have enough production data to define rubrics