saravanakumardb1 acd4c3542b feat(extraction-service): scaffold promptfoo eval suite with 19 test cases

- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API
  covering all 5 built-in tasks (transcript, triage, memory-insight,
  reflection-enrichment, bug-report-extraction)
- Add evals/fixtures/golden.json: machine-readable golden input/output fixtures
- Add evals/run-evals.sh: shell runner with health checks, auth token
  handling, task filtering, and CI mode
- Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration

2026-02-19 12:19:16 -08:00

6.3 KiB


Extraction Service — LLM Evals

Quality evals for all 5 built-in extraction tasks using promptfoo.

Structure

evals/
├── promptfoo.yaml          # Main eval config — all test cases + assertions
├── fixtures/
│   └── golden.json         # Golden input/output fixtures (machine-readable)
├── run-evals.sh            # Shell runner with health check + auth guard
└── README.md               # This file

Prerequisites

extraction-service running on port 4005:
```
pnpm dev
```

Auth token from platform-service:

# POST /api/auth/login → copy the token
export EXTRACTION_EVAL_TOKEN=<your-jwt>

promptfoo (installed automatically via npx, or globally):
```
npm install -g promptfoo
```

Running Evals

# All tasks
pnpm eval

# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction

# CI mode (exits non-zero on any failure)
pnpm eval:ci

# JSON output (for dashboards / reporting)
pnpm eval:json

What's Tested

| Task | Cases | Key Assertions |
|---|---|---|
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute |

Total: 19 eval cases, 60+ assertions

Adding New Cases

Add a test case to promptfoo.yaml under the `tests:` key:

- description: 'triage: home brain signal for household content'
  vars:
    taskId: triage
    text: 'Need to fix the leaking faucet in the kitchen this weekend.'
  assert:
    - type: javascript
      value: output.classes.includes('action')
    - type: javascript
      value: |
        output.extractions.some(e =>
          e.extraction_class === 'brain_signal' &&
          e.attributes?.brain === 'home'
        )

Optionally add the fixture to fixtures/golden.json for machine-readable tracking.
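
The two assertions in the case above can be tried against a hand-written response before wiring them into promptfoo. This is a minimal sketch: `sampleOutput` is invented for illustration and only mirrors the fields the assertions read (`classes`, `extractions`, `extraction_class`, `attributes.brain`), so a real response may carry more. The function names are hypothetical.

```javascript
// Hypothetical sample response, shaped after the fields the assertions read.
const sampleOutput = {
  classes: ['action', 'brain_signal'],
  extractions: [
    { extraction_class: 'action', attributes: {} },
    { extraction_class: 'brain_signal', attributes: { brain: 'home' } },
  ],
};

// Same check as the first `type: javascript` assertion in the YAML case.
function hasActionClass(output) {
  return output.classes.includes('action');
}

// Same check as the second assertion: a brain_signal routed to "home".
function hasHomeBrainSignal(output) {
  return output.extractions.some(
    (e) => e.extraction_class === 'brain_signal' && e.attributes?.brain === 'home'
  );
}

console.log(hasActionClass(sampleOutput), hasHomeBrainSignal(sampleOutput));
```

Optional chaining (`attributes?.brain`) keeps the check from throwing when an extraction has no `attributes` object, so such an extraction simply fails to match.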

Local OSS Models (Ollama)

Run the same 19 eval cases against a local model — zero API cost, fully private.

Install Ollama

# 1. Install
brew install ollama

# 2. Start server (keep this terminal open)
ollama serve

# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b

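# If a corporate proxy blocks the download, try bypassing it. The Ollama server
# performs the download, so the variables may also need clearing where `ollama serve` runs:
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b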

# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap

Run Ollama Evals

# Default model: llama3.1:8b
pnpm eval:ollama

# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama

Note: Ollama evals call the model directly, so extraction-service does not need to be running. The timeout is 45s per case (vs 15s for Gemini) because local inference is slower.

Compare Gemini vs Ollama

Runs both suites and prints a side-by-side pass-rate table:

# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare

# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare

Example output:

  Provider     Passed     Total    Rate     Progress
  ───────────────────────────────────────────────────────
  Gemini       57         60       95%      █████████░
  Ollama       48         60       80%      ████████░░

  Per-task breakdown:
  Task                      Gemini       Ollama
  ──────────────────────────────────────────────────
  triage                    20/20 (100%) 16/20 (80%)
  transcript-extraction     15/16 (94%)  12/16 (75%)
  reflection-enrichment     12/12 (100%) 10/12 (83%)
  memory-insight            8/8 (100%)   7/8 (88%)
  bug-report-extraction     2/4 (50%)    3/4 (75%)
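
The figures in the example output are consistent with simple arithmetic over the per-task counts. The sketch below shows one way to derive a summary row. It is an illustration, not the runner's actual code; the function names are hypothetical, and rounding the bar down to whole cells is an assumption chosen because it reproduces the example rows.

```javascript
// Per-task [passed, total] counts copied from the Gemini column of the example.
const geminiByTask = {
  triage: [20, 20],
  'transcript-extraction': [15, 16],
  'reflection-enrichment': [12, 12],
  'memory-insight': [8, 8],
  'bug-report-extraction': [2, 4],
};

// Sum the per-task counts into a single { passed, total } pair.
function totals(byTask) {
  return Object.values(byTask).reduce(
    (acc, [passed, total]) => ({ passed: acc.passed + passed, total: acc.total + total }),
    { passed: 0, total: 0 }
  );
}

// Whole-number percentage, rounded to the nearest integer.
function ratePercent(passed, total) {
  return total === 0 ? 0 : Math.round((passed * 100) / total);
}

// Fixed-width bar; a partially filled cell is rounded down (assumption).
function progressBar(passed, total, width = 10) {
  const filled = total === 0 ? 0 : Math.floor((passed * width) / total);
  return '█'.repeat(filled) + '░'.repeat(width - filled);
}

const { passed, total } = totals(geminiByTask);
console.log(`Gemini ${passed}/${total} ${ratePercent(passed, total)}% ${progressBar(passed, total)}`);
```

Multiplying before dividing (`passed * 100 / total`) keeps the intermediate values exact for counts of this size, which avoids off-by-one results from floating-point rounding.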

Supported Local Models

| Model | Pull command | RAM needed | Notes |
|---|---|---|---|
| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default |
| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output |
| `phi4:14b` | `ollama pull phi4:14b` | ~9GB | Good reasoning |
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ |

CI Integration

Add to your GitHub Actions workflow:

- name: Run extraction evals
  env:
    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
    EXTRACTION_SERVICE_URL: http://localhost:4005
  run: pnpm --filter @lysnrai/extraction-service eval:ci

The service must be started before running evals in CI.
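
One way to make sure the service is up before the eval step runs is to poll it until it answers. The sketch below is a hedged illustration, not part of the suite: `waitForReady` and `httpProbe` are hypothetical helpers, the `/health` path is an assumption about the service, and the global `fetch` requires Node 18 or later.

```javascript
// Poll a readiness probe until it succeeds or the attempts run out.
// `probe` is any async function that resolves to true once the service is ready.
// Returns the attempt number that succeeded, or 0 if none did.
async function waitForReady(probe, { attempts = 30, delayMs = 1000 } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      if (await probe()) return i;
    } catch (err) {
      // A refused connection or similar error counts as "not ready yet".
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return 0;
}

// Example probe. ASSUMPTION: the service answers GET /health with a 2xx status.
function httpProbe(baseUrl) {
  return async () => (await fetch(`${baseUrl}/health`)).ok;
}

// Usage (commented out so the sketch runs without a live service):
// waitForReady(httpProbe(process.env.EXTRACTION_SERVICE_URL)).then((n) => {
//   if (n === 0) process.exit(1);
// });
```

Passing the probe in as a function keeps the retry loop independent of how readiness is actually detected, so the same loop works if the real check turns out to be a different endpoint.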

Interpreting Results

promptfoo outputs a table showing pass/fail per assertion. A case fails if any assertion fails.
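
Because a single failing assertion fails the whole case, the case-level and assertion-level pass counts can differ. A minimal sketch of that rule follows; the record shape and function names are illustrative and are not promptfoo's output schema.

```javascript
// Hypothetical results for two cases with two assertions each.
const cases = [
  { description: 'triage: work signal', assertions: [{ pass: true }, { pass: true }] },
  { description: 'triage: money signal', assertions: [{ pass: true }, { pass: false }] },
];

// A case passes only when every one of its assertions passes.
function casePassed(testCase) {
  return testCase.assertions.every((a) => a.pass);
}

// Tally both levels so the difference between them is visible.
function summarize(allCases) {
  const assertions = allCases.flatMap((c) => c.assertions);
  return {
    casesPassed: allCases.filter(casePassed).length,
    casesTotal: allCases.length,
    assertionsPassed: assertions.filter((a) => a.pass).length,
    assertionsTotal: assertions.length,
  };
}

console.log(summarize(cases));
```

Here three of four assertions pass, yet only one of two cases does. Note that `every` returns true for an empty list, so in this sketch a case with no assertions counts as passing.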

Common failure patterns:

- Missing class: the model didn't extract that entity type; consider adding more examples to the task seed
- Wrong attribute: `brain_signal.brain` or `emotion.valence` is incorrect; refine the task prompt in seed.ts
- Latency > 15s: the sidecar is overloaded or the model is cold-starting; check the Python sidecar logs

Thresholds

- Latency: 15s max per extraction (default; adjust the threshold in promptfoo.yaml)
- Pass rate target: 100% for golden cases (the inputs are deterministic enough that every case is expected to pass)
- LLM-as-judge: not yet implemented; add it once there is enough production data to define rubrics
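
The latency and pass-rate thresholds can be expressed as a single gate over a run's results. The sketch below assumes a hypothetical per-case record of the form `{ passed, latencyMs }`; the constants restate the values above, and nothing here reflects how promptfoo.yaml enforces them.

```javascript
// Thresholds restated from the list above.
const MAX_LATENCY_MS = 15000; // 15s max per extraction
const TARGET_PASS_RATE = 1.0; // 100% for golden cases

// Hypothetical per-case records: { passed: boolean, latencyMs: number }.
function checkThresholds(results) {
  const slowCount = results.filter((r) => r.latencyMs > MAX_LATENCY_MS).length;
  const passedCount = results.filter((r) => r.passed).length;
  const passRate = results.length === 0 ? 0 : passedCount / results.length;
  return {
    passRate,
    slowCount,
    ok: slowCount === 0 && passRate >= TARGET_PASS_RATE,
  };
}

console.log(
  checkThresholds([
    { passed: true, latencyMs: 1200 },
    { passed: true, latencyMs: 16000 },
  ])
);
```

In this example both cases pass their assertions, but the second exceeds the latency limit, so the gate reports `ok: false` with `slowCount: 1`.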
