feat(extraction-service): scaffold promptfoo eval suite with 19 test cases

- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API covering all 5 built-in tasks (transcript, triage, memory-insight, reflection-enrichment, bug-report-extraction) - Add evals/fixtures/golden.json: machine-readable golden input/output fixtures - Add evals/run-evals.sh: shell runner with health checks, auth token handling, task filtering, and CI mode - Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration
2026-02-19 12:19:16 -08:00 · 2026-02-19 12:19:16 -08:00 · acd4c3542b
commit acd4c3542b
parent 4a659bf107
4 changed files with 779 additions and 0 deletions
--- a/services/extraction-service/evals/README.md
+++ b/services/extraction-service/evals/README.md
@ -0,0 +1,194 @@
+# Extraction Service — LLM Evals
+
+Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).
+
+## Structure
+
+```
+evals/
+├── promptfoo.yaml          # Main eval config — all test cases + assertions
+├── fixtures/
+│   └── golden.json         # Golden input/output fixtures (machine-readable)
+├── run-evals.sh            # Shell runner with health check + auth guard
+└── README.md               # This file
+```
+
+## Prerequisites
+
+1. **extraction-service running** on port 4005:
+
+   ```bash
+   pnpm dev
+   ```
+
+2. **Auth token** from platform-service:
+
+   ```bash
+   # POST /api/auth/login → copy the token
+   export EXTRACTION_EVAL_TOKEN=<your-jwt>
+   ```
+
+3. **promptfoo** (installed automatically via `npx`, or globally):
+   ```bash
+   npm install -g promptfoo
+   ```
+
+## Running Evals
+
+```bash
+# All tasks
+pnpm eval
+
+# Single task
+pnpm eval:task triage
+pnpm eval:task transcript-extraction
+pnpm eval:task memory-insight
+pnpm eval:task reflection-enrichment
+pnpm eval:task bug-report-extraction
+
+# CI mode (exits non-zero on any failure)
+pnpm eval:ci
+
+# JSON output (for dashboards / reporting)
+pnpm eval:json
+```
+
+## What's Tested
+
+| Task                    | Cases | Key Assertions                                                            |
+| ----------------------- | ----- | ------------------------------------------------------------------------- |
+| `transcript-extraction` | 4     | action_item, deadline, person, decision, question classes present         |
+| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence, date_reference |
+| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme               |
+| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress           |
+| `bug-report-extraction` | 2     | all 5 fields extracted, severity level attribute                          |
+
+**Total: 19 eval cases, 60+ assertions**
+
+## Adding New Cases
+
+1. Add a test case to `promptfoo.yaml` under `tests:`:
+
+   ```yaml
+   - description: 'triage: home brain signal for household content'
+     vars:
+       taskId: triage
+       text: 'Need to fix the leaking faucet in the kitchen this weekend.'
+     assert:
+       - type: javascript
+         value: output.classes.includes('action')
+       - type: javascript
+         value: |
+           output.extractions.some(e =>
+             e.extraction_class === 'brain_signal' &&
+             e.attributes?.brain === 'home'
+           )
+   ```
+
+2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.
+
+## Local OSS Models (Ollama)
+
+Run the same 19 eval cases against a local model — zero API cost, fully private.
+
+### Install Ollama
+
+```bash
+# 1. Install
+brew install ollama
+
+# 2. Start server (keep this terminal open)
+ollama serve
+
+# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
+ollama pull llama3.1:8b
+
+# If corp proxy blocks it, try bypassing:
+HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b
+
+# 4. Verify
+ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
+```
+
+### Run Ollama Evals
+
+```bash
+# Default model: llama3.1:8b
+pnpm eval:ollama
+
+# Different model (once pulled)
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+OLLAMA_MODEL=phi4:14b pnpm eval:ollama
+```
+
+> **Note:** Ollama evals hit the model directly (no extraction-service needed).
+> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.
+
+### Compare Gemini vs Ollama
+
+Runs both suites and prints a side-by-side pass-rate table:
+
+```bash
+# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
+pnpm eval:compare
+
+# With a different local model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
+```
+
+Example output:
+
+```
+  Provider     Passed     Total    Rate     Progress
+  ───────────────────────────────────────────────────────
+  Gemini       57         60       95%      █████████░
+  Ollama       48         60       80%      ████████░░
+
+  Per-task breakdown:
+  Task                      Gemini       Ollama
+  ──────────────────────────────────────────────────
+  triage                    20/20 (100%) 16/20 (80%)
+  transcript-extraction     15/16 (94%)  12/16 (75%)
+  reflection-enrichment     12/12 (100%) 10/12 (83%)
+  memory-insight            8/8 (100%)   7/8 (88%)
+  bug-report-extraction     2/4 (50%)    3/4 (75%)
+```
+
+### Supported Local Models
+
+| Model          | Pull command               | RAM needed | Notes                       |
+| -------------- | -------------------------- | ---------- | --------------------------- |
+| `llama3.1:8b`  | `ollama pull llama3.1:8b`  | ~6GB       | Best default                |
+| `qwen2.5:7b`   | `ollama pull qwen2.5:7b`   | ~5GB       | Strong JSON output          |
+| `phi4:14b`     | `ollama pull phi4`         | ~9GB       | Good reasoning              |
+| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB      | Best quality, needs M2 Max+ |
+
+## CI Integration
+
+Add to your GitHub Actions workflow:
+
+```yaml
+- name: Run extraction evals
+  env:
+    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
+    EXTRACTION_SERVICE_URL: http://localhost:4005
+  run: pnpm --filter @lysnrai/extraction-service eval:ci
+```
+
+The service must be started before running evals in CI.
+
+## Interpreting Results
+
+promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.
+
+Common failure patterns:
+
+- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed
+- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts`
+- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs
+
+## Thresholds
+
+- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
+- **Pass rate target:** 100% for golden cases (these are deterministic enough inputs)
+- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics
--- a/services/extraction-service/evals/fixtures/golden.json
+++ b/services/extraction-service/evals/fixtures/golden.json
@ -0,0 +1,177 @@
+{
+  "description": "Golden fixtures for extraction-service evals. Each entry defines the minimum expected extraction classes for a given input. Used by the promptfoo eval suite and can also be consumed by custom assertion scripts.",
+  "version": "1.0.0",
+  "tasks": {
+    "transcript-extraction": {
+      "expectedClasses": ["action_item", "decision", "question", "deadline", "person", "topic"],
+      "cases": [
+        {
+          "id": "tc-001",
+          "input": "John said we need to ship the feature by Friday. Sarah agreed to handle the testing.",
+          "mustContainClasses": ["action_item", "deadline", "person"],
+          "mustContainText": ["friday", "sarah", "john"],
+          "minExtractions": 3
+        },
+        {
+          "id": "tc-002",
+          "input": "The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.",
+          "mustContainClasses": ["decision", "action_item", "person", "deadline"],
+          "mustContainText": ["q3", "alice"],
+          "minExtractions": 3
+        },
+        {
+          "id": "tc-003",
+          "input": "Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.",
+          "mustContainClasses": ["question", "person"],
+          "mustContainText": ["bob"],
+          "minExtractions": 2
+        },
+        {
+          "id": "tc-004",
+          "input": "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?",
+          "mustContainClasses": ["action_item", "person"],
+          "mustContainText": ["maria", "tom"],
+          "minExtractions": 3
+        }
+      ]
+    },
+    "triage": {
+      "expectedClasses": ["topic", "entity", "action", "emotion", "date_reference", "brain_signal"],
+      "cases": [
+        {
+          "id": "tr-001",
+          "input": "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost.",
+          "mustContainClasses": ["action", "date_reference", "emotion"],
+          "mustContainBrainSignal": { "brain": "health", "minConfidence": 0.5 },
+          "minExtractions": 3
+        },
+        {
+          "id": "tr-002",
+          "input": "Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.",
+          "mustContainClasses": ["action", "date_reference"],
+          "mustContainBrainSignal": { "brain": "work", "minConfidence": 0.5 },
+          "minExtractions": 2
+        },
+        {
+          "id": "tr-003",
+          "input": "I need to pay the credit card bill before the 15th or I'll get charged interest.",
+          "mustContainClasses": ["action", "date_reference"],
+          "mustContainBrainSignal": { "brain": "money", "minConfidence": 0.5 },
+          "minExtractions": 2
+        },
+        {
+          "id": "tr-004",
+          "input": "Feeling really overwhelmed today. Too many things on my plate and I can't focus.",
+          "mustContainClasses": ["emotion"],
+          "mustContainEmotionValence": "negative",
+          "minExtractions": 1
+        },
+        {
+          "id": "tr-005",
+          "input": "Doctor said I need to exercise more. Also need to check my 401k contributions before year end.",
+          "mustContainClasses": ["action"],
+          "mustContainBrainSignals": [
+            { "brain": "health", "minConfidence": 0.4 },
+            { "brain": "money", "minConfidence": 0.4 }
+          ],
+          "minExtractions": 2
+        }
+      ]
+    },
+    "memory-insight": {
+      "expectedClasses": ["pattern", "recurring_theme", "relationship", "milestone"],
+      "cases": [
+        {
+          "id": "mi-001",
+          "input": "Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.",
+          "mustContainClasses": ["pattern"],
+          "mustContainPatternFrequency": "recurring",
+          "minExtractions": 1
+        },
+        {
+          "id": "mi-002",
+          "input": "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop.",
+          "mustContainClasses": ["pattern", "relationship"],
+          "minExtractions": 2
+        },
+        {
+          "id": "mi-003",
+          "input": "Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.",
+          "mustContainClasses": ["milestone"],
+          "minExtractions": 2
+        },
+        {
+          "id": "mi-004",
+          "input": "Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.",
+          "mustContainClasses": ["recurring_theme"],
+          "minExtractions": 1
+        }
+      ]
+    },
+    "reflection-enrichment": {
+      "expectedClasses": ["emotional_state", "accomplishment", "concern", "goal_progress"],
+      "cases": [
+        {
+          "id": "re-001",
+          "input": "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week.",
+          "mustContainClasses": ["accomplishment", "concern", "emotional_state"],
+          "minExtractions": 3
+        },
+        {
+          "id": "re-002",
+          "input": "Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.",
+          "mustContainClasses": ["emotional_state", "accomplishment"],
+          "mustContainEmotionValence": "positive",
+          "minExtractions": 2
+        },
+        {
+          "id": "re-003",
+          "input": "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month.",
+          "mustContainClasses": ["goal_progress", "accomplishment"],
+          "minExtractions": 2
+        },
+        {
+          "id": "re-004",
+          "input": "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up.",
+          "mustContainClasses": ["accomplishment", "concern"],
+          "minExtractions": 2
+        }
+      ]
+    },
+    "bug-report-extraction": {
+      "expectedClasses": [
+        "steps_to_reproduce",
+        "expected_behavior",
+        "actual_behavior",
+        "affected_component",
+        "severity"
+      ],
+      "cases": [
+        {
+          "id": "br-001",
+          "input": "When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.",
+          "mustContainClasses": [
+            "steps_to_reproduce",
+            "expected_behavior",
+            "actual_behavior",
+            "affected_component",
+            "severity"
+          ],
+          "mustContainSeverityLevel": "critical",
+          "minExtractions": 4
+        },
+        {
+          "id": "br-002",
+          "input": "Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.",
+          "mustContainClasses": [
+            "steps_to_reproduce",
+            "expected_behavior",
+            "actual_behavior",
+            "affected_component"
+          ],
+          "minExtractions": 3
+        }
+      ]
+    }
+  }
+}
--- a/services/extraction-service/evals/promptfoo.yaml
+++ b/services/extraction-service/evals/promptfoo.yaml
@ -0,0 +1,319 @@
+# promptfoo eval config for extraction-service
+# Docs: https://promptfoo.dev/docs/configuration/guide
+#
+# Usage:
+#   pnpm eval                    # run all evals (service must be running on port 4005)
+#   pnpm eval:task triage        # run a single task suite
+#   pnpm eval:ci                 # CI mode — fail on any assertion failure
+#
+# Prerequisites:
+#   1. extraction-service running: pnpm dev (port 4005)
+#   2. EXTRACTION_EVAL_TOKEN set in env (any valid JWT from platform-service)
+
+description: Extraction Service — LLM Output Quality Evals
+
+# ── Provider: the extraction-service HTTP API ────────────────────
+providers:
+  - id: http
+    config:
+      url: "{{env.EXTRACTION_SERVICE_URL | default('http://localhost:4005')}}/api/extract"
+      method: POST
+      headers:
+        Content-Type: application/json
+        Authorization: 'Bearer {{env.EXTRACTION_EVAL_TOKEN}}'
+      body:
+        text: '{{text}}'
+        taskId: '{{taskId}}'
+        productId: "{{env.EVAL_PRODUCT_ID | default('lysnrai')}}"
+      transformResponse: |
+        return {
+          extractions: json.extractions,
+          classes: json.extractions.map(e => e.extraction_class),
+          texts: json.extractions.map(e => e.extraction_text),
+          durationMs: json.metadata?.durationMs,
+        };
+
+# ── Default assertion thresholds ────────────────────────────────
+defaultTest:
+  options:
+    timeoutMs: 30000
+  assert:
+    - type: latency
+      threshold: 15000
+
+# ── Test suites per task ─────────────────────────────────────────
+tests:
+  # ── transcript-extraction ──────────────────────────────────────
+  - description: 'transcript: extracts action item and deadline'
+    vars:
+      taskId: transcript-extraction
+      text: 'John said we need to ship the feature by Friday. Sarah agreed to handle the testing.'
+    assert:
+      - type: javascript
+        value: |
+          output.classes.includes('action_item')
+      - type: javascript
+        value: |
+          output.classes.includes('deadline')
+      - type: javascript
+        value: |
+          output.classes.includes('person')
+      - type: javascript
+        value: |
+          output.texts.some(t => t.toLowerCase().includes('friday') || t.toLowerCase().includes('ship'))
+
+  - description: 'transcript: extracts decision from meeting note'
+    vars:
+      taskId: transcript-extraction
+      text: 'The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('decision')
+      - type: javascript
+        value: output.classes.includes('action_item')
+      - type: javascript
+        value: output.classes.includes('person')
+      - type: javascript
+        value: output.classes.includes('deadline')
+
+  - description: 'transcript: extracts question from discussion'
+    vars:
+      taskId: transcript-extraction
+      text: 'Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('question')
+      - type: javascript
+        value: output.classes.includes('person')
+
+  - description: 'transcript: handles multi-person transcript'
+    vars:
+      taskId: transcript-extraction
+      text: "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?"
+    assert:
+      - type: javascript
+        value: output.classes.includes('action_item')
+      - type: javascript
+        value: output.classes.includes('person')
+      - type: javascript
+        value: output.texts.some(t => t.toLowerCase().includes('maria') || t.toLowerCase().includes('tom'))
+      - type: javascript
+        value: output.extractions.length >= 3
+
+  # ── triage ─────────────────────────────────────────────────────
+  - description: 'triage: health brain signal for medical content'
+    vars:
+      taskId: triage
+      text: "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost."
+    assert:
+      - type: javascript
+        value: output.classes.includes('action')
+      - type: javascript
+        value: output.classes.includes('date_reference')
+      - type: javascript
+        value: output.classes.includes('emotion')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'brain_signal' &&
+            e.attributes?.brain === 'health'
+          )
+
+  - description: 'triage: work brain signal for project content'
+    vars:
+      taskId: triage
+      text: 'Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('action')
+      - type: javascript
+        value: output.classes.includes('date_reference')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'brain_signal' &&
+            e.attributes?.brain === 'work'
+          )
+
+  - description: 'triage: money brain signal for financial content'
+    vars:
+      taskId: triage
+      text: "I need to pay the credit card bill before the 15th or I'll get charged interest."
+    assert:
+      - type: javascript
+        value: output.classes.includes('action')
+      - type: javascript
+        value: output.classes.includes('date_reference')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'brain_signal' &&
+            e.attributes?.brain === 'money'
+          )
+
+  - description: 'triage: negative emotion detected'
+    vars:
+      taskId: triage
+      text: "Feeling really overwhelmed today. Too many things on my plate and I can't focus."
+    assert:
+      - type: javascript
+        value: output.classes.includes('emotion')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'emotion' &&
+            e.attributes?.valence === 'negative'
+          )
+
+  - description: 'triage: multiple brain signals for mixed content'
+    vars:
+      taskId: triage
+      text: 'Doctor said I need to exercise more. Also need to check my 401k contributions before year end.'
+    assert:
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'brain_signal' && e.attributes?.brain === 'health'
+          )
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'brain_signal' && e.attributes?.brain === 'money'
+          )
+
+  # ── memory-insight ─────────────────────────────────────────────
+  - description: 'memory-insight: detects recurring pattern'
+    vars:
+      taskId: memory-insight
+      text: 'Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('pattern')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'pattern' &&
+            e.attributes?.frequency === 'recurring'
+          )
+
+  - description: 'memory-insight: detects relationship between items'
+    vars:
+      taskId: memory-insight
+      text: "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop."
+    assert:
+      - type: javascript
+        value: output.classes.includes('relationship')
+      - type: javascript
+        value: output.classes.includes('pattern')
+
+  - description: 'memory-insight: detects milestone'
+    vars:
+      taskId: memory-insight
+      text: 'Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('milestone')
+      - type: javascript
+        value: output.extractions.length >= 2
+
+  - description: 'memory-insight: detects recurring theme across entries'
+    vars:
+      taskId: memory-insight
+      text: 'Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('recurring_theme')
+      - type: javascript
+        value: output.extractions.length >= 1
+
+  # ── reflection-enrichment ──────────────────────────────────────
+  - description: 'reflection: extracts accomplishment and concern'
+    vars:
+      taskId: reflection-enrichment
+      text: "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week."
+    assert:
+      - type: javascript
+        value: output.classes.includes('accomplishment')
+      - type: javascript
+        value: output.classes.includes('concern')
+      - type: javascript
+        value: output.classes.includes('emotional_state')
+
+  - description: 'reflection: positive emotional state detected'
+    vars:
+      taskId: reflection-enrichment
+      text: 'Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('emotional_state')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'emotional_state' &&
+            e.attributes?.valence === 'positive'
+          )
+      - type: javascript
+        value: output.classes.includes('accomplishment')
+
+  - description: 'reflection: goal progress detected'
+    vars:
+      taskId: reflection-enrichment
+      text: "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month."
+    assert:
+      - type: javascript
+        value: output.classes.includes('goal_progress')
+      - type: javascript
+        value: output.classes.includes('accomplishment')
+
+  - description: 'reflection: mixed positive and negative signals'
+    vars:
+      taskId: reflection-enrichment
+      text: "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up."
+    assert:
+      - type: javascript
+        value: output.classes.includes('accomplishment')
+      - type: javascript
+        value: output.classes.includes('concern')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'emotional_state' &&
+            e.attributes?.valence === 'positive'
+          )
+
+  # ── bug-report-extraction ──────────────────────────────────────
+  - description: 'bug-report: extracts all 5 fields'
+    vars:
+      taskId: bug-report-extraction
+      text: 'When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('steps_to_reproduce')
+      - type: javascript
+        value: output.classes.includes('expected_behavior')
+      - type: javascript
+        value: output.classes.includes('actual_behavior')
+      - type: javascript
+        value: output.classes.includes('affected_component')
+      - type: javascript
+        value: output.classes.includes('severity')
+      - type: javascript
+        value: |
+          output.extractions.some(e =>
+            e.extraction_class === 'severity' &&
+            e.attributes?.level === 'critical'
+          )
+
+  - description: 'bug-report: extracts steps and component from login bug'
+    vars:
+      taskId: bug-report-extraction
+      text: 'Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.'
+    assert:
+      - type: javascript
+        value: output.classes.includes('steps_to_reproduce')
+      - type: javascript
+        value: output.classes.includes('expected_behavior')
+      - type: javascript
+        value: output.classes.includes('actual_behavior')
+      - type: javascript
+        value: output.classes.includes('affected_component')
--- a/services/extraction-service/evals/run-evals.sh
+++ b/services/extraction-service/evals/run-evals.sh
@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+# run-evals.sh — Run promptfoo evals against the extraction-service
+#
+# Usage:
+#   ./evals/run-evals.sh                  # run all evals
+#   ./evals/run-evals.sh --task triage    # filter by task (grep on description)
+#   ./evals/run-evals.sh --ci             # CI mode: exit 1 on any failure
+#   ./evals/run-evals.sh --output json    # output results as JSON
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SERVICE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+# ── Defaults ────────────────────────────────────────────────────
+EXTRACTION_SERVICE_URL="${EXTRACTION_SERVICE_URL:-http://localhost:4005}"
+EVAL_PRODUCT_ID="${EVAL_PRODUCT_ID:-lysnrai}"
+CI_MODE=false
+OUTPUT_FORMAT="text"
+TASK_FILTER=""
+
+# ── Parse args ──────────────────────────────────────────────────
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --ci) CI_MODE=true; shift ;;
+    --output) OUTPUT_FORMAT="$2"; shift 2 ;;
+    --task) TASK_FILTER="$2"; shift 2 ;;
+    *) echo "Unknown arg: $1"; exit 1 ;;
+  esac
+done
+
+# ── Check service is reachable ───────────────────────────────────
+echo "→ Checking extraction-service at $EXTRACTION_SERVICE_URL ..."
+if ! curl -sf "$EXTRACTION_SERVICE_URL/health" > /dev/null 2>&1; then
+  echo "✗ extraction-service is not running at $EXTRACTION_SERVICE_URL"
+  echo "  Start it with: pnpm dev (in services/extraction-service/)"
+  exit 1
+fi
+echo "✓ Service is up"
+
+# ── Check EXTRACTION_EVAL_TOKEN ──────────────────────────────────
+if [[ -z "${EXTRACTION_EVAL_TOKEN:-}" ]]; then
+  echo "⚠ EXTRACTION_EVAL_TOKEN is not set — evals will fail auth"
+  echo "  Get a token from platform-service: POST /api/auth/login"
+  echo "  Then: export EXTRACTION_EVAL_TOKEN=<token>"
+  if [[ "$CI_MODE" == "true" ]]; then
+    exit 1
+  fi
+fi
+
+# ── Build promptfoo args ─────────────────────────────────────────
+PROMPTFOO_ARGS=(
+  eval
+  --config "$SCRIPT_DIR/promptfoo.yaml"
+  --output "$OUTPUT_FORMAT"
+  --no-cache
+)
+
+if [[ "$CI_MODE" == "true" ]]; then
+  PROMPTFOO_ARGS+=(--no-progress-bar)
+fi
+
+if [[ -n "$TASK_FILTER" ]]; then
+  PROMPTFOO_ARGS+=(--filter-description "$TASK_FILTER")
+fi
+
+# ── Run ─────────────────────────────────────────────────────────
+echo "→ Running evals (task: ${TASK_FILTER:-all}) ..."
+echo ""
+
+export EXTRACTION_SERVICE_URL
+export EXTRACTION_EVAL_TOKEN
+export EVAL_PRODUCT_ID
+
+cd "$SERVICE_DIR"
+npx promptfoo "${PROMPTFOO_ARGS[@]}"
+
+EXIT_CODE=$?
+
+if [[ $EXIT_CODE -eq 0 ]]; then
+  echo ""
+  echo "✓ All evals passed"
+else
+  echo ""
+  echo "✗ Some evals failed (exit $EXIT_CODE)"
+  if [[ "$CI_MODE" == "true" ]]; then
+    exit $EXIT_CODE
+  fi
+fi