feat(extraction-service): scaffold promptfoo eval suite with 19 test cases
- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API covering all 5 built-in tasks (transcript, triage, memory-insight, reflection-enrichment, bug-report-extraction) - Add evals/fixtures/golden.json: machine-readable golden input/output fixtures - Add evals/run-evals.sh: shell runner with health checks, auth token handling, task filtering, and CI mode - Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration
This commit is contained in:
parent
4a659bf107
commit
acd4c3542b
194
services/extraction-service/evals/README.md
Normal file
194
services/extraction-service/evals/README.md
Normal file
@ -0,0 +1,194 @@
|
||||
# Extraction Service — LLM Evals
|
||||
|
||||
Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).
|
||||
|
||||
## Structure
|
||||
|
||||
```
|
||||
evals/
|
||||
├── promptfoo.yaml # Main eval config — all test cases + assertions
|
||||
├── fixtures/
|
||||
│ └── golden.json # Golden input/output fixtures (machine-readable)
|
||||
├── run-evals.sh # Shell runner with health check + auth guard
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. **extraction-service running** on port 4005:
|
||||
|
||||
```bash
|
||||
pnpm dev
|
||||
```
|
||||
|
||||
2. **Auth token** from platform-service:
|
||||
|
||||
```bash
|
||||
# POST /api/auth/login → copy the token
|
||||
export EXTRACTION_EVAL_TOKEN=<your-jwt>
|
||||
```
|
||||
|
||||
3. **promptfoo** (installed automatically via `npx`, or globally):
|
||||
```bash
|
||||
npm install -g promptfoo
|
||||
```
|
||||
|
||||
## Running Evals
|
||||
|
||||
```bash
|
||||
# All tasks
|
||||
pnpm eval
|
||||
|
||||
# Single task
|
||||
pnpm eval:task triage
|
||||
pnpm eval:task transcript-extraction
|
||||
pnpm eval:task memory-insight
|
||||
pnpm eval:task reflection-enrichment
|
||||
pnpm eval:task bug-report-extraction
|
||||
|
||||
# CI mode (exits non-zero on any failure)
|
||||
pnpm eval:ci
|
||||
|
||||
# JSON output (for dashboards / reporting)
|
||||
pnpm eval:json
|
||||
```
|
||||
|
||||
## What's Tested
|
||||
|
||||
| Task | Cases | Key Assertions |
|
||||
| ----------------------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present |
|
||||
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference |
|
||||
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
|
||||
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
|
||||
| `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute |
|
||||
|
||||
**Total: 19 eval cases, 60+ assertions**
|
||||
|
||||
## Adding New Cases
|
||||
|
||||
1. Add a test case to `promptfoo.yaml` under `tests:`:
|
||||
|
||||
```yaml
|
||||
- description: 'triage: home brain signal for household content'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: 'Need to fix the leaking faucet in the kitchen this weekend.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('action')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' &&
|
||||
e.attributes?.brain === 'home'
|
||||
)
|
||||
```
|
||||
|
||||
2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.
|
||||
|
||||
## Local OSS Models (Ollama)
|
||||
|
||||
Run the same 19 eval cases against a local model — zero API cost, fully private.
|
||||
|
||||
### Install Ollama
|
||||
|
||||
```bash
|
||||
# 1. Install
|
||||
brew install ollama
|
||||
|
||||
# 2. Start server (keep this terminal open)
|
||||
ollama serve
|
||||
|
||||
# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
|
||||
ollama pull llama3.1:8b
|
||||
|
||||
# If corp proxy blocks it, try bypassing:
|
||||
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b
|
||||
|
||||
# 4. Verify
|
||||
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
|
||||
```
|
||||
|
||||
### Run Ollama Evals
|
||||
|
||||
```bash
|
||||
# Default model: llama3.1:8b
|
||||
pnpm eval:ollama
|
||||
|
||||
# Different model (once pulled)
|
||||
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
|
||||
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
|
||||
```
|
||||
|
||||
> **Note:** Ollama evals hit the model directly (no extraction-service needed).
|
||||
> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.
|
||||
|
||||
### Compare Gemini vs Ollama
|
||||
|
||||
Runs both suites and prints a side-by-side pass-rate table:
|
||||
|
||||
```bash
|
||||
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
|
||||
pnpm eval:compare
|
||||
|
||||
# With a different local model
|
||||
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
|
||||
```
|
||||
|
||||
Example output:
|
||||
|
||||
```
|
||||
Provider Passed Total Rate Progress
|
||||
───────────────────────────────────────────────────────
|
||||
Gemini 57 60 95% █████████░
|
||||
Ollama 48 60 80% ████████░░
|
||||
|
||||
Per-task breakdown:
|
||||
Task Gemini Ollama
|
||||
──────────────────────────────────────────────────
|
||||
triage 20/20 (100%) 16/20 (80%)
|
||||
transcript-extraction 15/16 (94%) 12/16 (75%)
|
||||
reflection-enrichment 12/12 (100%) 10/12 (83%)
|
||||
memory-insight 8/8 (100%) 7/8 (88%)
|
||||
bug-report-extraction 2/4 (50%) 3/4 (75%)
|
||||
```
|
||||
|
||||
### Supported Local Models
|
||||
|
||||
| Model | Pull command | RAM needed | Notes |
|
||||
| -------------- | -------------------------- | ---------- | --------------------------- |
|
||||
| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default |
|
||||
| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output |
|
||||
| `phi4:14b` | `ollama pull phi4` | ~9GB | Good reasoning |
|
||||
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ |
|
||||
|
||||
## CI Integration
|
||||
|
||||
Add to your GitHub Actions workflow:
|
||||
|
||||
```yaml
|
||||
- name: Run extraction evals
|
||||
env:
|
||||
EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
|
||||
EXTRACTION_SERVICE_URL: http://localhost:4005
|
||||
run: pnpm --filter @lysnrai/extraction-service eval:ci
|
||||
```
|
||||
|
||||
The service must be started before running evals in CI.
|
||||
|
||||
## Interpreting Results
|
||||
|
||||
promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.
|
||||
|
||||
Common failure patterns:
|
||||
|
||||
- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed
|
||||
- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts`
|
||||
- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs
|
||||
|
||||
## Thresholds
|
||||
|
||||
- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
|
||||
- **Pass rate target:** 100% for golden cases (these are deterministic enough inputs)
|
||||
- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics
|
||||
177
services/extraction-service/evals/fixtures/golden.json
Normal file
177
services/extraction-service/evals/fixtures/golden.json
Normal file
@ -0,0 +1,177 @@
|
||||
{
|
||||
"description": "Golden fixtures for extraction-service evals. Each entry defines the minimum expected extraction classes for a given input. Used by the promptfoo eval suite and can also be consumed by custom assertion scripts.",
|
||||
"version": "1.0.0",
|
||||
"tasks": {
|
||||
"transcript-extraction": {
|
||||
"expectedClasses": ["action_item", "decision", "question", "deadline", "person", "topic"],
|
||||
"cases": [
|
||||
{
|
||||
"id": "tc-001",
|
||||
"input": "John said we need to ship the feature by Friday. Sarah agreed to handle the testing.",
|
||||
"mustContainClasses": ["action_item", "deadline", "person"],
|
||||
"mustContainText": ["friday", "sarah", "john"],
|
||||
"minExtractions": 3
|
||||
},
|
||||
{
|
||||
"id": "tc-002",
|
||||
"input": "The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.",
|
||||
"mustContainClasses": ["decision", "action_item", "person", "deadline"],
|
||||
"mustContainText": ["q3", "alice"],
|
||||
"minExtractions": 3
|
||||
},
|
||||
{
|
||||
"id": "tc-003",
|
||||
"input": "Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.",
|
||||
"mustContainClasses": ["question", "person"],
|
||||
"mustContainText": ["bob"],
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "tc-004",
|
||||
"input": "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?",
|
||||
"mustContainClasses": ["action_item", "person"],
|
||||
"mustContainText": ["maria", "tom"],
|
||||
"minExtractions": 3
|
||||
}
|
||||
]
|
||||
},
|
||||
"triage": {
|
||||
"expectedClasses": ["topic", "entity", "action", "emotion", "date_reference", "brain_signal"],
|
||||
"cases": [
|
||||
{
|
||||
"id": "tr-001",
|
||||
"input": "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost.",
|
||||
"mustContainClasses": ["action", "date_reference", "emotion"],
|
||||
"mustContainBrainSignal": { "brain": "health", "minConfidence": 0.5 },
|
||||
"minExtractions": 3
|
||||
},
|
||||
{
|
||||
"id": "tr-002",
|
||||
"input": "Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.",
|
||||
"mustContainClasses": ["action", "date_reference"],
|
||||
"mustContainBrainSignal": { "brain": "work", "minConfidence": 0.5 },
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "tr-003",
|
||||
"input": "I need to pay the credit card bill before the 15th or I'll get charged interest.",
|
||||
"mustContainClasses": ["action", "date_reference"],
|
||||
"mustContainBrainSignal": { "brain": "money", "minConfidence": 0.5 },
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "tr-004",
|
||||
"input": "Feeling really overwhelmed today. Too many things on my plate and I can't focus.",
|
||||
"mustContainClasses": ["emotion"],
|
||||
"mustContainEmotionValence": "negative",
|
||||
"minExtractions": 1
|
||||
},
|
||||
{
|
||||
"id": "tr-005",
|
||||
"input": "Doctor said I need to exercise more. Also need to check my 401k contributions before year end.",
|
||||
"mustContainClasses": ["action"],
|
||||
"mustContainBrainSignals": [
|
||||
{ "brain": "health", "minConfidence": 0.4 },
|
||||
{ "brain": "money", "minConfidence": 0.4 }
|
||||
],
|
||||
"minExtractions": 2
|
||||
}
|
||||
]
|
||||
},
|
||||
"memory-insight": {
|
||||
"expectedClasses": ["pattern", "recurring_theme", "relationship", "milestone"],
|
||||
"cases": [
|
||||
{
|
||||
"id": "mi-001",
|
||||
"input": "Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.",
|
||||
"mustContainClasses": ["pattern"],
|
||||
"mustContainPatternFrequency": "recurring",
|
||||
"minExtractions": 1
|
||||
},
|
||||
{
|
||||
"id": "mi-002",
|
||||
"input": "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop.",
|
||||
"mustContainClasses": ["pattern", "relationship"],
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "mi-003",
|
||||
"input": "Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.",
|
||||
"mustContainClasses": ["milestone"],
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "mi-004",
|
||||
"input": "Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.",
|
||||
"mustContainClasses": ["recurring_theme"],
|
||||
"minExtractions": 1
|
||||
}
|
||||
]
|
||||
},
|
||||
"reflection-enrichment": {
|
||||
"expectedClasses": ["emotional_state", "accomplishment", "concern", "goal_progress"],
|
||||
"cases": [
|
||||
{
|
||||
"id": "re-001",
|
||||
"input": "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week.",
|
||||
"mustContainClasses": ["accomplishment", "concern", "emotional_state"],
|
||||
"minExtractions": 3
|
||||
},
|
||||
{
|
||||
"id": "re-002",
|
||||
"input": "Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.",
|
||||
"mustContainClasses": ["emotional_state", "accomplishment"],
|
||||
"mustContainEmotionValence": "positive",
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "re-003",
|
||||
"input": "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month.",
|
||||
"mustContainClasses": ["goal_progress", "accomplishment"],
|
||||
"minExtractions": 2
|
||||
},
|
||||
{
|
||||
"id": "re-004",
|
||||
"input": "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up.",
|
||||
"mustContainClasses": ["accomplishment", "concern"],
|
||||
"minExtractions": 2
|
||||
}
|
||||
]
|
||||
},
|
||||
"bug-report-extraction": {
|
||||
"expectedClasses": [
|
||||
"steps_to_reproduce",
|
||||
"expected_behavior",
|
||||
"actual_behavior",
|
||||
"affected_component",
|
||||
"severity"
|
||||
],
|
||||
"cases": [
|
||||
{
|
||||
"id": "br-001",
|
||||
"input": "When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.",
|
||||
"mustContainClasses": [
|
||||
"steps_to_reproduce",
|
||||
"expected_behavior",
|
||||
"actual_behavior",
|
||||
"affected_component",
|
||||
"severity"
|
||||
],
|
||||
"mustContainSeverityLevel": "critical",
|
||||
"minExtractions": 4
|
||||
},
|
||||
{
|
||||
"id": "br-002",
|
||||
"input": "Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.",
|
||||
"mustContainClasses": [
|
||||
"steps_to_reproduce",
|
||||
"expected_behavior",
|
||||
"actual_behavior",
|
||||
"affected_component"
|
||||
],
|
||||
"minExtractions": 3
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
319
services/extraction-service/evals/promptfoo.yaml
Normal file
319
services/extraction-service/evals/promptfoo.yaml
Normal file
@ -0,0 +1,319 @@
|
||||
# promptfoo eval config for extraction-service
|
||||
# Docs: https://promptfoo.dev/docs/configuration/guide
|
||||
#
|
||||
# Usage:
|
||||
# pnpm eval # run all evals (service must be running on port 4005)
|
||||
# pnpm eval:task triage # run a single task suite
|
||||
# pnpm eval:ci # CI mode — fail on any assertion failure
|
||||
#
|
||||
# Prerequisites:
|
||||
# 1. extraction-service running: pnpm dev (port 4005)
|
||||
# 2. EXTRACTION_EVAL_TOKEN set in env (any valid JWT from platform-service)
|
||||
|
||||
description: Extraction Service — LLM Output Quality Evals
|
||||
|
||||
# ── Provider: the extraction-service HTTP API ────────────────────
|
||||
providers:
|
||||
- id: http
|
||||
config:
|
||||
url: "{{env.EXTRACTION_SERVICE_URL | default('http://localhost:4005')}}/api/extract"
|
||||
method: POST
|
||||
headers:
|
||||
Content-Type: application/json
|
||||
Authorization: 'Bearer {{env.EXTRACTION_EVAL_TOKEN}}'
|
||||
body:
|
||||
text: '{{text}}'
|
||||
taskId: '{{taskId}}'
|
||||
productId: "{{env.EVAL_PRODUCT_ID | default('lysnrai')}}"
|
||||
transformResponse: |
|
||||
return {
|
||||
extractions: json.extractions,
|
||||
classes: json.extractions.map(e => e.extraction_class),
|
||||
texts: json.extractions.map(e => e.extraction_text),
|
||||
durationMs: json.metadata?.durationMs,
|
||||
};
|
||||
|
||||
# ── Default assertion thresholds ────────────────────────────────
|
||||
defaultTest:
|
||||
options:
|
||||
timeoutMs: 30000
|
||||
assert:
|
||||
- type: latency
|
||||
threshold: 15000
|
||||
|
||||
# ── Test suites per task ─────────────────────────────────────────
|
||||
tests:
|
||||
# ── transcript-extraction ──────────────────────────────────────
|
||||
- description: 'transcript: extracts action item and deadline'
|
||||
vars:
|
||||
taskId: transcript-extraction
|
||||
text: 'John said we need to ship the feature by Friday. Sarah agreed to handle the testing.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: |
|
||||
output.classes.includes('action_item')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.classes.includes('deadline')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.classes.includes('person')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.texts.some(t => t.toLowerCase().includes('friday') || t.toLowerCase().includes('ship'))
|
||||
|
||||
- description: 'transcript: extracts decision from meeting note'
|
||||
vars:
|
||||
taskId: transcript-extraction
|
||||
text: 'The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('decision')
|
||||
- type: javascript
|
||||
value: output.classes.includes('action_item')
|
||||
- type: javascript
|
||||
value: output.classes.includes('person')
|
||||
- type: javascript
|
||||
value: output.classes.includes('deadline')
|
||||
|
||||
- description: 'transcript: extracts question from discussion'
|
||||
vars:
|
||||
taskId: transcript-extraction
|
||||
text: 'Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('question')
|
||||
- type: javascript
|
||||
value: output.classes.includes('person')
|
||||
|
||||
- description: 'transcript: handles multi-person transcript'
|
||||
vars:
|
||||
taskId: transcript-extraction
|
||||
text: "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?"
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('action_item')
|
||||
- type: javascript
|
||||
value: output.classes.includes('person')
|
||||
- type: javascript
|
||||
value: output.texts.some(t => t.toLowerCase().includes('maria') || t.toLowerCase().includes('tom'))
|
||||
- type: javascript
|
||||
value: output.extractions.length >= 3
|
||||
|
||||
# ── triage ─────────────────────────────────────────────────────
|
||||
- description: 'triage: health brain signal for medical content'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('action')
|
||||
- type: javascript
|
||||
value: output.classes.includes('date_reference')
|
||||
- type: javascript
|
||||
value: output.classes.includes('emotion')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' &&
|
||||
e.attributes?.brain === 'health'
|
||||
)
|
||||
|
||||
- description: 'triage: work brain signal for project content'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: 'Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('action')
|
||||
- type: javascript
|
||||
value: output.classes.includes('date_reference')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' &&
|
||||
e.attributes?.brain === 'work'
|
||||
)
|
||||
|
||||
- description: 'triage: money brain signal for financial content'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: "I need to pay the credit card bill before the 15th or I'll get charged interest."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('action')
|
||||
- type: javascript
|
||||
value: output.classes.includes('date_reference')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' &&
|
||||
e.attributes?.brain === 'money'
|
||||
)
|
||||
|
||||
- description: 'triage: negative emotion detected'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: "Feeling really overwhelmed today. Too many things on my plate and I can't focus."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('emotion')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'emotion' &&
|
||||
e.attributes?.valence === 'negative'
|
||||
)
|
||||
|
||||
- description: 'triage: multiple brain signals for mixed content'
|
||||
vars:
|
||||
taskId: triage
|
||||
text: 'Doctor said I need to exercise more. Also need to check my 401k contributions before year end.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' && e.attributes?.brain === 'health'
|
||||
)
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'brain_signal' && e.attributes?.brain === 'money'
|
||||
)
|
||||
|
||||
# ── memory-insight ─────────────────────────────────────────────
|
||||
- description: 'memory-insight: detects recurring pattern'
|
||||
vars:
|
||||
taskId: memory-insight
|
||||
text: 'Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('pattern')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'pattern' &&
|
||||
e.attributes?.frequency === 'recurring'
|
||||
)
|
||||
|
||||
- description: 'memory-insight: detects relationship between items'
|
||||
vars:
|
||||
taskId: memory-insight
|
||||
text: "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('relationship')
|
||||
- type: javascript
|
||||
value: output.classes.includes('pattern')
|
||||
|
||||
- description: 'memory-insight: detects milestone'
|
||||
vars:
|
||||
taskId: memory-insight
|
||||
text: 'Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('milestone')
|
||||
- type: javascript
|
||||
value: output.extractions.length >= 2
|
||||
|
||||
- description: 'memory-insight: detects recurring theme across entries'
|
||||
vars:
|
||||
taskId: memory-insight
|
||||
text: 'Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('recurring_theme')
|
||||
- type: javascript
|
||||
value: output.extractions.length >= 1
|
||||
|
||||
# ── reflection-enrichment ──────────────────────────────────────
|
||||
- description: 'reflection: extracts accomplishment and concern'
|
||||
vars:
|
||||
taskId: reflection-enrichment
|
||||
text: "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('accomplishment')
|
||||
- type: javascript
|
||||
value: output.classes.includes('concern')
|
||||
- type: javascript
|
||||
value: output.classes.includes('emotional_state')
|
||||
|
||||
- description: 'reflection: positive emotional state detected'
|
||||
vars:
|
||||
taskId: reflection-enrichment
|
||||
text: 'Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('emotional_state')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'emotional_state' &&
|
||||
e.attributes?.valence === 'positive'
|
||||
)
|
||||
- type: javascript
|
||||
value: output.classes.includes('accomplishment')
|
||||
|
||||
- description: 'reflection: goal progress detected'
|
||||
vars:
|
||||
taskId: reflection-enrichment
|
||||
text: "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('goal_progress')
|
||||
- type: javascript
|
||||
value: output.classes.includes('accomplishment')
|
||||
|
||||
- description: 'reflection: mixed positive and negative signals'
|
||||
vars:
|
||||
taskId: reflection-enrichment
|
||||
text: "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up."
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('accomplishment')
|
||||
- type: javascript
|
||||
value: output.classes.includes('concern')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'emotional_state' &&
|
||||
e.attributes?.valence === 'positive'
|
||||
)
|
||||
|
||||
# ── bug-report-extraction ──────────────────────────────────────
|
||||
- description: 'bug-report: extracts all 5 fields'
|
||||
vars:
|
||||
taskId: bug-report-extraction
|
||||
text: 'When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('steps_to_reproduce')
|
||||
- type: javascript
|
||||
value: output.classes.includes('expected_behavior')
|
||||
- type: javascript
|
||||
value: output.classes.includes('actual_behavior')
|
||||
- type: javascript
|
||||
value: output.classes.includes('affected_component')
|
||||
- type: javascript
|
||||
value: output.classes.includes('severity')
|
||||
- type: javascript
|
||||
value: |
|
||||
output.extractions.some(e =>
|
||||
e.extraction_class === 'severity' &&
|
||||
e.attributes?.level === 'critical'
|
||||
)
|
||||
|
||||
- description: 'bug-report: extracts steps and component from login bug'
|
||||
vars:
|
||||
taskId: bug-report-extraction
|
||||
text: 'Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.'
|
||||
assert:
|
||||
- type: javascript
|
||||
value: output.classes.includes('steps_to_reproduce')
|
||||
- type: javascript
|
||||
value: output.classes.includes('expected_behavior')
|
||||
- type: javascript
|
||||
value: output.classes.includes('actual_behavior')
|
||||
- type: javascript
|
||||
value: output.classes.includes('affected_component')
|
||||
89
services/extraction-service/evals/run-evals.sh
Executable file
89
services/extraction-service/evals/run-evals.sh
Executable file
@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env bash
|
||||
# run-evals.sh — Run promptfoo evals against the extraction-service
|
||||
#
|
||||
# Usage:
|
||||
# ./evals/run-evals.sh # run all evals
|
||||
# ./evals/run-evals.sh --task triage # filter by task (grep on description)
|
||||
# ./evals/run-evals.sh --ci # CI mode: exit 1 on any failure
|
||||
# ./evals/run-evals.sh --output json # output results as JSON
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
SERVICE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
|
||||
|
||||
# ── Defaults ────────────────────────────────────────────────────
|
||||
EXTRACTION_SERVICE_URL="${EXTRACTION_SERVICE_URL:-http://localhost:4005}"
|
||||
EVAL_PRODUCT_ID="${EVAL_PRODUCT_ID:-lysnrai}"
|
||||
CI_MODE=false
|
||||
OUTPUT_FORMAT="text"
|
||||
TASK_FILTER=""
|
||||
|
||||
# ── Parse args ──────────────────────────────────────────────────
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--ci) CI_MODE=true; shift ;;
|
||||
--output) OUTPUT_FORMAT="$2"; shift 2 ;;
|
||||
--task) TASK_FILTER="$2"; shift 2 ;;
|
||||
*) echo "Unknown arg: $1"; exit 1 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
# ── Check service is reachable ───────────────────────────────────
|
||||
echo "→ Checking extraction-service at $EXTRACTION_SERVICE_URL ..."
|
||||
if ! curl -sf "$EXTRACTION_SERVICE_URL/health" > /dev/null 2>&1; then
|
||||
echo "✗ extraction-service is not running at $EXTRACTION_SERVICE_URL"
|
||||
echo " Start it with: pnpm dev (in services/extraction-service/)"
|
||||
exit 1
|
||||
fi
|
||||
echo "✓ Service is up"
|
||||
|
||||
# ── Check EXTRACTION_EVAL_TOKEN ──────────────────────────────────
|
||||
if [[ -z "${EXTRACTION_EVAL_TOKEN:-}" ]]; then
|
||||
echo "⚠ EXTRACTION_EVAL_TOKEN is not set — evals will fail auth"
|
||||
echo " Get a token from platform-service: POST /api/auth/login"
|
||||
echo " Then: export EXTRACTION_EVAL_TOKEN=<token>"
|
||||
if [[ "$CI_MODE" == "true" ]]; then
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# ── Build promptfoo args ─────────────────────────────────────────
|
||||
PROMPTFOO_ARGS=(
|
||||
eval
|
||||
--config "$SCRIPT_DIR/promptfoo.yaml"
|
||||
--output "$OUTPUT_FORMAT"
|
||||
--no-cache
|
||||
)
|
||||
|
||||
if [[ "$CI_MODE" == "true" ]]; then
|
||||
PROMPTFOO_ARGS+=(--no-progress-bar)
|
||||
fi
|
||||
|
||||
if [[ -n "$TASK_FILTER" ]]; then
|
||||
PROMPTFOO_ARGS+=(--filter-description "$TASK_FILTER")
|
||||
fi
|
||||
|
||||
# ── Run ─────────────────────────────────────────────────────────
|
||||
echo "→ Running evals (task: ${TASK_FILTER:-all}) ..."
|
||||
echo ""
|
||||
|
||||
export EXTRACTION_SERVICE_URL
|
||||
export EXTRACTION_EVAL_TOKEN
|
||||
export EVAL_PRODUCT_ID
|
||||
|
||||
cd "$SERVICE_DIR"
|
||||
npx promptfoo "${PROMPTFOO_ARGS[@]}"
|
||||
|
||||
EXIT_CODE=$?
|
||||
|
||||
if [[ $EXIT_CODE -eq 0 ]]; then
|
||||
echo ""
|
||||
echo "✓ All evals passed"
|
||||
else
|
||||
echo ""
|
||||
echo "✗ Some evals failed (exit $EXIT_CODE)"
|
||||
if [[ "$CI_MODE" == "true" ]]; then
|
||||
exit $EXIT_CODE
|
||||
fi
|
||||
fi
|
||||
Loading…
Reference in New Issue
Block a user