feat(extraction-service): scaffold promptfoo eval suite with 19 test cases

- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API
  covering all 5 built-in tasks (transcript, triage, memory-insight,
  reflection-enrichment, bug-report-extraction)
- Add evals/fixtures/golden.json: machine-readable golden input/output fixtures
- Add evals/run-evals.sh: shell runner with health checks, auth token
  handling, task filtering, and CI mode
- Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration
saravanakumardb1 2026-02-19 12:19:16 -08:00
parent 4a659bf107
commit acd4c3542b
4 changed files with 779 additions and 0 deletions

@ -0,0 +1,194 @@
# Extraction Service — LLM Evals
Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).
## Structure
```
evals/
├── promptfoo.yaml     # Main eval config: all test cases + assertions
├── fixtures/
│   └── golden.json    # Golden input/output fixtures (machine-readable)
├── run-evals.sh       # Shell runner with health check + auth guard
└── README.md          # This file
```
## Prerequisites
1. **extraction-service running** on port 4005:
   ```bash
   pnpm dev
   ```
2. **Auth token** from platform-service:
   ```bash
   # POST /api/auth/login → copy the token
   export EXTRACTION_EVAL_TOKEN=<your-jwt>
   ```
3. **promptfoo** (installed automatically via `npx`, or globally):
   ```bash
   npm install -g promptfoo
   ```
## Running Evals
```bash
# All tasks
pnpm eval
# Single task (the argument is matched against test descriptions in
# promptfoo.yaml, so use the description prefixes)
pnpm eval:task triage
pnpm eval:task transcript
pnpm eval:task memory-insight
pnpm eval:task reflection
pnpm eval:task bug-report
# CI mode (exits non-zero on any failure)
pnpm eval:ci
# JSON output (for dashboards / reporting)
pnpm eval:json
```
## What's Tested
| Task | Cases | Key Assertions |
| ----------------------- | ----- | ------------------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute |
**Total: 19 eval cases, 60+ assertions**
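
As a rough illustration of what these assertions check, the sketch below builds the object that the `transformResponse` mapping in `promptfoo.yaml` derives from an API response, then evaluates two assertion expressions against it. The sample response is invented for illustration; only the field names (`extraction_class`, `extraction_text`, `attributes`, `metadata.durationMs`) come from the config.

```javascript
// Hypothetical API response; the values are made up for illustration.
const sampleResponse = {
  extractions: [
    { extraction_class: "action", extraction_text: "call the dentist", attributes: {} },
    { extraction_class: "date_reference", extraction_text: "tomorrow", attributes: {} },
    { extraction_class: "brain_signal", extraction_text: "dentist", attributes: { brain: "health" } },
  ],
  metadata: { durationMs: 1200 },
};

// Mirrors the transformResponse mapping in promptfoo.yaml.
function toOutput(json) {
  return {
    extractions: json.extractions,
    classes: json.extractions.map((e) => e.extraction_class),
    texts: json.extractions.map((e) => e.extraction_text),
    durationMs: json.metadata?.durationMs,
  };
}

const output = toOutput(sampleResponse);

// Class-presence assertion, as used throughout the suite.
const hasAction = output.classes.includes("action");

// Attribute assertion: a brain_signal extraction routed to "health".
const hasHealthSignal = output.extractions.some(
  (e) => e.extraction_class === "brain_signal" && e.attributes?.brain === "health"
);

console.log({ hasAction, hasHealthSignal });
```

Running this with Node prints both flags, which makes it easy to try a new assertion expression against a captured response before adding it to the suite.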
## Adding New Cases
1. Add a test case to `promptfoo.yaml` under `tests:`:
```yaml
- description: 'triage: home brain signal for household content'
  vars:
    taskId: triage
    text: 'Need to fix the leaking faucet in the kitchen this weekend.'
  assert:
    - type: javascript
      value: output.classes.includes('action')
    - type: javascript
      # Multi-line values run as a function body, so they need an explicit return.
      value: |
        return output.extractions.some(e =>
          e.extraction_class === 'brain_signal' &&
          e.attributes?.brain === 'home'
        )
```
2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.
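
The fixtures in `fixtures/golden.json` could also drive a custom check. The following is a minimal sketch, assuming `mustContainText` means a case-insensitive substring match over the extracted texts; the function name `checkGoldenCase` and the sample model output are hypothetical, and task-specific fields such as `mustContainBrainSignal` are not handled here.

```javascript
// Checks one transformed eval output against one golden.json case.
// Only the three generic fields are handled in this sketch.
function checkGoldenCase(goldenCase, output) {
  const failures = [];
  const classes = new Set(output.extractions.map((e) => e.extraction_class));
  const allText = output.extractions
    .map((e) => e.extraction_text.toLowerCase())
    .join(" ");

  for (const cls of goldenCase.mustContainClasses ?? []) {
    if (!classes.has(cls)) failures.push(`missing class: ${cls}`);
  }
  for (const needle of goldenCase.mustContainText ?? []) {
    if (!allText.includes(needle.toLowerCase())) failures.push(`missing text: ${needle}`);
  }
  if (output.extractions.length < (goldenCase.minExtractions ?? 0)) {
    failures.push(`expected at least ${goldenCase.minExtractions} extractions`);
  }
  return failures;
}

// Case tc-003 from golden.json, paired with an invented model output.
const tc003 = {
  id: "tc-003",
  mustContainClasses: ["question", "person"],
  mustContainText: ["bob"],
  minExtractions: 2,
};
const exampleOutput = {
  extractions: [
    { extraction_class: "person", extraction_text: "Bob" },
    { extraction_class: "question", extraction_text: "should we use Postgres or Cosmos DB" },
  ],
};

console.log(checkGoldenCase(tc003, exampleOutput)); // prints []
```

An empty array means the case passed; each string in a non-empty result names one unmet expectation.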
## Local OSS Models (Ollama)
Run the same 19 eval cases against a local model — zero API cost, fully private.
### Install Ollama
```bash
# 1. Install
brew install ollama
# 2. Start server (keep this terminal open)
ollama serve
# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b
# If corp proxy blocks it, try bypassing:
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b
# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
```
### Run Ollama Evals
```bash
# Default model: llama3.1:8b
pnpm eval:ollama
# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
```
> **Note:** Ollama evals hit the model directly (no extraction-service needed).
> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.
### Compare Gemini vs Ollama
Runs both suites and prints a side-by-side pass-rate table:
```bash
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare
# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
```
Example output:
```
Provider Passed Total Rate Progress
───────────────────────────────────────────────────────
Gemini 57 60 95% █████████░
Ollama 48 60 80% ████████░░
Per-task breakdown:
Task Gemini Ollama
──────────────────────────────────────────────────
triage 20/20 (100%) 16/20 (80%)
transcript-extraction 15/16 (94%) 12/16 (75%)
reflection-enrichment 12/12 (100%) 10/12 (83%)
memory-insight 8/8 (100%) 7/8 (88%)
bug-report-extraction 2/4 (50%) 3/4 (75%)
```
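
The summary row can be reproduced from the per-task counts. This is a sketch only: the compare script itself is not shown here, so the rounding rule for the bar (one filled block per full 10%) and the helper names `passRate` and `totals` are assumptions.

```javascript
// Computes the percentage and a 10-segment bar for one provider row.
// Assumption: the bar shows one filled block per full 10% (rounded down).
function passRate(passed, total) {
  if (total === 0) return { percent: 0, bar: "░".repeat(10) };
  const percent = Math.round((passed * 100) / total);
  const filled = Math.floor((passed * 10) / total);
  return { percent, bar: "█".repeat(filled) + "░".repeat(10 - filled) };
}

// Sums [passed, total] pairs across tasks.
function totals(perTask) {
  return Object.values(perTask).reduce(
    ([p, t], [taskPassed, taskTotal]) => [p + taskPassed, t + taskTotal],
    [0, 0]
  );
}

// Per-task Gemini counts taken from the example output above.
const gemini = {
  triage: [20, 20],
  "transcript-extraction": [15, 16],
  "reflection-enrichment": [12, 12],
  "memory-insight": [8, 8],
  "bug-report-extraction": [2, 4],
};

const [passed, total] = totals(gemini);
const { percent, bar } = passRate(passed, total);
console.log(`Gemini ${passed} ${total} ${percent}% ${bar}`); // Gemini 57 60 95% █████████░
```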
### Supported Local Models
| Model | Pull command | RAM needed | Notes |
| -------------- | -------------------------- | ---------- | --------------------------- |
| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default |
| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output |
| `phi4:14b` | `ollama pull phi4:14b` | ~9GB | Good reasoning |
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ |
## CI Integration
Add to your GitHub Actions workflow:
```yaml
- name: Run extraction evals
  env:
    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
    EXTRACTION_SERVICE_URL: http://localhost:4005
  run: pnpm --filter @lysnrai/extraction-service eval:ci
```
The service must be started before running evals in CI.
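
One way to make sure the service is up before the eval step is to poll the same `/health` endpoint that `run-evals.sh` checks. The helper below is a hypothetical sketch, not part of this suite: the name `waitForHealth`, its options and their defaults are assumptions, and the default `fetch` requires Node 18 or later.

```javascript
// Polls the health endpoint until it answers with a 2xx status or attempts run out.
// fetchFn is injectable so the loop can be exercised without a live service.
async function waitForHealth(baseUrl, { attempts = 30, delayMs = 1000, fetchFn } = {}) {
  const doFetch = fetchFn ?? fetch; // global fetch, available in Node 18+
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await doFetch(`${baseUrl}/health`);
      if (res.ok) return true;
    } catch {
      // Connection refused while the service is still starting; retry.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

A CI step could call `await waitForHealth(process.env.EXTRACTION_SERVICE_URL ?? "http://localhost:4005")` and exit non-zero on `false` before invoking `pnpm eval:ci`.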
## Interpreting Results
promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.
Common failure patterns:
- **Missing class**: the model didn't extract that entity type; consider adding more examples to the task seed
- **Wrong attribute**: `brain_signal.brain` or `emotion.valence` is incorrect; refine the task prompt in `seed.ts`
- **Latency > 15s**: the sidecar is overloaded or the model is cold-starting; check the Python sidecar logs
## Thresholds
- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
- **Pass rate target:** 100% for golden cases (the inputs are unambiguous enough that every assertion should pass on every run)
- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics

@ -0,0 +1,177 @@
{
  "description": "Golden fixtures for extraction-service evals. Each entry defines the minimum expected extraction classes for a given input. Used by the promptfoo eval suite and can also be consumed by custom assertion scripts.",
  "version": "1.0.0",
  "tasks": {
    "transcript-extraction": {
      "expectedClasses": ["action_item", "decision", "question", "deadline", "person", "topic"],
      "cases": [
        {
          "id": "tc-001",
          "input": "John said we need to ship the feature by Friday. Sarah agreed to handle the testing.",
          "mustContainClasses": ["action_item", "deadline", "person"],
          "mustContainText": ["friday", "sarah", "john"],
          "minExtractions": 3
        },
        {
          "id": "tc-002",
          "input": "The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.",
          "mustContainClasses": ["decision", "action_item", "person", "deadline"],
          "mustContainText": ["q3", "alice"],
          "minExtractions": 3
        },
        {
          "id": "tc-003",
          "input": "Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.",
          "mustContainClasses": ["question", "person"],
          "mustContainText": ["bob"],
          "minExtractions": 2
        },
        {
          "id": "tc-004",
          "input": "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?",
          "mustContainClasses": ["action_item", "person"],
          "mustContainText": ["maria", "tom"],
          "minExtractions": 3
        }
      ]
    },
    "triage": {
      "expectedClasses": ["topic", "entity", "action", "emotion", "date_reference", "brain_signal"],
      "cases": [
        {
          "id": "tr-001",
          "input": "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost.",
          "mustContainClasses": ["action", "date_reference", "emotion"],
          "mustContainBrainSignal": { "brain": "health", "minConfidence": 0.5 },
          "minExtractions": 3
        },
        {
          "id": "tr-002",
          "input": "Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.",
          "mustContainClasses": ["action", "date_reference"],
          "mustContainBrainSignal": { "brain": "work", "minConfidence": 0.5 },
          "minExtractions": 2
        },
        {
          "id": "tr-003",
          "input": "I need to pay the credit card bill before the 15th or I'll get charged interest.",
          "mustContainClasses": ["action", "date_reference"],
          "mustContainBrainSignal": { "brain": "money", "minConfidence": 0.5 },
          "minExtractions": 2
        },
        {
          "id": "tr-004",
          "input": "Feeling really overwhelmed today. Too many things on my plate and I can't focus.",
          "mustContainClasses": ["emotion"],
          "mustContainEmotionValence": "negative",
          "minExtractions": 1
        },
        {
          "id": "tr-005",
          "input": "Doctor said I need to exercise more. Also need to check my 401k contributions before year end.",
          "mustContainClasses": ["action"],
          "mustContainBrainSignals": [
            { "brain": "health", "minConfidence": 0.4 },
            { "brain": "money", "minConfidence": 0.4 }
          ],
          "minExtractions": 2
        }
      ]
    },
    "memory-insight": {
      "expectedClasses": ["pattern", "recurring_theme", "relationship", "milestone"],
      "cases": [
        {
          "id": "mi-001",
          "input": "Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.",
          "mustContainClasses": ["pattern"],
          "mustContainPatternFrequency": "recurring",
          "minExtractions": 1
        },
        {
          "id": "mi-002",
          "input": "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop.",
          "mustContainClasses": ["pattern", "relationship"],
          "minExtractions": 2
        },
        {
          "id": "mi-003",
          "input": "Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.",
          "mustContainClasses": ["milestone"],
          "minExtractions": 2
        },
        {
          "id": "mi-004",
          "input": "Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.",
          "mustContainClasses": ["recurring_theme"],
          "minExtractions": 1
        }
      ]
    },
    "reflection-enrichment": {
      "expectedClasses": ["emotional_state", "accomplishment", "concern", "goal_progress"],
      "cases": [
        {
          "id": "re-001",
          "input": "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week.",
          "mustContainClasses": ["accomplishment", "concern", "emotional_state"],
          "minExtractions": 3
        },
        {
          "id": "re-002",
          "input": "Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.",
          "mustContainClasses": ["emotional_state", "accomplishment"],
          "mustContainEmotionValence": "positive",
          "minExtractions": 2
        },
        {
          "id": "re-003",
          "input": "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month.",
          "mustContainClasses": ["goal_progress", "accomplishment"],
          "minExtractions": 2
        },
        {
          "id": "re-004",
          "input": "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up.",
          "mustContainClasses": ["accomplishment", "concern"],
          "minExtractions": 2
        }
      ]
    },
    "bug-report-extraction": {
      "expectedClasses": [
        "steps_to_reproduce",
        "expected_behavior",
        "actual_behavior",
        "affected_component",
        "severity"
      ],
      "cases": [
        {
          "id": "br-001",
          "input": "When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.",
          "mustContainClasses": [
            "steps_to_reproduce",
            "expected_behavior",
            "actual_behavior",
            "affected_component",
            "severity"
          ],
          "mustContainSeverityLevel": "critical",
          "minExtractions": 4
        },
        {
          "id": "br-002",
          "input": "Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.",
          "mustContainClasses": [
            "steps_to_reproduce",
            "expected_behavior",
            "actual_behavior",
            "affected_component"
          ],
          "minExtractions": 3
        }
      ]
    }
  }
}

@ -0,0 +1,319 @@
# promptfoo eval config for extraction-service
# Docs: https://promptfoo.dev/docs/configuration/guide
#
# Usage:
# pnpm eval # run all evals (service must be running on port 4005)
# pnpm eval:task triage # run a single task suite
# pnpm eval:ci # CI mode — fail on any assertion failure
#
# Prerequisites:
# 1. extraction-service running: pnpm dev (port 4005)
# 2. EXTRACTION_EVAL_TOKEN set in env (any valid JWT from platform-service)
description: Extraction Service — LLM Output Quality Evals
# ── Provider: the extraction-service HTTP API ────────────────────
providers:
  - id: http
    config:
      url: "{{env.EXTRACTION_SERVICE_URL | default('http://localhost:4005')}}/api/extract"
      method: POST
      headers:
        Content-Type: application/json
        Authorization: 'Bearer {{env.EXTRACTION_EVAL_TOKEN}}'
      body:
        text: '{{text}}'
        taskId: '{{taskId}}'
        productId: "{{env.EVAL_PRODUCT_ID | default('lysnrai')}}"
      # Written as an object-literal expression: a string transformResponse is
      # evaluated as a JavaScript expression, so a bare return statement may not parse.
      transformResponse: |
        ({
          extractions: json.extractions,
          classes: json.extractions.map(e => e.extraction_class),
          texts: json.extractions.map(e => e.extraction_text),
          durationMs: json.metadata?.durationMs,
        })
# ── Default assertion thresholds ────────────────────────────────
defaultTest:
  options:
    timeoutMs: 30000
  assert:
    - type: latency
      threshold: 15000
# ── Test suites per task ─────────────────────────────────────────
tests:
  # ── transcript-extraction ──────────────────────────────────────
  - description: 'transcript: extracts action item and deadline'
    vars:
      taskId: transcript-extraction
      text: 'John said we need to ship the feature by Friday. Sarah agreed to handle the testing.'
    assert:
      - type: javascript
        value: output.classes.includes('action_item')
      - type: javascript
        value: output.classes.includes('deadline')
      - type: javascript
        value: output.classes.includes('person')
      - type: javascript
        value: output.texts.some(t => t.toLowerCase().includes('friday') || t.toLowerCase().includes('ship'))
  - description: 'transcript: extracts decision from meeting note'
    vars:
      taskId: transcript-extraction
      text: 'The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.'
    assert:
      - type: javascript
        value: output.classes.includes('decision')
      - type: javascript
        value: output.classes.includes('action_item')
      - type: javascript
        value: output.classes.includes('person')
      - type: javascript
        value: output.classes.includes('deadline')
  - description: 'transcript: extracts question from discussion'
    vars:
      taskId: transcript-extraction
      text: 'Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.'
    assert:
      - type: javascript
        value: output.classes.includes('question')
      - type: javascript
        value: output.classes.includes('person')
  - description: 'transcript: handles multi-person transcript'
    vars:
      taskId: transcript-extraction
      text: "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?"
    assert:
      - type: javascript
        value: output.classes.includes('action_item')
      - type: javascript
        value: output.classes.includes('person')
      - type: javascript
        value: output.texts.some(t => t.toLowerCase().includes('maria') || t.toLowerCase().includes('tom'))
      - type: javascript
        value: output.extractions.length >= 3
  # ── triage ─────────────────────────────────────────────────────
  - description: 'triage: health brain signal for medical content'
    vars:
      taskId: triage
      text: "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost."
    assert:
      - type: javascript
        value: output.classes.includes('action')
      - type: javascript
        value: output.classes.includes('date_reference')
      - type: javascript
        value: output.classes.includes('emotion')
      - type: javascript
        # Multi-line values run as a function body, so they need an explicit return.
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'brain_signal' &&
            e.attributes?.brain === 'health'
          )
  - description: 'triage: work brain signal for project content'
    vars:
      taskId: triage
      text: 'Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.'
    assert:
      - type: javascript
        value: output.classes.includes('action')
      - type: javascript
        value: output.classes.includes('date_reference')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'brain_signal' &&
            e.attributes?.brain === 'work'
          )
  - description: 'triage: money brain signal for financial content'
    vars:
      taskId: triage
      text: "I need to pay the credit card bill before the 15th or I'll get charged interest."
    assert:
      - type: javascript
        value: output.classes.includes('action')
      - type: javascript
        value: output.classes.includes('date_reference')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'brain_signal' &&
            e.attributes?.brain === 'money'
          )
  - description: 'triage: negative emotion detected'
    vars:
      taskId: triage
      text: "Feeling really overwhelmed today. Too many things on my plate and I can't focus."
    assert:
      - type: javascript
        value: output.classes.includes('emotion')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'emotion' &&
            e.attributes?.valence === 'negative'
          )
  - description: 'triage: multiple brain signals for mixed content'
    vars:
      taskId: triage
      text: 'Doctor said I need to exercise more. Also need to check my 401k contributions before year end.'
    assert:
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'brain_signal' && e.attributes?.brain === 'health'
          )
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'brain_signal' && e.attributes?.brain === 'money'
          )
  # ── memory-insight ─────────────────────────────────────────────
  - description: 'memory-insight: detects recurring pattern'
    vars:
      taskId: memory-insight
      text: 'Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.'
    assert:
      - type: javascript
        value: output.classes.includes('pattern')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'pattern' &&
            e.attributes?.frequency === 'recurring'
          )
  - description: 'memory-insight: detects relationship between items'
    vars:
      taskId: memory-insight
      text: "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop."
    assert:
      - type: javascript
        value: output.classes.includes('relationship')
      - type: javascript
        value: output.classes.includes('pattern')
  - description: 'memory-insight: detects milestone'
    vars:
      taskId: memory-insight
      text: 'Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.'
    assert:
      - type: javascript
        value: output.classes.includes('milestone')
      - type: javascript
        value: output.extractions.length >= 2
  - description: 'memory-insight: detects recurring theme across entries'
    vars:
      taskId: memory-insight
      text: 'Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.'
    assert:
      - type: javascript
        value: output.classes.includes('recurring_theme')
      - type: javascript
        value: output.extractions.length >= 1
  # ── reflection-enrichment ──────────────────────────────────────
  - description: 'reflection: extracts accomplishment and concern'
    vars:
      taskId: reflection-enrichment
      text: "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week."
    assert:
      - type: javascript
        value: output.classes.includes('accomplishment')
      - type: javascript
        value: output.classes.includes('concern')
      - type: javascript
        value: output.classes.includes('emotional_state')
  - description: 'reflection: positive emotional state detected'
    vars:
      taskId: reflection-enrichment
      text: 'Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.'
    assert:
      - type: javascript
        value: output.classes.includes('emotional_state')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'emotional_state' &&
            e.attributes?.valence === 'positive'
          )
      - type: javascript
        value: output.classes.includes('accomplishment')
  - description: 'reflection: goal progress detected'
    vars:
      taskId: reflection-enrichment
      text: "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month."
    assert:
      - type: javascript
        value: output.classes.includes('goal_progress')
      - type: javascript
        value: output.classes.includes('accomplishment')
  - description: 'reflection: mixed positive and negative signals'
    vars:
      taskId: reflection-enrichment
      text: "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up."
    assert:
      - type: javascript
        value: output.classes.includes('accomplishment')
      - type: javascript
        value: output.classes.includes('concern')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'emotional_state' &&
            e.attributes?.valence === 'positive'
          )
  # ── bug-report-extraction ──────────────────────────────────────
  - description: 'bug-report: extracts all 5 fields'
    vars:
      taskId: bug-report-extraction
      text: 'When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.'
    assert:
      - type: javascript
        value: output.classes.includes('steps_to_reproduce')
      - type: javascript
        value: output.classes.includes('expected_behavior')
      - type: javascript
        value: output.classes.includes('actual_behavior')
      - type: javascript
        value: output.classes.includes('affected_component')
      - type: javascript
        value: output.classes.includes('severity')
      - type: javascript
        value: |
          return output.extractions.some(e =>
            e.extraction_class === 'severity' &&
            e.attributes?.level === 'critical'
          )
  - description: 'bug-report: extracts steps and component from login bug'
    vars:
      taskId: bug-report-extraction
      text: 'Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.'
    assert:
      - type: javascript
        value: output.classes.includes('steps_to_reproduce')
      - type: javascript
        value: output.classes.includes('expected_behavior')
      - type: javascript
        value: output.classes.includes('actual_behavior')
      - type: javascript
        value: output.classes.includes('affected_component')

@ -0,0 +1,89 @@
#!/usr/bin/env bash
# run-evals.sh — Run promptfoo evals against the extraction-service
#
# Usage:
# ./evals/run-evals.sh # run all evals
# ./evals/run-evals.sh --task triage # filter by task (grep on description)
# ./evals/run-evals.sh --ci # CI mode: exit 1 on any failure
# ./evals/run-evals.sh --output json # output results as JSON
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SERVICE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
# ── Defaults ────────────────────────────────────────────────────
EXTRACTION_SERVICE_URL="${EXTRACTION_SERVICE_URL:-http://localhost:4005}"
EVAL_PRODUCT_ID="${EVAL_PRODUCT_ID:-lysnrai}"
CI_MODE=false
OUTPUT_FORMAT="text"
TASK_FILTER=""
# ── Parse args ──────────────────────────────────────────────────
while [[ $# -gt 0 ]]; do
  case "$1" in
    --ci) CI_MODE=true; shift ;;
    --output) OUTPUT_FORMAT="$2"; shift 2 ;;
    --task) TASK_FILTER="$2"; shift 2 ;;
    *) echo "Unknown arg: $1"; exit 1 ;;
  esac
done
# ── Check service is reachable ───────────────────────────────────
echo "→ Checking extraction-service at $EXTRACTION_SERVICE_URL ..."
if ! curl -sf "$EXTRACTION_SERVICE_URL/health" > /dev/null 2>&1; then
  echo "✗ extraction-service is not running at $EXTRACTION_SERVICE_URL"
  echo " Start it with: pnpm dev (in services/extraction-service/)"
  exit 1
fi
echo "✓ Service is up"
# ── Check EXTRACTION_EVAL_TOKEN ──────────────────────────────────
if [[ -z "${EXTRACTION_EVAL_TOKEN:-}" ]]; then
  echo "⚠ EXTRACTION_EVAL_TOKEN is not set — evals will fail auth"
  echo " Get a token from platform-service: POST /api/auth/login"
  echo " Then: export EXTRACTION_EVAL_TOKEN=<token>"
  if [[ "$CI_MODE" == "true" ]]; then
    exit 1
  fi
fi
# ── Build promptfoo args ─────────────────────────────────────────
PROMPTFOO_ARGS=(
  eval
  --config "$SCRIPT_DIR/promptfoo.yaml"
  --output "$OUTPUT_FORMAT"
  --no-cache
)
if [[ "$CI_MODE" == "true" ]]; then
  PROMPTFOO_ARGS+=(--no-progress-bar)
fi
if [[ -n "$TASK_FILTER" ]]; then
  PROMPTFOO_ARGS+=(--filter-description "$TASK_FILTER")
fi
# ── Run ─────────────────────────────────────────────────────────
echo "→ Running evals (task: ${TASK_FILTER:-all}) ..."
echo ""
export EXTRACTION_SERVICE_URL
export EXTRACTION_EVAL_TOKEN
export EVAL_PRODUCT_ID
cd "$SERVICE_DIR"
# Capture the exit code without tripping `set -e`, so the summary below still prints.
EXIT_CODE=0
npx promptfoo "${PROMPTFOO_ARGS[@]}" || EXIT_CODE=$?
if [[ $EXIT_CODE -eq 0 ]]; then
  echo ""
  echo "✓ All evals passed"
else
  echo ""
  echo "✗ Some evals failed (exit $EXIT_CODE)"
  if [[ "$CI_MODE" == "true" ]]; then
    exit $EXIT_CODE
  fi
fi