diff --git a/services/extraction-service/evals/README.md b/services/extraction-service/evals/README.md new file mode 100644 index 00000000..9944de87 --- /dev/null +++ b/services/extraction-service/evals/README.md @@ -0,0 +1,194 @@ +# Extraction Service — LLM Evals + +Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev). + +## Structure + +``` +evals/ +├── promptfoo.yaml # Main eval config — all test cases + assertions +├── fixtures/ +│ └── golden.json # Golden input/output fixtures (machine-readable) +├── run-evals.sh # Shell runner with health check + auth guard +└── README.md # This file +``` + +## Prerequisites + +1. **extraction-service running** on port 4005: + + ```bash + pnpm dev + ``` + +2. **Auth token** from platform-service: + + ```bash + # POST /api/auth/login → copy the token + export EXTRACTION_EVAL_TOKEN= + ``` + +3. **promptfoo** (installed automatically via `npx`, or globally): + ```bash + npm install -g promptfoo + ``` + +## Running Evals + +```bash +# All tasks +pnpm eval + +# Single task +pnpm eval:task triage +pnpm eval:task transcript-extraction +pnpm eval:task memory-insight +pnpm eval:task reflection-enrichment +pnpm eval:task bug-report-extraction + +# CI mode (exits non-zero on any failure) +pnpm eval:ci + +# JSON output (for dashboards / reporting) +pnpm eval:json +``` + +## What's Tested + +| Task | Cases | Key Assertions | +| ----------------------- | ----- | ------------------------------------------------------------------------- | +| `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present | +| `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference | +| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme | +| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress | +| `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute | + +**Total: 19 eval cases, 60+ assertions** + +## Adding New Cases + +1. Add a test case to `promptfoo.yaml` under `tests:`: + + ```yaml + - description: 'triage: home brain signal for household content' + vars: + taskId: triage + text: 'Need to fix the leaking faucet in the kitchen this weekend.' + assert: + - type: javascript + value: output.classes.includes('action') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && + e.attributes?.brain === 'home' + ) + ``` + +2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking. + +## Local OSS Models (Ollama) + +Run the same 19 eval cases against a local model — zero API cost, fully private. + +### Install Ollama + +```bash +# 1. Install +brew install ollama + +# 2. Start server (keep this terminal open) +ollama serve + +# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy) +ollama pull llama3.1:8b + +# If corp proxy blocks it, try bypassing: +HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b + +# 4. Verify +ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap +``` + +### Run Ollama Evals + +```bash +# Default model: llama3.1:8b +pnpm eval:ollama + +# Different model (once pulled) +OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama +OLLAMA_MODEL=phi4:14b pnpm eval:ollama +``` + +> **Note:** Ollama evals hit the model directly (no extraction-service needed). +> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower. + +### Compare Gemini vs Ollama + +Runs both suites and prints a side-by-side pass-rate table: + +```bash +# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running +pnpm eval:compare + +# With a different local model +OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare +``` + +Example output: + +``` + Provider Passed Total Rate Progress + ─────────────────────────────────────────────────────── + Gemini 57 60 95% █████████░ + Ollama 48 60 80% ████████░░ + + Per-task breakdown: + Task Gemini Ollama + ────────────────────────────────────────────────── + triage 20/20 (100%) 16/20 (80%) + transcript-extraction 15/16 (94%) 12/16 (75%) + reflection-enrichment 12/12 (100%) 10/12 (83%) + memory-insight 8/8 (100%) 7/8 (88%) + bug-report-extraction 2/4 (50%) 3/4 (75%) +``` + +### Supported Local Models + +| Model | Pull command | RAM needed | Notes | +| -------------- | -------------------------- | ---------- | --------------------------- | +| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default | +| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output | +| `phi4:14b` | `ollama pull phi4` | ~9GB | Good reasoning | +| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ | + +## CI Integration + +Add to your GitHub Actions workflow: + +```yaml +- name: Run extraction evals + env: + EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }} + EXTRACTION_SERVICE_URL: http://localhost:4005 + run: pnpm --filter @lysnrai/extraction-service eval:ci +``` + +The service must be started before running evals in CI. + +## Interpreting Results + +promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails. + +Common failure patterns: + +- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed +- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts` +- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs + +## Thresholds + +- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`) +- **Pass rate target:** 100% for golden cases (these are deterministic enough inputs) +- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics diff --git a/services/extraction-service/evals/fixtures/golden.json b/services/extraction-service/evals/fixtures/golden.json new file mode 100644 index 00000000..40c146c8 --- /dev/null +++ b/services/extraction-service/evals/fixtures/golden.json @@ -0,0 +1,177 @@ +{ + "description": "Golden fixtures for extraction-service evals. Each entry defines the minimum expected extraction classes for a given input. Used by the promptfoo eval suite and can also be consumed by custom assertion scripts.", + "version": "1.0.0", + "tasks": { + "transcript-extraction": { + "expectedClasses": ["action_item", "decision", "question", "deadline", "person", "topic"], + "cases": [ + { + "id": "tc-001", + "input": "John said we need to ship the feature by Friday. Sarah agreed to handle the testing.", + "mustContainClasses": ["action_item", "deadline", "person"], + "mustContainText": ["friday", "sarah", "john"], + "minExtractions": 3 + }, + { + "id": "tc-002", + "input": "The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.", + "mustContainClasses": ["decision", "action_item", "person", "deadline"], + "mustContainText": ["q3", "alice"], + "minExtractions": 3 + }, + { + "id": "tc-003", + "input": "Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.", + "mustContainClasses": ["question", "person"], + "mustContainText": ["bob"], + "minExtractions": 2 + }, + { + "id": "tc-004", + "input": "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?", + "mustContainClasses": ["action_item", "person"], + "mustContainText": ["maria", "tom"], + "minExtractions": 3 + } + ] + }, + "triage": { + "expectedClasses": ["topic", "entity", "action", "emotion", "date_reference", "brain_signal"], + "cases": [ + { + "id": "tr-001", + "input": "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost.", + "mustContainClasses": ["action", "date_reference", "emotion"], + "mustContainBrainSignal": { "brain": "health", "minConfidence": 0.5 }, + "minExtractions": 3 + }, + { + "id": "tr-002", + "input": "Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.", + "mustContainClasses": ["action", "date_reference"], + "mustContainBrainSignal": { "brain": "work", "minConfidence": 0.5 }, + "minExtractions": 2 + }, + { + "id": "tr-003", + "input": "I need to pay the credit card bill before the 15th or I'll get charged interest.", + "mustContainClasses": ["action", "date_reference"], + "mustContainBrainSignal": { "brain": "money", "minConfidence": 0.5 }, + "minExtractions": 2 + }, + { + "id": "tr-004", + "input": "Feeling really overwhelmed today. Too many things on my plate and I can't focus.", + "mustContainClasses": ["emotion"], + "mustContainEmotionValence": "negative", + "minExtractions": 1 + }, + { + "id": "tr-005", + "input": "Doctor said I need to exercise more. Also need to check my 401k contributions before year end.", + "mustContainClasses": ["action"], + "mustContainBrainSignals": [ + { "brain": "health", "minConfidence": 0.4 }, + { "brain": "money", "minConfidence": 0.4 } + ], + "minExtractions": 2 + } + ] + }, + "memory-insight": { + "expectedClasses": ["pattern", "recurring_theme", "relationship", "milestone"], + "cases": [ + { + "id": "mi-001", + "input": "Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.", + "mustContainClasses": ["pattern"], + "mustContainPatternFrequency": "recurring", + "minExtractions": 1 + }, + { + "id": "mi-002", + "input": "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop.", + "mustContainClasses": ["pattern", "relationship"], + "minExtractions": 2 + }, + { + "id": "mi-003", + "input": "Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.", + "mustContainClasses": ["milestone"], + "minExtractions": 2 + }, + { + "id": "mi-004", + "input": "Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.", + "mustContainClasses": ["recurring_theme"], + "minExtractions": 1 + } + ] + }, + "reflection-enrichment": { + "expectedClasses": ["emotional_state", "accomplishment", "concern", "goal_progress"], + "cases": [ + { + "id": "re-001", + "input": "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week.", + "mustContainClasses": ["accomplishment", "concern", "emotional_state"], + "minExtractions": 3 + }, + { + "id": "re-002", + "input": "Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.", + "mustContainClasses": ["emotional_state", "accomplishment"], + "mustContainEmotionValence": "positive", + "minExtractions": 2 + }, + { + "id": "re-003", + "input": "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month.", + "mustContainClasses": ["goal_progress", "accomplishment"], + "minExtractions": 2 + }, + { + "id": "re-004", + "input": "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up.", + "mustContainClasses": ["accomplishment", "concern"], + "minExtractions": 2 + } + ] + }, + "bug-report-extraction": { + "expectedClasses": [ + "steps_to_reproduce", + "expected_behavior", + "actual_behavior", + "affected_component", + "severity" + ], + "cases": [ + { + "id": "br-001", + "input": "When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.", + "mustContainClasses": [ + "steps_to_reproduce", + "expected_behavior", + "actual_behavior", + "affected_component", + "severity" + ], + "mustContainSeverityLevel": "critical", + "minExtractions": 4 + }, + { + "id": "br-002", + "input": "Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.", + "mustContainClasses": [ + "steps_to_reproduce", + "expected_behavior", + "actual_behavior", + "affected_component" + ], + "minExtractions": 3 + } + ] + } + } +} diff --git a/services/extraction-service/evals/promptfoo.yaml b/services/extraction-service/evals/promptfoo.yaml new file mode 100644 index 00000000..07936605 --- /dev/null +++ b/services/extraction-service/evals/promptfoo.yaml @@ -0,0 +1,319 @@ +# promptfoo eval config for extraction-service +# Docs: https://promptfoo.dev/docs/configuration/guide +# +# Usage: +# pnpm eval # run all evals (service must be running on port 4005) +# pnpm eval:task triage # run a single task suite +# pnpm eval:ci # CI mode — fail on any assertion failure +# +# Prerequisites: +# 1. extraction-service running: pnpm dev (port 4005) +# 2. EXTRACTION_EVAL_TOKEN set in env (any valid JWT from platform-service) + +description: Extraction Service — LLM Output Quality Evals + +# ── Provider: the extraction-service HTTP API ──────────────────── +providers: + - id: http + config: + url: "{{env.EXTRACTION_SERVICE_URL | default('http://localhost:4005')}}/api/extract" + method: POST + headers: + Content-Type: application/json + Authorization: 'Bearer {{env.EXTRACTION_EVAL_TOKEN}}' + body: + text: '{{text}}' + taskId: '{{taskId}}' + productId: "{{env.EVAL_PRODUCT_ID | default('lysnrai')}}" + transformResponse: | + return { + extractions: json.extractions, + classes: json.extractions.map(e => e.extraction_class), + texts: json.extractions.map(e => e.extraction_text), + durationMs: json.metadata?.durationMs, + }; + +# ── Default assertion thresholds ──────────────────────────────── +defaultTest: + options: + timeoutMs: 30000 + assert: + - type: latency + threshold: 15000 + +# ── Test suites per task ───────────────────────────────────────── +tests: + # ── transcript-extraction ────────────────────────────────────── + - description: 'transcript: extracts action item and deadline' + vars: + taskId: transcript-extraction + text: 'John said we need to ship the feature by Friday. Sarah agreed to handle the testing.' + assert: + - type: javascript + value: | + output.classes.includes('action_item') + - type: javascript + value: | + output.classes.includes('deadline') + - type: javascript + value: | + output.classes.includes('person') + - type: javascript + value: | + output.texts.some(t => t.toLowerCase().includes('friday') || t.toLowerCase().includes('ship')) + + - description: 'transcript: extracts decision from meeting note' + vars: + taskId: transcript-extraction + text: 'The team decided to postpone the launch to Q3. Alice will notify all stakeholders by Monday.' + assert: + - type: javascript + value: output.classes.includes('decision') + - type: javascript + value: output.classes.includes('action_item') + - type: javascript + value: output.classes.includes('person') + - type: javascript + value: output.classes.includes('deadline') + + - description: 'transcript: extracts question from discussion' + vars: + taskId: transcript-extraction + text: 'Bob asked: should we use Postgres or Cosmos DB for the new service? No decision was made.' + assert: + - type: javascript + value: output.classes.includes('question') + - type: javascript + value: output.classes.includes('person') + + - description: 'transcript: handles multi-person transcript' + vars: + taskId: transcript-extraction + text: "Maria: I finished the design mockups. Tom: Great, I'll review them by EOD. Maria: Can you also check the mobile screens?" + assert: + - type: javascript + value: output.classes.includes('action_item') + - type: javascript + value: output.classes.includes('person') + - type: javascript + value: output.texts.some(t => t.toLowerCase().includes('maria') || t.toLowerCase().includes('tom')) + - type: javascript + value: output.extractions.length >= 3 + + # ── triage ───────────────────────────────────────────────────── + - description: 'triage: health brain signal for medical content' + vars: + taskId: triage + text: "Remind me to call the dentist tomorrow about my appointment. I'm stressed about the cost." + assert: + - type: javascript + value: output.classes.includes('action') + - type: javascript + value: output.classes.includes('date_reference') + - type: javascript + value: output.classes.includes('emotion') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && + e.attributes?.brain === 'health' + ) + + - description: 'triage: work brain signal for project content' + vars: + taskId: triage + text: 'Need to finish the Q1 report for my manager by end of week. The presentation is on Thursday.' + assert: + - type: javascript + value: output.classes.includes('action') + - type: javascript + value: output.classes.includes('date_reference') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && + e.attributes?.brain === 'work' + ) + + - description: 'triage: money brain signal for financial content' + vars: + taskId: triage + text: "I need to pay the credit card bill before the 15th or I'll get charged interest." + assert: + - type: javascript + value: output.classes.includes('action') + - type: javascript + value: output.classes.includes('date_reference') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && + e.attributes?.brain === 'money' + ) + + - description: 'triage: negative emotion detected' + vars: + taskId: triage + text: "Feeling really overwhelmed today. Too many things on my plate and I can't focus." + assert: + - type: javascript + value: output.classes.includes('emotion') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'emotion' && + e.attributes?.valence === 'negative' + ) + + - description: 'triage: multiple brain signals for mixed content' + vars: + taskId: triage + text: 'Doctor said I need to exercise more. Also need to check my 401k contributions before year end.' + assert: + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && e.attributes?.brain === 'health' + ) + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'brain_signal' && e.attributes?.brain === 'money' + ) + + # ── memory-insight ───────────────────────────────────────────── + - description: 'memory-insight: detects recurring pattern' + vars: + taskId: memory-insight + text: 'Item 1: Skipped gym again. Item 2: Feeling tired at work. Item 3: Had coffee at 4pm to stay awake. Item 4: Skipped gym for the third time this week.' + assert: + - type: javascript + value: output.classes.includes('pattern') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'pattern' && + e.attributes?.frequency === 'recurring' + ) + + - description: 'memory-insight: detects relationship between items' + vars: + taskId: memory-insight + text: "Item 1: Stayed up until 2am coding. Item 2: Missed standup the next morning. Item 3: Felt foggy all day. Item 4: Late night again, can't stop." + assert: + - type: javascript + value: output.classes.includes('relationship') + - type: javascript + value: output.classes.includes('pattern') + + - description: 'memory-insight: detects milestone' + vars: + taskId: memory-insight + text: 'Item 1: Started learning Spanish 3 months ago. Item 2: Had first full conversation in Spanish today. Item 3: Completed Duolingo 90-day streak.' + assert: + - type: javascript + value: output.classes.includes('milestone') + - type: javascript + value: output.extractions.length >= 2 + + - description: 'memory-insight: detects recurring theme across entries' + vars: + taskId: memory-insight + text: 'Entry 1: Anxious before the presentation. Entry 2: Nervous about the client call. Entry 3: Worried about the demo tomorrow. Entry 4: Stressed about the board meeting.' + assert: + - type: javascript + value: output.classes.includes('recurring_theme') + - type: javascript + value: output.extractions.length >= 1 + + # ── reflection-enrichment ────────────────────────────────────── + - description: 'reflection: extracts accomplishment and concern' + vars: + taskId: reflection-enrichment + text: "Good day overall. Finally finished the proposal I've been putting off. Still worried about the budget review next week." + assert: + - type: javascript + value: output.classes.includes('accomplishment') + - type: javascript + value: output.classes.includes('concern') + - type: javascript + value: output.classes.includes('emotional_state') + + - description: 'reflection: positive emotional state detected' + vars: + taskId: reflection-enrichment + text: 'Had a fantastic week. Shipped the new feature, got great feedback from users, and the team celebrated together.' + assert: + - type: javascript + value: output.classes.includes('emotional_state') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'emotional_state' && + e.attributes?.valence === 'positive' + ) + - type: javascript + value: output.classes.includes('accomplishment') + + - description: 'reflection: goal progress detected' + vars: + taskId: reflection-enrichment + text: "I've been trying to read more this year. This month I finished my third book — ahead of my goal of one per month." + assert: + - type: javascript + value: output.classes.includes('goal_progress') + - type: javascript + value: output.classes.includes('accomplishment') + + - description: 'reflection: mixed positive and negative signals' + vars: + taskId: reflection-enrichment + text: "Proud of finishing the marathon training plan. But I'm really worried I won't be able to run the actual race — my knee has been acting up." + assert: + - type: javascript + value: output.classes.includes('accomplishment') + - type: javascript + value: output.classes.includes('concern') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'emotional_state' && + e.attributes?.valence === 'positive' + ) + + # ── bug-report-extraction ────────────────────────────────────── + - description: 'bug-report: extracts all 5 fields' + vars: + taskId: bug-report-extraction + text: 'When I click the save button on the settings page, nothing happens. It should save my preferences. This is a critical issue affecting all users.' + assert: + - type: javascript + value: output.classes.includes('steps_to_reproduce') + - type: javascript + value: output.classes.includes('expected_behavior') + - type: javascript + value: output.classes.includes('actual_behavior') + - type: javascript + value: output.classes.includes('affected_component') + - type: javascript + value: output.classes.includes('severity') + - type: javascript + value: | + output.extractions.some(e => + e.extraction_class === 'severity' && + e.attributes?.level === 'critical' + ) + + - description: 'bug-report: extracts steps and component from login bug' + vars: + taskId: bug-report-extraction + text: 'Steps: 1) Open login page, 2) Enter valid credentials, 3) Click login. Expected: redirect to dashboard. Actual: spinner shows forever. Affects the login page on mobile.' + assert: + - type: javascript + value: output.classes.includes('steps_to_reproduce') + - type: javascript + value: output.classes.includes('expected_behavior') + - type: javascript + value: output.classes.includes('actual_behavior') + - type: javascript + value: output.classes.includes('affected_component') diff --git a/services/extraction-service/evals/run-evals.sh b/services/extraction-service/evals/run-evals.sh new file mode 100755 index 00000000..670f43ee --- /dev/null +++ b/services/extraction-service/evals/run-evals.sh @@ -0,0 +1,89 @@ +#!/usr/bin/env bash +# run-evals.sh — Run promptfoo evals against the extraction-service +# +# Usage: +# ./evals/run-evals.sh # run all evals +# ./evals/run-evals.sh --task triage # filter by task (grep on description) +# ./evals/run-evals.sh --ci # CI mode: exit 1 on any failure +# ./evals/run-evals.sh --output json # output results as JSON + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +SERVICE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)" + +# ── Defaults ──────────────────────────────────────────────────── +EXTRACTION_SERVICE_URL="${EXTRACTION_SERVICE_URL:-http://localhost:4005}" +EVAL_PRODUCT_ID="${EVAL_PRODUCT_ID:-lysnrai}" +CI_MODE=false +OUTPUT_FORMAT="text" +TASK_FILTER="" + +# ── Parse args ────────────────────────────────────────────────── +while [[ $# -gt 0 ]]; do + case "$1" in + --ci) CI_MODE=true; shift ;; + --output) OUTPUT_FORMAT="$2"; shift 2 ;; + --task) TASK_FILTER="$2"; shift 2 ;; + *) echo "Unknown arg: $1"; exit 1 ;; + esac +done + +# ── Check service is reachable ─────────────────────────────────── +echo "→ Checking extraction-service at $EXTRACTION_SERVICE_URL ..." +if ! curl -sf "$EXTRACTION_SERVICE_URL/health" > /dev/null 2>&1; then + echo "✗ extraction-service is not running at $EXTRACTION_SERVICE_URL" + echo " Start it with: pnpm dev (in services/extraction-service/)" + exit 1 +fi +echo "✓ Service is up" + +# ── Check EXTRACTION_EVAL_TOKEN ────────────────────────────────── +if [[ -z "${EXTRACTION_EVAL_TOKEN:-}" ]]; then + echo "⚠ EXTRACTION_EVAL_TOKEN is not set — evals will fail auth" + echo " Get a token from platform-service: POST /api/auth/login" + echo " Then: export EXTRACTION_EVAL_TOKEN=" + if [[ "$CI_MODE" == "true" ]]; then + exit 1 + fi +fi + +# ── Build promptfoo args ───────────────────────────────────────── +PROMPTFOO_ARGS=( + eval + --config "$SCRIPT_DIR/promptfoo.yaml" + --output "$OUTPUT_FORMAT" + --no-cache +) + +if [[ "$CI_MODE" == "true" ]]; then + PROMPTFOO_ARGS+=(--no-progress-bar) +fi + +if [[ -n "$TASK_FILTER" ]]; then + PROMPTFOO_ARGS+=(--filter-description "$TASK_FILTER") +fi + +# ── Run ───────────────────────────────────────────────────────── +echo "→ Running evals (task: ${TASK_FILTER:-all}) ..." +echo "" + +export EXTRACTION_SERVICE_URL +export EXTRACTION_EVAL_TOKEN +export EVAL_PRODUCT_ID + +cd "$SERVICE_DIR" +npx promptfoo "${PROMPTFOO_ARGS[@]}" + +EXIT_CODE=$? + +if [[ $EXIT_CODE -eq 0 ]]; then + echo "" + echo "✓ All evals passed" +else + echo "" + echo "✗ Some evals failed (exit $EXIT_CODE)" + if [[ "$CI_MODE" == "true" ]]; then + exit $EXIT_CODE + fi +fi