learning_ai_common_plat/services/extraction-service/evals/README.md
saravanakumardb1 acd4c3542b feat(extraction-service): scaffold promptfoo eval suite with 19 test cases
- Add evals/promptfoo.yaml: HTTP provider hitting extraction-service API
  covering all 5 built-in tasks (transcript, triage, memory-insight,
  reflection-enrichment, bug-report-extraction)
- Add evals/fixtures/golden.json: machine-readable golden input/output fixtures
- Add evals/run-evals.sh: shell runner with health checks, auth token
  handling, task filtering, and CI mode
- Add evals/README.md: usage docs, prerequisites, cost estimates, CI integration
2026-02-19 12:19:16 -08:00

195 lines
6.3 KiB
Markdown

# Extraction Service — LLM Evals
Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).
## Structure
```
evals/
├── promptfoo.yaml # Main eval config — all test cases + assertions
├── fixtures/
│ └── golden.json # Golden input/output fixtures (machine-readable)
├── run-evals.sh # Shell runner with health check + auth guard
└── README.md # This file
```
## Prerequisites
1. **extraction-service running** on port 4005:
```bash
pnpm dev
```
2. **Auth token** from platform-service:
```bash
# POST /api/auth/login → copy the token
export EXTRACTION_EVAL_TOKEN=<your-jwt>
```
3. **promptfoo** (installed automatically via `npx`, or globally):
```bash
npm install -g promptfoo
```
## Running Evals
```bash
# All tasks
pnpm eval
# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction
# CI mode (exits non-zero on any failure)
pnpm eval:ci
# JSON output (for dashboards / reporting)
pnpm eval:json
```
## What's Tested
| Task | Cases | Key Assertions |
| ----------------------- | ----- | ------------------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute |
**Total: 19 eval cases, 60+ assertions**
## Adding New Cases
1. Add a test case to `promptfoo.yaml` under `tests:`:
```yaml
- description: 'triage: home brain signal for household content'
vars:
taskId: triage
text: 'Need to fix the leaking faucet in the kitchen this weekend.'
assert:
- type: javascript
value: output.classes.includes('action')
- type: javascript
value: |
output.extractions.some(e =>
e.extraction_class === 'brain_signal' &&
e.attributes?.brain === 'home'
)
```
2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.
## Local OSS Models (Ollama)
Run the same 19 eval cases against a local model — zero API cost, fully private.
### Install Ollama
```bash
# 1. Install
brew install ollama
# 2. Start server (keep this terminal open)
ollama serve
# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b
# If corp proxy blocks it, try bypassing:
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b
# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
```
### Run Ollama Evals
```bash
# Default model: llama3.1:8b
pnpm eval:ollama
# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
```
> **Note:** Ollama evals hit the model directly (no extraction-service needed).
> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.
### Compare Gemini vs Ollama
Runs both suites and prints a side-by-side pass-rate table:
```bash
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare
# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
```
Example output:
```
Provider Passed Total Rate Progress
───────────────────────────────────────────────────────
Gemini 57 60 95% █████████░
Ollama 48 60 80% ████████░░
Per-task breakdown:
Task Gemini Ollama
──────────────────────────────────────────────────
triage 20/20 (100%) 16/20 (80%)
transcript-extraction 15/16 (94%) 12/16 (75%)
reflection-enrichment 12/12 (100%) 10/12 (83%)
memory-insight 8/8 (100%) 7/8 (88%)
bug-report-extraction 2/4 (50%) 3/4 (75%)
```
### Supported Local Models
| Model | Pull command | RAM needed | Notes |
| -------------- | -------------------------- | ---------- | --------------------------- |
| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default |
| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output |
| `phi4:14b` | `ollama pull phi4` | ~9GB | Good reasoning |
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ |
## CI Integration
Add to your GitHub Actions workflow:
```yaml
- name: Run extraction evals
env:
EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
EXTRACTION_SERVICE_URL: http://localhost:4005
run: pnpm --filter @lysnrai/extraction-service eval:ci
```
The service must be started before running evals in CI.
## Interpreting Results
promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.
Common failure patterns:
- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed
- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts`
- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs
## Thresholds
- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
- **Pass rate target:** 100% for golden cases (these are deterministic enough inputs)
- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics