# Extraction Service — LLM Evals

Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).

## Structure

```
evals/
├── promptfoo.yaml     # Main eval config — all test cases + assertions
├── fixtures/
│   └── golden.json    # Golden input/output fixtures (machine-readable)
├── run-evals.sh       # Shell runner with health check + auth guard
└── README.md          # This file
```

## Prerequisites

1. **extraction-service running** on port 4005:

   ```bash
   pnpm dev
   ```

2. **Auth token** from platform-service:

   ```bash
   # POST /api/auth/login → copy the token from the response
   export EXTRACTION_EVAL_TOKEN="<your-jwt>"
   ```

3. **promptfoo** (fetched automatically via `npx`, or install it globally):
   ```bash
   npm install -g promptfoo
   ```

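`run-evals.sh` is described above as having an auth guard, but its contents are not shown here. A minimal sketch of such a guard, assuming it only needs to confirm that `EXTRACTION_EVAL_TOKEN` is non-empty before any request is sent (the function name `require_token` is hypothetical):

```bash
# Hypothetical auth guard; not the actual contents of run-evals.sh.
require_token() {
  if [ -z "${EXTRACTION_EVAL_TOKEN:-}" ]; then
    echo "EXTRACTION_EVAL_TOKEN is not set; log in via platform-service first." >&2
    return 1
  fi
}
```

Calling `require_token || exit 1` at the top of a runner fails fast with a readable message instead of a string of 401 responses.
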
## Running Evals

```bash
# All tasks
pnpm eval

# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction

# CI mode (exits non-zero on any failure)
pnpm eval:ci

# JSON output (for dashboards / reporting)
pnpm eval:json
```

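`eval:task` takes one of the five task IDs listed above. How `run-evals.sh` validates that argument is not shown here; one possible sketch, assuming a plain allow-list check (the variable `KNOWN_TASKS` and the helper `is_known_task` are hypothetical names):

```bash
# Task IDs copied from the commands above; the helper itself is illustrative.
KNOWN_TASKS="triage transcript-extraction memory-insight reflection-enrichment bug-report-extraction"

# Succeeds only if the argument exactly matches one known task ID.
is_known_task() {
  local candidate="$1" task
  for task in $KNOWN_TASKS; do
    [ "$task" = "$candidate" ] && return 0
  done
  return 1
}

# Example guard inside a runner script:
#   is_known_task "$1" || { echo "unknown task: $1" >&2; exit 2; }
```

An exact-match check catches near misses such as `transcript` (the task ID is `transcript-extraction`) before any eval is started.
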
## What's Tested

| Task                    | Cases | Key Assertions                                                            |
| ----------------------- | ----- | ------------------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question classes present         |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence, date_reference |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme               |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress           |
| `bug-report-extraction` | 2     | all 5 fields extracted, severity level attribute                          |

**Total: 19 eval cases, 60+ assertions**

## Adding New Cases

1. Add a test case to `promptfoo.yaml` under `tests:`:

   ```yaml
   - description: 'triage: home brain signal for household content'
     vars:
       taskId: triage
       text: 'Need to fix the leaking faucet in the kitchen this weekend.'
     assert:
       - type: javascript
         value: output.classes.includes('action')
       - type: javascript
         value: |
           output.extractions.some(e =>
             e.extraction_class === 'brain_signal' &&
             e.attributes?.brain === 'home'
           )
   ```

2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.

## Local OSS Models (Ollama)

Run the same 19 eval cases against a local model — zero API cost, fully private.

### Install Ollama

```bash
# 1. Install (via Homebrew)
brew install ollama

# 2. Start server (keep this terminal open)
ollama serve

# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b

# If a corp proxy blocks it, try bypassing the proxy.
# If this has no effect, restart `ollama serve` with the same variables cleared,
# since the server process is the one that performs the download.
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b

# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
```

### Run Ollama Evals

```bash
# Default model: llama3.1:8b
pnpm eval:ollama

# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
```

> **Note:** Ollama evals hit the model directly (no extraction-service needed).
> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.

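Since a model has to be pulled before `OLLAMA_MODEL` can point at it, a pre-flight check can save a failed run. The sketch below assumes that `ollama list` prints one header row followed by one model per row with the model name in the first column; both helper names are hypothetical and not part of the eval scripts:

```bash
# Resolve the model as the commands above imply: env var, else the default.
resolve_model() {
  echo "${OLLAMA_MODEL:-llama3.1:8b}"
}

# Reads `ollama list` output on stdin and succeeds if the given model name
# appears in the first column of any row after the header.
model_is_pulled() {
  awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}

# Example:
#   ollama list | model_is_pulled "$(resolve_model)" || ollama pull "$(resolve_model)"
```

Reading the listing from stdin keeps the check independent of a running Ollama server, so it can be exercised with canned output.
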
### Compare Gemini vs Ollama

`pnpm eval:compare` runs both suites and prints a side-by-side pass-rate table:

```bash
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare

# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
```

Example output:

```
Provider   Passed   Total   Rate   Progress
───────────────────────────────────────────────────────
Gemini     57       60      95%    █████████░
Ollama     48       60      80%    ████████░░

Per-task breakdown:
Task                    Gemini          Ollama
──────────────────────────────────────────────────
triage                  20/20 (100%)    16/20 (80%)
transcript-extraction   15/16 (94%)     12/16 (75%)
reflection-enrichment   12/12 (100%)    10/12 (83%)
memory-insight          8/8 (100%)      7/8 (88%)
bug-report-extraction   2/4 (50%)       3/4 (75%)
```

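The percentages in the sample output are consistent with rounding `passed / total` to the nearest whole percent, and the bars with one filled cell per full 10%. The compare script itself is not shown here, so the following is only a sketch of that arithmetic under those assumptions (`pass_rate` and `progress_bar` are hypothetical names):

```bash
# Integer percentage, rounded to the nearest whole number (halves round up).
pass_rate() {
  local passed="$1" total="$2"
  if [ "$total" -eq 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (passed * 100 + total / 2) / total ))
}

# Ten-cell bar with one filled cell per full 10 percentage points.
progress_bar() {
  local filled=$(( $1 / 10 )) bar="" i=0
  while [ "$i" -lt 10 ]; do
    if [ "$i" -lt "$filled" ]; then bar="${bar}█"; else bar="${bar}░"; fi
    i=$(( i + 1 ))
  done
  echo "$bar"
}
```

For instance, `pass_rate 15 16` gives 94 and `pass_rate 7 8` gives 88, matching the per-task rows above.
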
### Supported Local Models

| Model          | Pull command               | RAM needed | Notes                       |
| -------------- | -------------------------- | ---------- | --------------------------- |
| `llama3.1:8b`  | `ollama pull llama3.1:8b`  | ~6GB       | Best default                |
| `qwen2.5:7b`   | `ollama pull qwen2.5:7b`   | ~5GB       | Strong JSON output          |
| `phi4:14b`     | `ollama pull phi4:14b`     | ~9GB       | Good reasoning              |
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB      | Best quality, needs M2 Max+ |

## CI Integration

Add to your GitHub Actions workflow:

```yaml
- name: Run extraction evals
  env:
    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
    EXTRACTION_SERVICE_URL: http://localhost:4005
  run: pnpm --filter @lysnrai/extraction-service eval:ci
```

The service must be started before running evals in CI.

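Because the service has to be up before the eval step runs, a CI job typically needs a wait step between starting it and calling `eval:ci`. A minimal sketch of such a wait loop, assuming a generic retry helper (`retry_until` is a hypothetical name, and the `/health` path in the usage comment is an assumption, not a documented endpoint):

```bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry_until <attempts> <delay-seconds> <command...>
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (health path is hypothetical):
#   retry_until 30 2 curl -fsS "${EXTRACTION_SERVICE_URL:-http://localhost:4005}/health"
```

Passing the probe as a command keeps the loop reusable: the same helper can wait on `ollama serve` or any other dependency.
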
## Interpreting Results

promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.

Common failure patterns:

- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed
- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts`
- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs

## Thresholds

- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
- **Pass rate target:** 100% for golden cases (the golden inputs are unambiguous enough that every case should pass consistently)
- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics