# Extraction Service — LLM Evals Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev). ## Structure ``` evals/ ├── promptfoo.yaml # Main eval config — all test cases + assertions ├── fixtures/ │ └── golden.json # Golden input/output fixtures (machine-readable) ├── run-evals.sh # Shell runner with health check + auth guard └── README.md # This file ``` ## Prerequisites 1. **extraction-service running** on port 4005: ```bash pnpm dev ``` 2. **Auth token** from platform-service: ```bash # POST /api/auth/login → copy the token export EXTRACTION_EVAL_TOKEN= ``` 3. **promptfoo** (installed automatically via `npx`, or globally): ```bash npm install -g promptfoo ``` ## Running Evals ```bash # All tasks pnpm eval # Single task pnpm eval:task triage pnpm eval:task transcript-extraction pnpm eval:task memory-insight pnpm eval:task reflection-enrichment pnpm eval:task bug-report-extraction # CI mode (exits non-zero on any failure) pnpm eval:ci # JSON output (for dashboards / reporting) pnpm eval:json ``` ## What's Tested | Task | Cases | Key Assertions | | ----------------------- | ----- | ------------------------------------------------------------------------- | | `transcript-extraction` | 4 | action_item, deadline, person, decision, question classes present | | `triage` | 5 | brain_signal routing (health/work/money), emotion valence, date_reference | | `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme | | `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress | | `bug-report-extraction` | 2 | all 5 fields extracted, severity level attribute | **Total: 19 eval cases, 60+ assertions** ## Adding New Cases 1. Add a test case to `promptfoo.yaml` under `tests:`: ```yaml - description: 'triage: home brain signal for household content' vars: taskId: triage text: 'Need to fix the leaking faucet in the kitchen this weekend.' assert: - type: javascript value: output.classes.includes('action') - type: javascript value: | output.extractions.some(e => e.extraction_class === 'brain_signal' && e.attributes?.brain === 'home' ) ``` 2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking. ## Local OSS Models (Ollama) Run the same 19 eval cases against a local model — zero API cost, fully private. ### Install Ollama ```bash # 1. Install brew install ollama # 2. Start server (keep this terminal open) ollama serve # 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy) ollama pull llama3.1:8b # If corp proxy blocks it, try bypassing: HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b # 4. Verify ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap ``` ### Run Ollama Evals ```bash # Default model: llama3.1:8b pnpm eval:ollama # Different model (once pulled) OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama OLLAMA_MODEL=phi4:14b pnpm eval:ollama ``` > **Note:** Ollama evals hit the model directly (no extraction-service needed). > Timeout is 45s per case (vs 15s for Gemini) — local inference is slower. ### Compare Gemini vs Ollama Runs both suites and prints a side-by-side pass-rate table: ```bash # Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running pnpm eval:compare # With a different local model OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare ``` Example output: ``` Provider Passed Total Rate Progress ─────────────────────────────────────────────────────── Gemini 57 60 95% █████████░ Ollama 48 60 80% ████████░░ Per-task breakdown: Task Gemini Ollama ────────────────────────────────────────────────── triage 20/20 (100%) 16/20 (80%) transcript-extraction 15/16 (94%) 12/16 (75%) reflection-enrichment 12/12 (100%) 10/12 (83%) memory-insight 8/8 (100%) 7/8 (88%) bug-report-extraction 2/4 (50%) 3/4 (75%) ``` ### Supported Local Models | Model | Pull command | RAM needed | Notes | | -------------- | -------------------------- | ---------- | --------------------------- | | `llama3.1:8b` | `ollama pull llama3.1:8b` | ~6GB | Best default | | `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~5GB | Strong JSON output | | `phi4:14b` | `ollama pull phi4` | ~9GB | Good reasoning | | `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB | Best quality, needs M2 Max+ | ## CI Integration Add to your GitHub Actions workflow: ```yaml - name: Run extraction evals env: EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }} EXTRACTION_SERVICE_URL: http://localhost:4005 run: pnpm --filter @lysnrai/extraction-service eval:ci ``` The service must be started before running evals in CI. ## Interpreting Results promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails. Common failure patterns: - **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed - **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts` - **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs ## Thresholds - **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`) - **Pass rate target:** 100% for golden cases (these are deterministic enough inputs) - **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics