# Extraction Service — LLM Evals

Quality evals for all 5 built-in extraction tasks using [promptfoo](https://promptfoo.dev).

## Structure

```
evals/
├── promptfoo.yaml          # Main eval config — all test cases + assertions
├── fixtures/
│   └── golden.json         # Golden input/output fixtures (machine-readable)
├── run-evals.sh            # Shell runner with health check + auth guard
└── README.md               # This file
```

## Prerequisites

1. **extraction-service running** on port 4005:

   ```bash
   pnpm dev
   ```

2. **Auth token** from platform-service:

   ```bash
   # POST /api/auth/login → copy the token
   export EXTRACTION_EVAL_TOKEN="<your-jwt>"
   ```

3. **promptfoo** (fetched on demand via `npx`, or install it globally):

   ```bash
   npm install -g promptfoo
   ```

## Running Evals

```bash
# All tasks
pnpm eval

# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction

# CI mode (exits non-zero on any failure)
pnpm eval:ci

# JSON output (for dashboards / reporting)
pnpm eval:json
```

## What's Tested

| Task                    | Cases | Key Assertions                                                            |
| ----------------------- | ----- | ------------------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question classes present         |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence, date_reference |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme               |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress           |
| `bug-report-extraction` | 2     | all 5 fields extracted, severity level attribute                          |

**Total: 19 eval cases, 60+ assertions**

## Adding New Cases

1. Add a test case to `promptfoo.yaml` under `tests:`:

   ```yaml
   - description: 'triage: home brain signal for household content'
     vars:
       taskId: triage
       text: 'Need to fix the leaking faucet in the kitchen this weekend.'
     assert:
       - type: javascript
         value: output.classes.includes('action')
       - type: javascript
         value: |
           output.extractions.some(e =>
             e.extraction_class === 'brain_signal' &&
             e.attributes?.brain === 'home'
           )
   ```

2. Optionally add the fixture to `fixtures/golden.json` for machine-readable tracking.
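
To try an assertion before committing it to `promptfoo.yaml`, the inline JavaScript from step 1 can be lifted into a standalone function. The sketch below is illustrative only: the `ExtractionOutput` shape is inferred from the fields the example assertion touches (`classes`, `extractions`, `extraction_class`, `attributes`), and `hasBrainSignal` and `sampleOutput` are hypothetical names, not part of the service.

```typescript
// Shape inferred from the example assertion; the real response may carry more fields.
interface Extraction {
  extraction_class: string;
  attributes?: Record<string, string>;
}

interface ExtractionOutput {
  classes: string[];
  extractions: Extraction[];
}

// Hypothetical helper: the same predicate as the YAML assertion, parameterized by brain.
function hasBrainSignal(output: ExtractionOutput, brain: string): boolean {
  return output.extractions.some(
    (e) => e.extraction_class === "brain_signal" && e.attributes?.brain === brain
  );
}

// Hand-written sample for illustration, not real model output.
const sampleOutput: ExtractionOutput = {
  classes: ["action", "brain_signal"],
  extractions: [
    { extraction_class: "action" },
    { extraction_class: "brain_signal", attributes: { brain: "home" } },
  ],
};

console.log(sampleOutput.classes.includes("action")); // true
console.log(hasBrainSignal(sampleOutput, "home")); // true
console.log(hasBrainSignal(sampleOutput, "work")); // false
```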

## Local OSS Models (Ollama)

Run the same 19 eval cases against a local model — zero API cost, fully private.

### Install Ollama

```bash
# 1. Install
brew install ollama

# 2. Start server (keep this terminal open)
ollama serve

# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b

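# If corp proxy blocks it, try bypassing. The server process performs the download,
# so restart step 2 with the proxy variables cleared, then pull again:
#   HTTPS_PROXY="" HTTP_PROXY="" ollama serve
ollama pull llama3.1:8b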

# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap
```

### Run Ollama Evals

```bash
# Default model: llama3.1:8b
pnpm eval:ollama

# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama
```

> **Note:** Ollama evals hit the model directly (no extraction-service needed).
> Timeout is 45s per case (vs 15s for Gemini) — local inference is slower.

### Compare Gemini vs Ollama

Runs both suites and prints a side-by-side pass-rate table:

```bash
# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare

# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare
```

Example output:

```
  Provider     Passed     Total    Rate     Progress
  ───────────────────────────────────────────────────────
  Gemini       57         60       95%      █████████░
  Ollama       48         60       80%      ████████░░

  Per-task breakdown:
  Task                      Gemini       Ollama
  ──────────────────────────────────────────────────
  triage                    20/20 (100%) 16/20 (80%)
  transcript-extraction     15/16 (94%)  12/16 (75%)
  reflection-enrichment     12/12 (100%) 10/12 (83%)
  memory-insight            8/8 (100%)   7/8 (88%)
  bug-report-extraction     2/4 (50%)    3/4 (75%)
```
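
The Rate and Progress columns follow directly from the passed/total counts. The snippet below is a minimal sketch of that arithmetic, not the code behind `pnpm eval:compare`. `passRate` and `progressBar` are hypothetical names, and the rounding rules (round the percentage, floor the bar) are assumptions chosen because they reproduce the figures in the example output.

```typescript
// Percentage rounded to the nearest integer (assumed rounding rule).
function passRate(passed: number, total: number): number {
  if (total <= 0) return 0;
  return Math.round((passed * 100) / total);
}

// Fixed-width bar; a partially earned cell stays empty (assumed flooring rule).
function progressBar(passed: number, total: number, width: number = 10): string {
  if (total <= 0) return "░".repeat(width);
  const filled = Math.min(width, Math.floor((passed * width) / total));
  return "█".repeat(filled) + "░".repeat(width - filled);
}

console.log(`${passRate(57, 60)}% ${progressBar(57, 60)}`); // 95% █████████░
console.log(`${passRate(48, 60)}% ${progressBar(48, 60)}`); // 80% ████████░░
```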

### Supported Local Models

| Model          | Pull command               | RAM needed | Notes                       |
| -------------- | -------------------------- | ---------- | --------------------------- |
| `llama3.1:8b`  | `ollama pull llama3.1:8b`  | ~6GB       | Best default                |
| `qwen2.5:7b`   | `ollama pull qwen2.5:7b`   | ~5GB       | Strong JSON output          |
| `phi4:14b`     | `ollama pull phi4:14b`     | ~9GB       | Good reasoning              |
| `llama3.3:70b` | `ollama pull llama3.3:70b` | ~40GB      | Best quality, needs M2 Max+ |

## CI Integration

Add to your GitHub Actions workflow:

```yaml
- name: Run extraction evals
  env:
    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
    EXTRACTION_SERVICE_URL: http://localhost:4005
  run: pnpm --filter @lysnrai/extraction-service eval:ci
```

The service must already be running and reachable at `EXTRACTION_SERVICE_URL` when this step executes, so start it in an earlier step of the job.

## Interpreting Results

promptfoo outputs a table showing pass/fail per assertion. A case fails if **any** assertion fails.
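
The case-level rule is a logical AND over the assertion results. As a minimal sketch, assuming a simplified result shape (`AssertionResult` with a `label` and a `pass` flag, which is not promptfoo's actual output schema) and a hypothetical helper `caseVerdict`:

```typescript
// Simplified, assumed shape; promptfoo's real result objects differ.
interface AssertionResult {
  label: string;
  pass: boolean;
}

// A case passes only when no assertion failed; failed labels are kept for triage.
function caseVerdict(results: AssertionResult[]): { pass: boolean; failed: string[] } {
  const failed = results.filter((r) => !r.pass).map((r) => r.label);
  return { pass: failed.length === 0, failed };
}

const verdict = caseVerdict([
  { label: "has action class", pass: true },
  { label: "brain_signal is home", pass: false },
]);

console.log(verdict.pass); // false
console.log(verdict.failed.join(", ")); // brain_signal is home
```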

Common failure patterns:

- **Missing class** — model didn't extract that entity type; consider adding more examples to the task seed
- **Wrong attribute** — `brain_signal.brain` or `emotion.valence` incorrect; refine the task prompt in `seed.ts`
- **Latency > 15s** — sidecar overloaded or model cold-starting; check Python sidecar logs

## Thresholds

- **Latency:** 15s max per extraction (default; adjust `threshold` in `promptfoo.yaml`)
- **Pass rate target:** 100% for golden cases (their inputs are unambiguous enough to expect consistent results)
- **LLM-as-judge:** Not yet implemented — add when you have enough production data to define rubrics
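
As a rough illustration of how the two numeric thresholds combine, the sketch below checks a list of case results against the 15s latency ceiling and the 100% pass-rate target. The `CaseResult` shape and `checkThresholds` are hypothetical; actual enforcement is configured in `promptfoo.yaml` and may differ.

```typescript
// Hypothetical per-case summary, not promptfoo's output schema.
interface CaseResult {
  description: string;
  passed: boolean;
  latencyMs: number;
}

const MAX_LATENCY_MS = 15000; // 15s max per extraction
const TARGET_PASS_RATE = 1.0; // 100% for golden cases

// Returns one message per violated threshold; an empty list means the run is within limits.
function checkThresholds(cases: CaseResult[]): string[] {
  const violations: string[] = [];
  for (const c of cases) {
    if (c.latencyMs > MAX_LATENCY_MS) {
      violations.push(`${c.description}: ${c.latencyMs}ms exceeds ${MAX_LATENCY_MS}ms`);
    }
  }
  const passedCount = cases.filter((c) => c.passed).length;
  const rate = cases.length === 0 ? 1 : passedCount / cases.length;
  if (rate < TARGET_PASS_RATE) {
    violations.push(`pass rate ${passedCount}/${cases.length} is below target`);
  }
  return violations;
}

// Made-up numbers for illustration only.
const sampleRun: CaseResult[] = [
  { description: "triage: household content", passed: true, latencyMs: 4200 },
  { description: "bug-report: severity", passed: false, latencyMs: 16800 },
];

console.log(checkThresholds(sampleRun).length); // 2
```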