Extraction Service — LLM Evals

Quality evals for all 5 built-in extraction tasks using promptfoo.

Structure

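evals/
├── promptfoo.yaml              # Main eval config: all test cases + assertions
├── promptfoo.ollama.yaml       # Eval config for local Ollama models
├── fixtures/
│   └── golden.json             # Golden input/output fixtures (machine-readable)
├── run-evals.sh                # Shell runner with health check + auth guard
├── run-ollama-evals-logged.sh  # Unattended Ollama runner with structured logging
├── compare-evals.sh            # Gemini vs Ollama comparison script
└── README.md                   # This file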

Prerequisites

  1. extraction-service running on port 4005:

    pnpm dev
    
  2. Auth token from platform-service:

    # POST /api/auth/login → copy the token
    export EXTRACTION_EVAL_TOKEN=<your-jwt>
    
  3. promptfoo (installed automatically via npx, or globally):

    npm install -g promptfoo
    

Running Evals

# All tasks
pnpm eval

# Single task
pnpm eval:task triage
pnpm eval:task transcript-extraction
pnpm eval:task memory-insight
pnpm eval:task reflection-enrichment
pnpm eval:task bug-report-extraction

# CI mode (exits non-zero on any failure)
pnpm eval:ci

# JSON output (for dashboards / reporting)
pnpm eval:json

What's Tested

Task                     Cases  Key Assertions
──────────────────────────────────────────────────────────────────────────────────────────────────
transcript-extraction    4      action_item, deadline, person, decision, question classes present
triage                   5      brain_signal routing (health/work/money), emotion valence, date_reference
memory-insight           4      pattern frequency, relationship, milestone, recurring_theme
reflection-enrichment    4      emotional_state valence, accomplishment, concern, goal_progress
bug-report-extraction    2      all 5 fields extracted, severity level attribute

Total: 19 eval cases, 60+ assertions
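
The "classes present" checks in the table above can be pictured as a small helper. This is only a sketch, not code taken from promptfoo.yaml: the name missingClasses is hypothetical, and the output shape (a classes array of extraction class names) is assumed from the example assertion later in this README.

```javascript
// Hypothetical helper: lists the required classes that the output lacks.
// Assumes `output.classes` is an array of extraction class names.
function missingClasses(output, required) {
  const present = new Set(output.classes ?? []);
  return required.filter((name) => !present.has(name));
}

// Illustrative output only, not real service data.
const sampleOutput = { classes: ['action_item', 'deadline', 'person'] };
const requiredClasses = ['action_item', 'deadline', 'person', 'decision', 'question'];

console.log(missingClasses(sampleOutput, requiredClasses));
```

An empty result means every required class was extracted; anything else names the classes a case would fail on.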

Adding New Cases

  1. Add a test case to promptfoo.yaml under the tests: key:

    - description: 'triage: home brain signal for household content'
      vars:
        taskId: triage
        text: 'Need to fix the leaking faucet in the kitchen this weekend.'
      assert:
        - type: javascript
          value: output.classes.includes('action')
        - type: javascript
          value: |
            output.extractions.some(e =>
              e.extraction_class === 'brain_signal' &&
              e.attributes?.brain === 'home'
            )        
    
  2. Optionally add the fixture to fixtures/golden.json for machine-readable tracking.
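
Before adding an assertion to promptfoo.yaml, it can help to try its logic against a hand-written output in plain Node. The sketch below wraps the brain_signal check from the example above in a function; hasBrainSignal and fakeOutput are made up for illustration, and only the fields the example assertion already uses (extraction_class, attributes.brain) are assumed.

```javascript
// Same check as the second assertion in the YAML example, as a plain function.
function hasBrainSignal(output, brain) {
  return output.extractions.some(
    (e) => e.extraction_class === 'brain_signal' && e.attributes?.brain === brain
  );
}

// Made-up output for illustration; not a real response from the service.
const fakeOutput = {
  classes: ['action', 'brain_signal'],
  extractions: [
    { extraction_class: 'action' }, // no attributes: optional chaining keeps this safe
    { extraction_class: 'brain_signal', attributes: { brain: 'home' } },
  ],
};

console.log(hasBrainSignal(fakeOutput, 'home')); // true
console.log(hasBrainSignal(fakeOutput, 'work')); // false
```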

Local OSS Models (Ollama)

Run the same 19 eval cases against a local model — zero API cost, fully private.

Install Ollama

# 1. Install
brew install ollama

# 2. Start server (keep this terminal open)
ollama serve

# 3. Pull Llama 3.1 8B (4.7GB — needs internet, may be blocked by corp proxy)
ollama pull llama3.1:8b

# If corp proxy blocks it, try bypassing:
HTTPS_PROXY="" HTTP_PROXY="" ollama pull llama3.1:8b

# 4. Verify
ollama run llama3.1:8b "List action items: Sarah will call the dentist by Friday." --nowordwrap

Run Ollama Evals

# Default model: llama3.1:8b
pnpm eval:ollama

# Different model (once pulled)
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
OLLAMA_MODEL=phi4:14b pnpm eval:ollama

Note: Ollama evals call the model directly, so extraction-service does not need to be running. The timeout is 45s per case (vs 15s for Gemini) because local inference is slower.

Compare Gemini vs Ollama

Runs both suites and prints a side-by-side pass-rate table:

# Requires: extraction-service running + EXTRACTION_EVAL_TOKEN set + ollama serve running
pnpm eval:compare

# With a different local model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:compare

Example output:

  Provider     Passed     Total    Rate     Progress
  ───────────────────────────────────────────────────────
  Gemini       57         60       95%      █████████░
  Ollama       48         60       80%      ████████░░

  Per-task breakdown:
  Task                      Gemini       Ollama
  ──────────────────────────────────────────────────
  triage                    20/20 (100%) 16/20 (80%)
  transcript-extraction     15/16 (94%)  12/16 (75%)
  reflection-enrichment     12/12 (100%) 10/12 (83%)
  memory-insight            8/8 (100%)   7/8 (88%)
  bug-report-extraction     2/4 (50%)    3/4 (75%)
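
The rate and progress columns follow from the passed and total counts. The sketch below shows one way to derive them; it is not the actual compare-evals.sh logic, and summaryRow is a hypothetical name. The rounding choices (Math.round for the percentage, one filled block per full 10%) are assumptions picked because they reproduce the numbers in the example output.

```javascript
// Sketch only: derives the rate and progress bar for one summary row.
function summaryRow(provider, passed, total) {
  const rate = total === 0 ? 0 : Math.round((passed * 100) / total); // whole percent
  const filled = Math.floor(rate / 10); // one block per full 10%
  const bar = '█'.repeat(filled) + '░'.repeat(10 - filled);
  return { provider, passed, total, rate: `${rate}%`, bar };
}

console.log(summaryRow('Gemini', 57, 60));
console.log(summaryRow('Ollama', 48, 60));
```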

Supported Local Models

Model          Pull command               RAM needed   Notes
──────────────────────────────────────────────────────────────────────────────────────────
llama3.1:8b    ollama pull llama3.1:8b    ~6GB         Best default
qwen2.5:7b     ollama pull qwen2.5:7b     ~5GB         Strong JSON output
phi4:14b       ollama pull phi4:14b       ~9GB         Good reasoning
llama3.3:70b   ollama pull llama3.3:70b   ~40GB        Best quality; needs M2 Max or better

CI Integration

Add to your GitHub Actions workflow:

- name: Run extraction evals
  env:
    EXTRACTION_EVAL_TOKEN: ${{ secrets.EXTRACTION_EVAL_TOKEN }}
    EXTRACTION_SERVICE_URL: http://localhost:4005
  run: pnpm --filter @lysnrai/extraction-service eval:ci

The service must be started before running evals in CI.
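
If the service is started in the background in an earlier step, the job may need to wait until it accepts requests. The following is a minimal sketch of such a wait loop, assuming Node 18+ (for the built-in fetch). The helper name waitUntilReady and the /health path are assumptions; use whatever endpoint the service actually exposes.

```javascript
// Polls `probe` until it resolves to true or the attempts run out.
// Resolves with the number of attempts used.
async function waitUntilReady(probe, { attempts = 30, delayMs = 1000 } = {}) {
  for (let i = 1; i <= attempts; i++) {
    // A rejected probe (e.g. connection refused) counts as "not ready yet".
    if (await probe().catch(() => false)) return i;
    if (i < attempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`service not ready after ${attempts} attempts`);
}

// Assumed probe: GET /health on EXTRACTION_SERVICE_URL. The path is a guess.
const baseUrl = process.env.EXTRACTION_SERVICE_URL ?? 'http://localhost:4005';
const httpProbe = () => fetch(`${baseUrl}/health`).then((res) => res.ok);
```

Calling waitUntilReady(httpProbe) before the eval step fails the job early with a clear message instead of letting every case time out.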

Interpreting Results

promptfoo outputs a table showing pass/fail per assertion. A case fails if any assertion fails.

Common failure patterns:

  • Missing class: model didn't extract that entity type; consider adding more examples to the task seed
  • Wrong attribute: brain_signal.brain or emotion.valence incorrect; refine the task prompt in seed.ts
  • Latency > 15s: sidecar overloaded or model cold-starting; check Python sidecar logs

Thresholds

  • Latency: 15s max per extraction (default; adjust threshold in promptfoo.yaml)
  • Pass rate target: 100% for golden cases (these inputs are deterministic enough that results should be consistent across runs)
  • LLM-as-judge: Not yet implemented — add when you have enough production data to define rubrics
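
The latency and pass-rate thresholds can be combined into a single gate over a finished run. The sketch below assumes a simplified, hypothetical record per case (description, latencyMs, assertions with a boolean pass); the real promptfoo JSON output has its own schema, so treat the field names as placeholders.

```javascript
// 15s latency threshold from this README, in milliseconds.
const MAX_LATENCY_MS = 15_000;

// Hypothetical gate: a case passes only if all of its assertions pass,
// and the run is ok only at a 100% pass rate with no slow cases.
function evaluateRun(cases) {
  const failed = cases.filter((c) => !c.assertions.every((a) => a.pass));
  const slow = cases.filter((c) => c.latencyMs > MAX_LATENCY_MS);
  const passRate = cases.length === 0 ? 0 : (cases.length - failed.length) / cases.length;
  return {
    passRate,
    failed: failed.map((c) => c.description),
    slow: slow.map((c) => c.description),
    ok: passRate === 1 && slow.length === 0,
  };
}

// Made-up run for illustration.
const sampleRun = [
  { description: 'triage: health signal', latencyMs: 2100, assertions: [{ pass: true }, { pass: true }] },
  { description: 'triage: money signal', latencyMs: 16200, assertions: [{ pass: true }] },
  { description: 'bug-report: severity', latencyMs: 3400, assertions: [{ pass: true }, { pass: false }] },
];

console.log(evaluateRun(sampleRun));
```

Here the run is not ok for two separate reasons: one case has a failing assertion and another exceeds the latency limit.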