saravanakumardb1 80f794dee7 docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management,
  idle timeout, manual load/unload, OpenAI-compatible API, native API reference,
  performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases
  across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server,
  evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings

2026-02-19 13:01:05 -08:00


06 — Extraction Service Evals

Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

Overview

The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:

Zero-cost iteration on extraction prompts
Side-by-side comparison of local vs cloud model quality
Offline development when cloud APIs are unavailable

Files

| File | Purpose |
| --- | --- |
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |

Running Evals

cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare

Prerequisites

Ollama must be running (`ollama serve`)
A model must be available (`ollama pull llama3.1:8b`)
For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
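
The first two prerequisites can be checked programmatically before a run. The sketch below is not part of the repo. It assumes Ollama's default address (`http://localhost:11434`) and its native `GET /api/tags` endpoint, which lists locally pulled models. The names `isModelAvailable` and `checkOllama` are hypothetical.

```javascript
// Hypothetical pre-flight check; not part of the extraction-service repo.
// Assumes Ollama's native GET /api/tags endpoint, which responds with
// a payload shaped like { models: [{ name: "llama3.1:8b", ... }, ...] }.

const DEFAULT_BASE_URL = "http://localhost:11434";

// Pure helper: does the /api/tags payload list the requested model?
// A model pulled without an explicit tag is listed as "<name>:latest".
function isModelAvailable(tagsResponse, model) {
  const wanted = model.includes(":") ? model : `${model}:latest`;
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  return models.some((m) => m.name === wanted);
}

// Network helper: resolves to { running, available } and never throws.
async function checkOllama(model, baseUrl = DEFAULT_BASE_URL) {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    if (!res.ok) return { running: false, available: false };
    const tags = await res.json();
    return { running: true, available: isModelAvailable(tags, model) };
  } catch {
    // A connection error usually means `ollama serve` is not running.
    return { running: false, available: false };
  }
}

// Offline demonstration with a sample payload (no server required).
const sampleTags = { models: [{ name: "llama3.1:8b" }, { name: "qwen2.5:7b" }] };
console.log(isModelAvailable(sampleTags, "llama3.1:8b")); // true
console.log(isModelAvailable(sampleTags, "phi4"));        // false
```

With a server running, `checkOllama("llama3.1:8b").then(console.log)` reports both conditions in one call.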

Eval Coverage

| Task | Cases | Key Assertions |
| --- | --- | --- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |

Total: 19 cases, 50+ assertions
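
The per-task case counts above are the denominators for any pass-rate comparison. A minimal sketch of that arithmetic follows, assuming a hypothetical result shape of `{ task, passed }` per case. The actual output of `compare-evals.sh` and promptfoo may be structured differently.

```javascript
// Case counts per task, copied from the coverage table above.
const CASES_PER_TASK = {
  "transcript-extraction": 4,
  "triage": 5,
  "memory-insight": 4,
  "reflection-enrichment": 4,
  "bug-report-extraction": 2,
};

// Sum of all cases across tasks.
function totalCases(counts = CASES_PER_TASK) {
  return Object.values(counts).reduce((sum, n) => sum + n, 0);
}

// results: hypothetical array of { task: string, passed: boolean }.
// Returns a map of task -> fraction of that task's cases that passed.
function passRates(results, counts = CASES_PER_TASK) {
  const passedByTask = {};
  for (const r of results) {
    if (r.passed) passedByTask[r.task] = (passedByTask[r.task] ?? 0) + 1;
  }
  const rates = {};
  for (const [task, n] of Object.entries(counts)) {
    rates[task] = (passedByTask[task] ?? 0) / n;
  }
  return rates;
}

console.log(totalCases()); // 19
```

Running `passRates` once per provider gives two maps that can be printed side by side.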

Important: Assertion Pattern

Ollama returns the model output as a raw JSON string, so each assertion must parse it before reading any fields:

# ✅ Correct: parse the string first (written as a single expression, so no explicit return is needed)
- type: javascript
  value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"

# ❌ Wrong: output is a string, not an object, so output.extractions is undefined
- type: javascript
  value: "output.extractions.map(e => e.extraction_class).includes('action')"
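
Because every assertion repeats the same parse step, the logic can be sketched as a small function that is easy to test outside promptfoo. This is illustrative only: `parseOutput` and `hasExtractionClass` are hypothetical names, and the `extractions[].extraction_class` shape is taken from the example above. Returning `false` on malformed JSON instead of throwing is a design choice here, not something the eval suite is known to do.

```javascript
// Hypothetical helpers mirroring the inline assertion pattern.

// Parse the raw model output; returns null when it is not valid JSON.
function parseOutput(output) {
  if (typeof output !== "string") return output;
  try {
    return JSON.parse(output);
  } catch {
    return null;
  }
}

// True when the parsed output contains an extraction of the given class.
function hasExtractionClass(output, cls) {
  const parsed = parseOutput(output);
  const extractions = Array.isArray(parsed?.extractions) ? parsed.extractions : [];
  return extractions.some((e) => e.extraction_class === cls);
}

// Sample string shaped like the output the assertions above expect.
const rawOutput = '{"extractions":[{"extraction_class":"action"}]}';
console.log(hasExtractionClass(rawOutput, "action"));    // true
console.log(hasExtractionClass(rawOutput, "deadline"));  // false
console.log(hasExtractionClass("not json", "action"));   // false
```

Treating unparseable output as a failed assertion keeps a single bad response from aborting the whole run, at the cost of hiding the parse error itself.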

Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:

export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b

The variable names above are illustrative. Check `services/extraction-service/python/` for the exact names, since the sidecar config may use different keys depending on the LangExtract version.
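
To sanity-check these settings independently of the sidecar, the same values can be used to call Ollama's OpenAI-compatible `/v1/chat/completions` endpoint directly. The sketch below does not show how the sidecar itself works. It reuses the `LANGEXTRACT_*` names from above as an assumption, and `buildChatRequest` and `smokeTest` are hypothetical helpers.

```javascript
// Hypothetical smoke test for the OpenAI-compatible endpoint.
// The LANGEXTRACT_* names are assumed from this document, not verified
// against the sidecar's real configuration.

// Pure helper: turn env-style settings into a fetch() request description.
function buildChatRequest(env, prompt) {
  const baseUrl = (env.LANGEXTRACT_BASE_URL ?? "http://localhost:11434/v1").replace(/\/+$/, "");
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Placeholder key, matching the LANGEXTRACT_API_KEY value above.
        Authorization: `Bearer ${env.LANGEXTRACT_API_KEY ?? "ollama"}`,
      },
      body: JSON.stringify({
        model: env.LANGEXTRACT_MODEL ?? "llama3.1:8b",
        messages: [{ role: "user", content: prompt }],
        temperature: 0,
      }),
    },
  };
}

// Network helper: sends one prompt and returns the model's text reply.
async function smokeTest(env = process.env) {
  const { url, init } = buildChatRequest(env, 'Reply with the JSON object {"ok": true}.');
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Offline demonstration: show the URL that would be called with defaults.
console.log(buildChatRequest({}, "ping").url);
// http://localhost:11434/v1/chat/completions
```

If `smokeTest()` returns a reply, the base URL and model name are valid for any OpenAI-compatible client, which narrows later failures down to the sidecar's own configuration.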

Cost Comparison

| Provider | Cost per full run | Notes |
| --- | --- | --- |
| Gemini (via extraction-service) | ~$0.003–0.005 | gemini-2.5-flash |
| Ollama (local) | $0.00 | Fully offline after model download |

Recommended Models for Evals

| Model | JSON Quality | Speed | Notes |
| --- | --- | --- | --- |
| `llama3.1:8b` | Good | Fast | Default, reliable JSON output |
| `qwen2.5:7b` | Excellent | Fast | Best JSON structure compliance |
| `qwen2.5-coder:32b` | Excellent | Moderate | Best quality, slower |
| `phi4` | Good | Fast | Good reasoning for triage tasks |
