diff --git a/__LOCAL_LLMs/docs/06-extraction-service-evals.md b/__LOCAL_LLMs/docs/06-extraction-service-evals.md index f0221b41..d7366ec3 100644 --- a/__LOCAL_LLMs/docs/06-extraction-service-evals.md +++ b/__LOCAL_LLMs/docs/06-extraction-service-evals.md @@ -65,16 +65,29 @@ pnpm eval:compare ## Important: Assertion Pattern -Ollama returns a raw JSON **string** — assertions must parse it inline: +Ollama returns a raw JSON **string** — assertions must be single **expressions** (no `const`/`return` blocks): ```yaml -# ✅ Correct — parse the string first +# ✅ Correct — single expression with function(e){return ...} +- type: javascript + value: "JSON.parse(output).extractions.map(function(e){return e.extraction_class}).includes('action')" + +# ❌ Wrong — statement block causes SyntaxError: Unexpected token 'return' - type: javascript value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');" +``` -# ❌ Wrong — output is a string, not an object -- type: javascript - value: output.classes.includes('action') +### DeepSeek R1 — Strip `` blocks + +R1 models emit reasoning traces before JSON. Add a provider-level transform: + +```yaml +providers: + - id: openai:chat:deepseek-r1:32b + config: + apiBaseUrl: http://localhost:11434/v1 + apiKey: ollama + transform: "output.replace(/[\\s\\S]*?<\/think>/g, '').trim()" ``` --- @@ -94,20 +107,34 @@ export LANGEXTRACT_MODEL=llama3.1:8b --- -## Cost Comparison +## Latency & Cost Comparison -| Provider | Cost per full run | Notes | -| ----------------------------------- | ----------------- | ---------------------------------- | -| **Gemini** (via extraction-service) | ~$0.003–0.005 | gemini-2.5-flash | -| **Ollama** (local) | $0.00 | Fully offline after model download | +Measured on M4 Pro 48 GB, 19 eval cases, concurrency=4: + +| Provider | Model | Duration (19 cases) | Per case avg | Tokens | Cost per run | +| ---------------- | ------------------- | ------------------- | ------------ | ------ | ------------- | +| **Ollama local** | `llama3.1:8b` | ~1m 27s | ~4.6s | ~7,300 | **$0.00** | +| **Ollama local** | `qwen2.5-coder:32b` | ~5–8m (est.) | ~15–25s | ~7,300 | **$0.00** | +| **Ollama local** | `deepseek-r1:32b` | ~5–8m (est.) | ~15–25s | ~7,300 | **$0.00** | +| **Google Cloud** | `gemini-2.5-flash` | ~15–25s | ~1s | ~7,300 | ~$0.003–0.005 | +| **Azure OpenAI** | `gpt-4o` | ~20–40s | ~1–2s | ~7,300 | ~$0.05–0.15 | + +**Key takeaway:** Cloud models are 5–6x faster per request due to massive parallel GPU infrastructure. Local wins on cost, privacy, and no proxy/quota issues. --- ## Recommended Models for Evals -| Model | JSON Quality | Speed | Notes | -| ------------------- | ------------ | -------- | ------------------------------- | -| `llama3.1:8b` | Good | Fast | Default, reliable JSON output | -| `qwen2.5:7b` | Excellent | Fast | Best JSON structure compliance | -| `qwen2.5-coder:32b` | Excellent | Moderate | Best quality, slower | -| `phi4` | Good | Fast | Good reasoning for triage tasks | +| Model | Disk | JSON Quality | Reasoning | Speed | Pass Rate | Notes | +| ------------------- | ----- | ------------ | --------- | -------- | --------- | ------------------------------------------------ | +| `llama3.1:8b` | 4.9GB | Good | Basic | Fast | 19/19 ✅ | Default — tuned assertions for 8B behavior gaps | +| `qwen2.5-coder:32b` | 19GB | Excellent | Good | Moderate | TBD | Best JSON compliance, strong structured output | +| `deepseek-r1:32b` | 20GB | Good\* | Excellent | Moderate | TBD | \*Requires `` strip transform — see above | +| `qwen2.5:7b` | 5GB | Excellent | Basic | Fast | TBD | Best JSON structure at 7B size | +| `phi4:14b` | 9GB | Good | Good | Fast | TBD | Strong reasoning for triage tasks | + +Run any model with: + +```bash +OLLAMA_MODEL= ./evals/run-ollama-evals-logged.sh +```