docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management, idle timeout, manual load/unload, OpenAI-compatible API, native API reference, performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server, evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings
2026-02-19 13:01:05 -08:00
commit 80f794dee7
parent 464ffb92ec
3 changed files with 478 additions and 0 deletions
--- a/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
+++ b/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
@@ -0,0 +1,230 @@
+# 02 — Ollama Setup & Models
+
+> Installation, server configuration, model management, and memory behavior.
+
+---
+
+## Installation
+
+```bash
+brew install ollama
+```
+
+- **Version installed:** 0.16.2
+- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
+- **Models stored at:** `~/.ollama/models/`
+- **Config:** No config file — uses environment variables
+
+---
+
+## Starting the Server
+
+```bash
+# Option A: foreground (dev, see logs)
+ollama serve
+
+# Option B: background service (auto-start at login)
+brew services start ollama
+
+# Check if running
+curl http://localhost:11434/api/tags
+```
+
+**Server listens on:** `http://127.0.0.1:11434`
+
+### Corporate Proxy Note
+
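+Ollama reads the standard proxy variables from the environment of the server process. Model downloads are performed by the server (`ollama serve`), not by the `ollama pull` client, and they use HTTPS, so `HTTPS_PROXY` is the variable that matters. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically. If a pull fails, stop the running server, restart it with a proxy bypass, and retry the pull:
+
+```bash
+NO_PROXY="ollama.com,registry.ollama.ai" ollama serve   # then, in another shell: ollama pull <model>
+```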
+
+---
+
+## Models Currently Installed
+
+Verified 2026-02-19:
+
+| Model               | Size   | Pull Command                    | Use Case                                      |
+| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
+| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | Best coding model — Swift, TypeScript, Python |
+| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | Default for evals, fast inference             |
+
+### Useful Commands
+
+```bash
+# List all downloaded models (disk)
+ollama list
+
+# Show what's currently loaded in RAM
+ollama ps
+
+# Pull a new model (downloads to ~/.ollama/models/)
+ollama pull <model>
+
+# Run interactively
+ollama run <model>
+
+# Run with a one-shot prompt
+ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"
+
+# Remove a model from disk
+ollama rm <model>
+
+# Show model details (size, parameters, template)
+ollama show <model>
+```
+
+---
+
+## Memory Management
+
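+Ollama keeps a model in RAM only while it is in use or inside its keep-alive window. Recent versions can hold several models at once when they fit in memory; set `OLLAMA_MAX_LOADED_MODELS=1` to guarantee **one model at a time**, which matters on a 48 GB machine.
+
+### Key Behaviors
+
+1. **Models are stored on disk.** You can download as many as disk space allows.
+2. **Only requested models load into RAM.** When a new model does not fit alongside the loaded ones, idle models are evicted to make room.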
+3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
+4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately
+
+### Controlling Idle Timeout
+
+```bash
+# Default: 5 minutes
+ollama serve
+
+# Unload immediately after each request (saves RAM)
+OLLAMA_KEEP_ALIVE=0 ollama serve
+
+# Keep loaded for 30 minutes
+OLLAMA_KEEP_ALIVE=30m ollama serve
+
+# Keep loaded forever (until manual unload or server restart)
+OLLAMA_KEEP_ALIVE=-1 ollama serve
+```
+
+### Manual Load/Unload
+
+```bash
+# Load a model into RAM (empty prompt trick)
+curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'
+
+# Unload a model from RAM immediately
+curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
+```
+
+### How Many Models Can You Have Downloaded?
+
+As many as your disk allows. Only the loaded model consumes RAM. Plan for 10 models:
+
+| Count           | Approx Disk |
+| --------------- | ----------- |
+| 2 (current)     | ~24 GB      |
+| 5 (moderate)    | ~55 GB      |
+| 10 (full stack) | ~115 GB     |
+
+---
+
+## OpenAI-Compatible API
+
+Ollama exposes an OpenAI-compatible API, so existing OpenAI SDK clients work after changing only the base URL and key:
+
+```
+Base URL:  http://localhost:11434/v1
+API Key:   ollama  (any non-empty string)
+```
+
+### Example: curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### Example: Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await client.chat.completions.create({
+  model: 'llama3.1:8b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+```
+
+### Example: Python
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+```
+
+---
+
+## Native Ollama API
+
+Beyond the OpenAI-compatible endpoint, Ollama has its own API:
+
+| Endpoint          | Method | Purpose                             |
+| ----------------- | ------ | ----------------------------------- |
+| `/api/tags`       | GET    | List all downloaded models          |
+| `/api/ps`         | GET    | List models currently loaded in RAM |
+| `/api/generate`   | POST   | Generate text (single-turn)         |
+| `/api/chat`       | POST   | Chat completion (multi-turn)        |
+| `/api/pull`       | POST   | Download a model                    |
+| `/api/delete`     | DELETE | Remove a model from disk            |
+| `/api/show`       | POST   | Show model metadata                 |
+| `/api/embed`      | POST   | Generate embeddings                 |
+
+Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
+
+---
+
+## Performance on M4 Pro 48 GB
+
+- **MLX warning:** `MLX dynamic library not available` — **harmless**, falls back to Metal/CPU automatically
+- **Metal backend:** Inference runs on the Apple Silicon GPU through Metal; unified memory lets the GPU read model weights without a separate VRAM copy
+- **Inference speed estimates:**
+  - 7B models: ~40-60 tok/s
+  - 32B models: ~15-25 tok/s
+  - 70B (Q4): ~5-10 tok/s
+- **RAM usage (model loaded):**
+  - 7B: ~5-6 GB
+  - 32B: ~20-22 GB
+  - 70B (Q4): ~40-42 GB
+
+### Performance Tuning
+
+```bash
+# Enable flash attention (faster, less RAM)
+OLLAMA_FLASH_ATTENTION=1 ollama serve
+
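+# KV cache quantization (smaller RAM footprint; default cache type is f16).
+# The setting only takes effect when flash attention is enabled, so set both.
+# q4_0 is also accepted and saves more RAM at a larger quality cost.
+OLLAMA_FLASH_ATTENTION=1 \
+  OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve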
+
+# Allow concurrent requests (default: 1)
+OLLAMA_NUM_PARALLEL=2 ollama serve
+```
--- a/__LOCAL_LLMs/docs/06-extraction-service-evals.md
+++ b/__LOCAL_LLMs/docs/06-extraction-service-evals.md
@@ -0,0 +1,113 @@
+# 06 — Extraction Service Evals
+
+> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
+
+---
+
+## Overview
+
+The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
+
+- **Zero-cost iteration** on extraction prompts
+- **Side-by-side comparison** of local vs cloud model quality
+- **Offline development** when cloud APIs are unavailable
+
+---
+
+## Files
+
+| File                                                      | Purpose                                            |
+| --------------------------------------------------------- | -------------------------------------------------- |
+| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
+| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
+| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
+| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
+| `services/extraction-service/evals/README.md`             | Full usage docs                                    |
+
+---
+
+## Running Evals
+
+```bash
+cd services/extraction-service
+
+# Ollama only (no extraction-service needed)
+pnpm eval:ollama
+
+# Different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
+pnpm eval:compare
+```
+
+### Prerequisites
+
+- Ollama must be running (`ollama serve`)
+- A model must be available (`ollama pull llama3.1:8b`)
+- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
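+
+For orientation, a provider block that points promptfoo at Ollama's OpenAI-compatible endpoint might look like the sketch below. It is not the contents of `promptfoo.ollama.yaml`: the provider id, the hard-coded model and URL (the defaults of `OLLAMA_MODEL` and `OLLAMA_BASE_URL`), and the `temperature` setting are assumptions chosen for illustration.
+
+```yaml
+providers:
+  # promptfoo's OpenAI chat provider, redirected to the local Ollama server
+  - id: openai:chat:llama3.1:8b
+    config:
+      apiBaseUrl: http://localhost:11434/v1
+      apiKey: ollama # placeholder; Ollama ignores the value
+      temperature: 0 # assumption: deterministic output makes pass rates comparable
+      response_format:
+        type: json_object
+```
+
+Keeping the endpoint in the provider config rather than in each test case means the same 19 cases can be reused unchanged when the model or host changes.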
+
+---
+
+## Eval Coverage
+
+| Task                    | Cases | Key Assertions                                                  |
+| ----------------------- | ----- | --------------------------------------------------------------- |
+| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
+| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
+| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
+| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
+| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
+
+**Total: 19 cases, 50+ assertions**
+
+---
+
+## Important: Assertion Pattern
+
+Ollama returns a raw JSON **string** — assertions must parse it inline:
+
+```yaml
+# ✅ Correct: parse the string first (a single-line value must be one expression)
+- type: javascript
+  value: "JSON.parse(output).extractions.some(e => e.extraction_class === 'action')"
+
+# ❌ Wrong: output is a string, not an object
+- type: javascript
+  value: "output.extractions.some(e => e.extraction_class === 'action')"
+```
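+
+A slightly more defensive variant, as a sketch: promptfoo's built-in `is-json` assertion can fail fast on malformed output before the field check runs, and a multi-line `javascript` value is treated as a function body, so it needs an explicit `return`. The `extractions` and `extraction_class` field names are taken from the example above; the fallback to an empty array is an illustrative choice, not something the suite is known to do.
+
+```yaml
+assert:
+  # Fail early if the model did not return valid JSON at all
+  - type: is-json
+  # Multi-line values run as a function body and must return a boolean
+  - type: javascript
+    value: |
+      const parsed = JSON.parse(output);
+      return (parsed.extractions ?? []).some(
+        (e) => e.extraction_class === 'action'
+      );
+```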
+
+---
+
+## Pointing the Python Sidecar at Ollama
+
+The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+> Check `services/extraction-service/python/` for exact env var names — the sidecar config may use different keys depending on LangExtract version.
+
+---
+
+## Cost Comparison
+
+| Provider                            | Cost per full run | Notes                              |
+| ----------------------------------- | ----------------- | ---------------------------------- |
+| **Gemini** (via extraction-service) | ~$0.003–0.005     | gemini-2.5-flash                   |
+| **Ollama** (local)                  | $0.00             | Fully offline after model download |
+
+---
+
+## Recommended Models for Evals
+
+| Model               | JSON Quality | Speed    | Notes                           |
+| ------------------- | ------------ | -------- | ------------------------------- |
+| `llama3.1:8b`       | Good         | Fast     | Default, reliable JSON output   |
+| `qwen2.5:7b`        | Excellent    | Fast     | Best JSON structure compliance  |
+| `qwen2.5-coder:32b` | Excellent    | Moderate | Best quality, slower            |
+| `phi4`              | Good         | Fast     | Good reasoning for triage tasks |
--- a/__LOCAL_LLMs/docs/09-environment-variables.md
+++ b/__LOCAL_LLMs/docs/09-environment-variables.md
@@ -0,0 +1,135 @@
+# 09 — Environment Variables Reference
+
+> All configuration variables for Ollama, Whisper, dashboard, and evals.
+
+---
+
+## Ollama Server
+
+| Variable                   | Default                  | Purpose                                                |
+| -------------------------- | ------------------------ | ------------------------------------------------------ |
+| `OLLAMA_HOST`              | `http://127.0.0.1:11434` | Server bind address                                    |
+| `OLLAMA_MODELS`            | `~/.ollama/models`       | Model storage path                                     |
+| `OLLAMA_KEEP_ALIVE`        | `5m`                     | How long to keep model loaded after last request       |
+| `OLLAMA_FLASH_ATTENTION`   | `false`                  | Enable flash attention (faster, less RAM)              |
+| `OLLAMA_KV_CACHE_TYPE`     | `f16`                    | KV cache type (`q8_0`, `q4_0`); needs flash attention  |
+| `OLLAMA_NUM_PARALLEL`      | `1`                      | Number of concurrent requests                          |
+| `OLLAMA_MAX_LOADED_MODELS` | _(auto: 3 per GPU)_      | Max models loaded in RAM simultaneously                |
+| `OLLAMA_GPU_OVERHEAD`      | `0`                      | Reserved GPU memory (bytes)                            |
+| `OLLAMA_ORIGINS`           | _(localhost only)_       | Additional allowed CORS origins                        |
+| `OLLAMA_DEBUG`             | `false`                  | Enable debug logging                                   |
+| `HTTP_PROXY`               | _(system)_               | HTTP proxy; not used for pulls, which are HTTPS-only   |
+| `HTTPS_PROXY`              | _(system)_               | HTTPS proxy for model downloads                        |
+| `NO_PROXY`                 | _(none)_                 | Hosts to bypass proxy                                  |
+
+### Performance Tuning Combo
+
+```bash
+OLLAMA_FLASH_ATTENTION=1 \
+OLLAMA_KV_CACHE_TYPE=q8_0 \
+OLLAMA_NUM_PARALLEL=2 \
+OLLAMA_KEEP_ALIVE=10m \
+ollama serve
+```
+
+---
+
+## Extraction Service Evals (promptfoo)
+
+| Variable                | Default                     | Purpose                                 |
+| ----------------------- | --------------------------- | --------------------------------------- |
+| `OLLAMA_MODEL`          | `llama3.1:8b`               | Model used by `pnpm eval:ollama`        |
+| `OLLAMA_BASE_URL`       | `http://localhost:11434/v1` | OpenAI-compat endpoint for promptfoo    |
+| `EXTRACTION_EVAL_TOKEN` | _(none)_                    | Auth token for extraction-service evals |
+
+### Usage
+
+```bash
+# Run evals with a different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama
+EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
+```
+
+---
+
+## Python Sidecar (LangExtract)
+
+| Variable               | Default          | Purpose                                       |
+| ---------------------- | ---------------- | --------------------------------------------- |
+| `LANGEXTRACT_PROVIDER` | `gemini`         | Switch to `openai_compat` for Ollama          |
+| `LANGEXTRACT_BASE_URL` | _(Gemini)_       | Set to `http://localhost:11434/v1` for Ollama |
+| `LANGEXTRACT_API_KEY`  | _(Gemini key)_   | Set to `ollama` for local                     |
+| `LANGEXTRACT_MODEL`    | _(Gemini model)_ | Set to `llama3.1:8b` or preferred model       |
+
+### Switch to Ollama
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+---
+
+## Mission Control Dashboard
+
+| Variable     | Default                  | Purpose                                |
+| ------------ | ------------------------ | -------------------------------------- |
+| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL (used by API routes) |
+| `PORT`       | `3100`                   | Dashboard dev server port              |
+
+### Start with Custom Ollama URL
+
+```bash
+OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
+```
+
+---
+
+## Whisper.cpp
+
+Whisper.cpp uses CLI flags rather than environment variables:
+
+| Flag              | Purpose                       | Example                                            |
+| ----------------- | ----------------------------- | -------------------------------------------------- |
+| `--model`         | Path to GGML model file       | `--model ~/whisper-models/ggml-large-v3-turbo.bin` |
+| `--language`      | Input language                | `--language en`                                    |
+| `--file`          | Audio file path               | `--file recording.wav`                             |
+| `--output-json`   | Output in JSON format         | `--output-json`                                    |
+| `--output-srt`    | Output as SRT subtitles       | `--output-srt`                                     |
+| `--output-vtt`    | Output as VTT subtitles       | `--output-vtt`                                     |
+| `--translate`     | Translate to English          | `--translate`                                      |
+| `--threads`       | Number of CPU threads         | `--threads 8`                                      |
+| `--processors`    | Number of processors          | `--processors 1`                                   |
+| `--print-colors`  | Colorize output by confidence | `--print-colors`                                   |
+| `--no-timestamps` | Omit timestamps               | `--no-timestamps`                                  |
+| `--port`          | Server port (whisper-server)  | `--port 8080`                                      |
+
+---
+
+## Proxy / Network (Corporate)
+
+| Variable                       | Value on This Machine            | Purpose                                                                  |
+| ------------------------------ | -------------------------------- | ------------------------------------------------------------------------ |
+| `HTTP_PROXY`                   | `http://cso.proxy.att.com:8080/` | Corporate HTTP proxy                                                     |
+| `HTTPS_PROXY`                  | `http://cso.proxy.att.com:8080/` | Corporate HTTPS proxy                                                    |
+| `NODE_TLS_REJECT_UNAUTHORIZED` | `0`                              | Disables Node.js TLS verification for Forcepoint interception (insecure) |
+| `NO_PROXY`                     | _(not set by default)_           | Add `ollama.com,registry.ollama.ai` if pulls fail                        |
+
+---
+
+## All Paths
+
+| Path                                 | Content                     |
+| ------------------------------------ | --------------------------- |
+| `~/.ollama/models/`                  | Downloaded Ollama models    |
+| `~/whisper-models/`                  | Whisper GGML model files    |
+| `/opt/homebrew/bin/ollama`           | Ollama binary               |
+| `/opt/homebrew/bin/whisper-cli`      | Whisper CLI binary          |
+| `/opt/homebrew/bin/ffmpeg`           | FFmpeg binary               |
+| `__LOCAL_LLMs/dashboard/`            | Mission Control Next.js app |
+| `__LOCAL_LLMs/docs/`                 | This documentation          |
+| `services/extraction-service/evals/` | Promptfoo eval configs      |