docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management, idle timeout, manual load/unload, OpenAI-compatible API, native API reference, performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server, evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings
2026-02-19 13:01:05 -08:00
commit 80f794dee7
parent 464ffb92ec
3 changed files with 478 additions and 0 deletions
--- a/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
+++ b/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
@@ -0,0 +1,230 @@
 # 02 — Ollama Setup & Models
 > Installation, server configuration, model management, and memory behavior.
 ---
 ## Installation
 ```bash
 brew install ollama
 ```
 - **Version installed:** 0.16.2
 - **Binary:** `/opt/homebrew/opt/ollama/bin/ollama` (also linked at `/opt/homebrew/bin/ollama`)
 - **Models stored at:** `~/.ollama/models/`
 - **Config:** No config file — uses environment variables
 ---
 ## Starting the Server
 ```bash
 # Option A: foreground (dev, see logs)
 ollama serve
 # Option B: background service (auto-start at login)
 brew services start ollama
 # Check if running
 curl http://localhost:11434/api/tags
 ```
 **Server listens on:** `http://127.0.0.1:11434`
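The `curl` check above can also be scripted. The following is a minimal Python sketch, assuming the default address `http://127.0.0.1:11434` and the `models[].name` shape of the `/api/tags` response. The helper names `parse_model_names` and `list_models` are illustrative, not part of Ollama. It ignores any configured proxy so the localhost request is not routed through the corporate proxy.

```python
import json
import urllib.request


def parse_model_names(tags_body: str) -> list[str]:
    """Return model names from an /api/tags JSON response body."""
    return [m["name"] for m in json.loads(tags_body).get("models", [])]


def list_models(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0):
    """Return downloaded model names, or None if the server is unreachable."""
    # An empty ProxyHandler ignores HTTP_PROXY/HTTPS_PROXY for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
```

Calling `list_models()` returns a list such as the two installed models when the server is up, and `None` when nothing answers on the port.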
 ### Corporate Proxy Note
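 Ollama reads `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically and model downloads go through it. Downloads are performed by the server process, so if a pull fails, set `NO_PROXY` where `ollama serve` runs:
 ```bash
 # Set on the server process, then retry `ollama pull <model>` from another terminal
 NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
 ```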
 ---
 ## Models Currently Installed
 Verified 2026-02-19:
 | Model               | Size   | Pull Command                    | Use Case                                      |
 | ------------------- | ------ | ------------------------------- | --------------------------------------------- |
 | `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | Best coding model — Swift, TypeScript, Python |
 | `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | Default for evals, fast inference             |
 ### Useful Commands
 ```bash
 # List all downloaded models (disk)
 ollama list
 # Show what's currently loaded in RAM
 ollama ps
 # Pull a new model (downloads to ~/.ollama/models/)
 ollama pull <model>
 # Run interactively
 ollama run <model>
 # Run with a one-shot prompt
 ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"
 # Remove a model from disk
 ollama rm <model>
 # Show model details (size, parameters, template)
 ollama show <model>
 ```
 ---
 ## Memory Management
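 Ollama keeps a model in RAM only while it is in use or inside its keep-alive window, and `OLLAMA_MAX_LOADED_MODELS` caps how many models can be resident at once. On a 48 GB machine, treat it as **one model at a time**: a 32B model alone takes roughly 20 GB.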
 ### Key Behaviors
 1. **Models are stored on disk** — you can download as many as disk allows
 2. **Only the active model loads into RAM** — previous model is evicted when switching
 3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
 4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately
 ### Controlling Idle Timeout
 ```bash
 # Default: 5 minutes
 ollama serve
 # Unload immediately after each request (saves RAM)
 OLLAMA_KEEP_ALIVE=0 ollama serve
 # Keep loaded for 30 minutes
 OLLAMA_KEEP_ALIVE=30m ollama serve
 # Keep loaded forever (until manual unload or server restart)
 OLLAMA_KEEP_ALIVE=-1 ollama serve
 ```
 ### Manual Load/Unload
 ```bash
 # Load a model into RAM (empty prompt trick)
 curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'
 # Unload a model from RAM immediately
 curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
 ```
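The same load/unload requests can be sent from Python. This is a sketch under stated assumptions: it uses only the standard library, the function names are hypothetical, and it adds `"stream": false` (not present in the `curl` calls above) so the server answers with a single JSON object.

```python
import json
import urllib.request


def residency_payload(model: str, keep_alive: str) -> dict:
    """Body for /api/generate that only changes how long the model stays loaded."""
    # An empty prompt generates nothing; "stream": False asks for one JSON reply.
    return {"model": model, "prompt": "", "keep_alive": keep_alive, "stream": False}


def set_residency(model: str, keep_alive: str,
                  base_url: str = "http://localhost:11434") -> dict:
    """Load (e.g. "10m") or unload ("0") a model and return the server's reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(residency_payload(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Empty ProxyHandler: keep the localhost call away from any HTTP(S)_PROXY.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `set_residency("qwen2.5-coder:32b", "10m")` preloads the model and `set_residency("qwen2.5-coder:32b", "0")` evicts it; `ollama ps` shows the effect.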
 ### How Many Models Can You Have Downloaded?
 As many as your disk allows; only the loaded model consumes RAM. Rough disk budget, planning up to 10 models:
 | Count           | Approx Disk |
 | --------------- | ----------- |
 | 2 (current)     | ~24 GB      |
 | 5 (moderate)    | ~55 GB      |
 | 10 (full stack) | ~115 GB     |
 ---
 ## OpenAI-Compatible API
 Ollama exposes an OpenAI-compatible API, so existing OpenAI SDK clients work with only a base URL change:
 ```
 Base URL:  http://localhost:11434/v1
 API Key:   ollama  (any non-empty string)
 ```
 ### Example: curl
 ```bash
 curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
 ```
 ### Example: Node.js (OpenAI SDK)
 ```typescript
 import OpenAI from 'openai';
 const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
 });
 const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
 });
 ```
 ### Example: Python
 ```python
 from openai import OpenAI
 client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
 response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
 )
 ```
 ---
 ## Native Ollama API
 Beyond the OpenAI-compatible endpoint, Ollama has its own API:
 | Endpoint          | Method | Purpose                             |
 | ----------------- | ------ | ----------------------------------- |
 | `/api/tags`       | GET    | List all downloaded models          |
 | `/api/ps`         | GET    | List models currently loaded in RAM |
 | `/api/generate`   | POST   | Generate text (single-turn)         |
 | `/api/chat`       | POST   | Chat completion (multi-turn)        |
 | `/api/pull`       | POST   | Download a model                    |
 | `/api/delete`     | DELETE | Remove a model from disk            |
 | `/api/show`       | POST   | Show model metadata                 |
 | `/api/embed`      | POST   | Generate embeddings                 |
 | `/api/embeddings` | POST   | Generate embeddings (legacy)        |
 Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
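As an illustration of the native API, here is a minimal Python sketch for a non-streaming `/api/chat` call. It assumes the documented request fields `messages`, `format` and `stream`, and the `message.content` field of the response; the function names are made up for this example.

```python
import json
import urllib.request


def chat_payload(model: str, user_text: str) -> dict:
    """Body for the native /api/chat endpoint (non-streaming, JSON output)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "format": "json",  # ask the model for valid JSON
        "stream": False,   # one JSON object instead of a stream of chunks
    }


def reply_text(response_body: str) -> str:
    """Pull the assistant text out of a non-streaming /api/chat response."""
    return json.loads(response_body)["message"]["content"]


def chat(model: str, user_text: str, base_url: str = "http://localhost:11434") -> str:
    """Send one user message and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(chat_payload(model, user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Empty ProxyHandler: do not route the localhost call through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=300) as resp:
        return reply_text(resp.read().decode("utf-8"))
```

The reply is still a string; callers that need structured data should pass it through `json.loads`.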
 ---
 ## Performance on M4 Pro 48 GB
 - **MLX warning:** `MLX dynamic library not available` — **harmless**, falls back to Metal/CPU automatically
 - **Metal backend:** Inference runs on the Apple Silicon GPU through Metal; unified memory lets the GPU read model weights directly from system RAM
 - **Inference speed estimates:**
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
 - **RAM usage (model loaded):**
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB
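The RAM figures above translate into a simple headroom calculation. The sketch below uses the upper end of each estimate and the 48 GB total; the `reserved_gb` value for the OS and other apps is a hypothetical input, and context length adds memory on top of these figures.

```python
# Upper-end "model loaded" estimates from the list above, in GB.
LOADED_RAM_GB = {"7B": 6, "32B": 22, "70B-Q4": 42}
TOTAL_RAM_GB = 48


def headroom_gb(model_class: str, reserved_gb: int = 0) -> int:
    """GB left after loading one model and reserving memory for everything else."""
    return TOTAL_RAM_GB - reserved_gb - LOADED_RAM_GB[model_class]


def fits_together(model_classes: list[str], reserved_gb: int = 0) -> bool:
    """True if all listed models fit in RAM at once alongside the reservation."""
    total = sum(LOADED_RAM_GB[m] for m in model_classes)
    return total + reserved_gb <= TOTAL_RAM_GB


print(headroom_gb("70B-Q4"))                          # 6
print(headroom_gb("32B", reserved_gb=8))              # 18
print(fits_together(["32B", "7B"], reserved_gb=8))    # True
print(fits_together(["70B-Q4", "7B"], reserved_gb=8)) # False
```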
 ### Performance Tuning
 ```bash
 # Enable flash attention (faster, less RAM)
 OLLAMA_FLASH_ATTENTION=1 ollama serve
 # KV cache quantization (smaller RAM footprint; requires flash attention)
 OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
 # Allow concurrent requests (default: 1)
 OLLAMA_NUM_PARALLEL=2 ollama serve
 ```
--- a/__LOCAL_LLMs/docs/06-extraction-service-evals.md
+++ b/__LOCAL_LLMs/docs/06-extraction-service-evals.md
@@ -0,0 +1,113 @@
 # 06 — Extraction Service Evals
 > Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
 ---
 ## Overview
 The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
 - **Zero-cost iteration** on extraction prompts
 - **Side-by-side comparison** of local vs cloud model quality
 - **Offline development** when cloud APIs are unavailable
 ---
 ## Files
 | File                                                      | Purpose                                            |
 | --------------------------------------------------------- | -------------------------------------------------- |
 | `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
 | `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
 | `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
 | `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
 | `services/extraction-service/evals/README.md`             | Full usage docs                                    |
 ---
 ## Running Evals
 ```bash
 cd services/extraction-service
 # Ollama only (no extraction-service needed)
 pnpm eval:ollama
 # Different model
 OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
 # Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
 pnpm eval:compare
 ```
 ### Prerequisites
 - Ollama must be running (`ollama serve`)
 - A model must be available (`ollama pull llama3.1:8b`)
 - For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
 ---
 ## Eval Coverage
 | Task                    | Cases | Key Assertions                                                  |
 | ----------------------- | ----- | --------------------------------------------------------------- |
 | `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
 | `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
 | `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
 | `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
 | `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
 **Total: 19 cases, 50+ assertions**
 ---
 ## Important: Assertion Pattern
 Ollama returns a raw JSON **string** — assertions must parse it inline:
 ```yaml
 # ✅ Correct: parse the string first
 - type: javascript
   value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"
 # ❌ Wrong: output is a string, not an object
 - type: javascript
   value: output.classes.includes('action')
 ```
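The same parse-first rule applies to any script that post-processes eval output. Below is a small Python sketch of the check, assuming the `extractions[].extraction_class` shape used in the assertion above. The sample string is invented for illustration and is not taken from the golden fixtures.

```python
import json


def has_extraction_class(output: str, wanted: str) -> bool:
    """True if the raw JSON string contains an extraction of the wanted class."""
    result = json.loads(output)  # parse first: the model output is a string
    return any(
        e.get("extraction_class") == wanted
        for e in result.get("extractions", [])
    )


# Hypothetical model output, as the raw string an assertion would receive.
sample_output = '{"extractions": [{"extraction_class": "action"}]}'

print(has_extraction_class(sample_output, "action"))    # True
print(has_extraction_class(sample_output, "deadline"))  # False
# Skipping the parse fails: a str has no "extractions" attribute.
print(hasattr(sample_output, "extractions"))            # False
```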
 ---
 ## Pointing the Python Sidecar at Ollama
 The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
 ```bash
 export LANGEXTRACT_PROVIDER=openai_compat
 export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
 export LANGEXTRACT_API_KEY=ollama
 export LANGEXTRACT_MODEL=llama3.1:8b
 ```
 > Check `services/extraction-service/python/` for exact env var names — the sidecar config may use different keys depending on LangExtract version.
 ---
 ## Cost Comparison
 | Provider                            | Cost per full run | Notes                              |
 | ----------------------------------- | ----------------- | ---------------------------------- |
 | **Gemini** (via extraction-service) | ~$0.003–0.005     | gemini-2.5-flash                   |
 | **Ollama** (local)                  | $0.00             | Fully offline after model download |
 ---
 ## Recommended Models for Evals
 | Model               | JSON Quality | Speed    | Notes                           |
 | ------------------- | ------------ | -------- | ------------------------------- |
 | `llama3.1:8b`       | Good         | Fast     | Default, reliable JSON output   |
 | `qwen2.5:7b`        | Excellent    | Fast     | Best JSON structure compliance  |
 | `qwen2.5-coder:32b` | Excellent    | Moderate | Best quality, slower            |
 | `phi4`              | Good         | Fast     | Good reasoning for triage tasks |
--- a/__LOCAL_LLMs/docs/09-environment-variables.md
+++ b/__LOCAL_LLMs/docs/09-environment-variables.md
@@ -0,0 +1,135 @@
 # 09 — Environment Variables Reference
 > All configuration variables for Ollama, Whisper, dashboard, and evals.
 ---
 ## Ollama Server
 | Variable                   | Default                  | Purpose                                                |
 | -------------------------- | ------------------------ | ------------------------------------------------------ |
 | `OLLAMA_HOST`              | `http://127.0.0.1:11434` | Server bind address                                    |
 | `OLLAMA_MODELS`            | `~/.ollama/models`       | Model storage path                                     |
 | `OLLAMA_KEEP_ALIVE`        | `5m`                     | How long to keep model loaded after last request       |
 | `OLLAMA_FLASH_ATTENTION`   | `false`                  | Enable flash attention (faster, less RAM)              |
 | `OLLAMA_KV_CACHE_TYPE`     | `f16`                    | KV cache type; `q8_0` = smaller, needs flash attention |
 | `OLLAMA_NUM_PARALLEL`      | `1`                      | Number of concurrent requests                          |
 | `OLLAMA_MAX_LOADED_MODELS` | _(auto: 3 per GPU)_      | Max models in RAM at once (set `1` for one at a time)  |
 | `OLLAMA_GPU_OVERHEAD`      | `0`                      | Reserved GPU memory (bytes)                            |
 | `OLLAMA_ORIGINS`           | _(localhost only)_       | Allowed CORS origins (comma-separated; `*` allows any) |
 | `OLLAMA_DEBUG`             | `false`                  | Enable debug logging                                   |
 | `HTTP_PROXY`               | _(system)_               | HTTP proxy (model pulls use HTTPS, so rarely needed)   |
 | `HTTPS_PROXY`              | _(system)_               | HTTPS proxy for model downloads                        |
 | `NO_PROXY`                 | _(none)_                 | Hosts to bypass proxy                                  |
 ### Performance Tuning Combo
 ```bash
 OLLAMA_FLASH_ATTENTION=1 \
 OLLAMA_KV_CACHE_TYPE=q8_0 \
 OLLAMA_NUM_PARALLEL=2 \
 OLLAMA_KEEP_ALIVE=10m \
 ollama serve
 ```
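When the server is launched from a script rather than a shell, the same combination can be passed as a process environment. This is a minimal sketch, assuming `ollama` is on `PATH`; `tuned_env` and `start_server` are hypothetical helper names.

```python
import os
import subprocess

# The tuning values from the shell example above.
TUNING = {
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_NUM_PARALLEL": "2",
    "OLLAMA_KEEP_ALIVE": "10m",
}


def tuned_env(base=None) -> dict:
    """Return a copy of the environment with the tuning variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(TUNING)
    return env


def start_server(env: dict) -> subprocess.Popen:
    """Start `ollama serve` in the background with the given environment."""
    return subprocess.Popen(["ollama", "serve"], env=env)
```

`start_server(tuned_env())` starts the server and returns a process handle; call `.terminate()` on it to stop the server.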
 ---
 ## Extraction Service Evals (promptfoo)
 | Variable                | Default                     | Purpose                                 |
 | ----------------------- | --------------------------- | --------------------------------------- |
 | `OLLAMA_MODEL`          | `llama3.1:8b`               | Model used by `pnpm eval:ollama`        |
 | `OLLAMA_BASE_URL`       | `http://localhost:11434/v1` | OpenAI-compat endpoint for promptfoo    |
 | `EXTRACTION_EVAL_TOKEN` | _(none)_                    | Auth token for extraction-service evals |
 ### Usage
 ```bash
 # Run evals with a different model
 OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
 # Compare Gemini vs Ollama
 EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
 ```
 ---
 ## Python Sidecar (LangExtract)
 | Variable               | Default          | Purpose                                       |
 | ---------------------- | ---------------- | --------------------------------------------- |
 | `LANGEXTRACT_PROVIDER` | `gemini`         | Switch to `openai_compat` for Ollama          |
 | `LANGEXTRACT_BASE_URL` | _(Gemini)_       | Set to `http://localhost:11434/v1` for Ollama |
 | `LANGEXTRACT_API_KEY`  | _(Gemini key)_   | Set to `ollama` for local                     |
 | `LANGEXTRACT_MODEL`    | _(Gemini model)_ | Set to `llama3.1:8b` or preferred model       |
 ### Switch to Ollama
 ```bash
 export LANGEXTRACT_PROVIDER=openai_compat
 export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
 export LANGEXTRACT_API_KEY=ollama
 export LANGEXTRACT_MODEL=llama3.1:8b
 ```
 ---
 ## Mission Control Dashboard
 | Variable     | Default                  | Purpose                                |
 | ------------ | ------------------------ | -------------------------------------- |
 | `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL (used by API routes) |
 | `PORT`       | `3100`                   | Dashboard dev server port              |
 ### Start with Custom Ollama URL
 ```bash
 OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
 ```
 ---
 ## Whisper.cpp
 Whisper.cpp uses CLI flags rather than environment variables:
 | Flag              | Purpose                       | Example                                            |
 | ----------------- | ----------------------------- | -------------------------------------------------- |
 | `--model`         | Path to GGML model file       | `--model ~/whisper-models/ggml-large-v3-turbo.bin` |
 | `--language`      | Input language                | `--language en`                                    |
 | `--file`          | Audio file path               | `--file recording.wav`                             |
 | `--output-json`   | Output in JSON format         | `--output-json`                                    |
 | `--output-srt`    | Output as SRT subtitles       | `--output-srt`                                     |
 | `--output-vtt`    | Output as VTT subtitles       | `--output-vtt`                                     |
 | `--translate`     | Translate to English          | `--translate`                                      |
 | `--threads`       | Number of CPU threads         | `--threads 8`                                      |
 | `--processors`    | Number of processors          | `--processors 1`                                   |
 | `--print-colors`  | Colorize output by confidence | `--print-colors`                                   |
 | `--no-timestamps` | Omit timestamps               | `--no-timestamps`                                  |
 | `--port`          | Server port (whisper-server)  | `--port 8080`                                      |
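Because Whisper.cpp is configured through flags, a wrapper typically assembles an argument list. The sketch below builds one from the flags in the table, assuming the binary is invoked as `whisper-cli`; the function name and its defaults are illustrative.

```python
import os
import shlex


def whisper_command(audio_path: str, model_path: str, language: str = "en",
                    threads: int = 8, output_json: bool = True,
                    binary: str = "whisper-cli") -> list[str]:
    """Build a whisper-cli argument list from the flags in the table above."""
    cmd = [
        binary,
        "--model", os.path.expanduser(model_path),  # expand a leading "~"
        "--language", language,
        "--threads", str(threads),
        "--file", audio_path,
    ]
    if output_json:
        cmd.append("--output-json")
    return cmd


# Show the shell form of the command without running it.
print(shlex.join(whisper_command("recording.wav", "/models/ggml-large-v3-turbo.bin")))
```

Pass the returned list to `subprocess.run(cmd, check=True)` to execute it; a list avoids shell-quoting problems with paths that contain spaces.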
 ---
 ## Proxy / Network (Corporate)
 | Variable                       | Value on This Machine            | Purpose                                           |
 | ------------------------------ | -------------------------------- | ------------------------------------------------- |
 | `HTTP_PROXY`                   | `http://cso.proxy.att.com:8080/` | Corporate HTTP proxy                              |
 | `HTTPS_PROXY`                  | `http://cso.proxy.att.com:8080/` | Corporate HTTPS proxy                             |
 | `NODE_TLS_REJECT_UNAUTHORIZED` | `0`                              | Bypass Forcepoint SSL interception for Node.js    |
 | `NO_PROXY`                     | _(not set by default)_           | Add `ollama.com,registry.ollama.ai` if pulls fail |
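Adding the registry hosts to `NO_PROXY` should not clobber entries that are already set. A small Python sketch of that merge follows, assuming the variable is applied to the environment of the process that performs the download; the helper names are hypothetical.

```python
import os


def merge_no_proxy(current: str, extra_hosts: list[str]) -> str:
    """Append hosts to a comma-separated NO_PROXY value, skipping duplicates."""
    hosts = [h.strip() for h in current.split(",") if h.strip()]
    for host in extra_hosts:
        if host not in hosts:
            hosts.append(host)
    return ",".join(hosts)


def bypass_env(extra_hosts=("ollama.com", "registry.ollama.ai")) -> dict:
    """Copy of the current environment with the registry hosts in NO_PROXY."""
    env = dict(os.environ)
    env["NO_PROXY"] = merge_no_proxy(env.get("NO_PROXY", ""), list(extra_hosts))
    return env
```

The result of `bypass_env()` can be passed as `env=` to `subprocess.Popen(["ollama", "serve"], ...)`.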
 ---
 ## All Paths
 | Path                                 | Content                     |
 | ------------------------------------ | --------------------------- |
 | `~/.ollama/models/`                  | Downloaded Ollama models    |
 | `~/whisper-models/`                  | Whisper GGML model files    |
 | `/opt/homebrew/bin/ollama`           | Ollama binary               |
 | `/opt/homebrew/bin/whisper-cli`      | Whisper CLI binary          |
 | `/opt/homebrew/bin/ffmpeg`           | FFmpeg binary               |
 | `__LOCAL_LLMs/dashboard/`            | Mission Control Next.js app |
 | `__LOCAL_LLMs/docs/`                 | This documentation          |
 | `services/extraction-service/evals/` | Promptfoo eval configs      |