docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management,
  idle timeout, manual load/unload, OpenAI-compatible API, native API reference,
  performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases
  across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server,
  evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings
saravanakumardb1 2026-02-19 13:01:05 -08:00
parent 464ffb92ec
commit 80f794dee7
3 changed files with 478 additions and 0 deletions

@ -0,0 +1,230 @@
# 02 — Ollama Setup & Models
> Installation, server configuration, model management, and memory behavior.
---
## Installation
```bash
brew install ollama
```
- **Version installed:** 0.16.2
- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
- **Models stored at:** `~/.ollama/models/`
- **Config:** No config file — uses environment variables
---
## Starting the Server
```bash
# Option A: foreground (dev, see logs)
ollama serve
# Option B: background service (auto-start at login)
brew services start ollama
# Check if running
curl http://localhost:11434/api/tags
```
**Server listens on:** `http://127.0.0.1:11434`
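The `curl` check above can also be scripted, which helps in setup scripts that need to know which models are present. This is a minimal sketch, assuming the `/api/tags` response has the shape `{"models": [{"name": ...}]}` described in the Ollama API docs; `installed_models` and `fetch_installed_models` are hypothetical helper names, not part of Ollama.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list:
    """Return model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_installed_models(base_url: str = "http://127.0.0.1:11434") -> list:
    """Query a running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return installed_models(json.load(resp))
```

Calling `fetch_installed_models()` while `ollama serve` is running should return the same names that `ollama list` prints.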
### Corporate Proxy Note
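Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically, and model downloads go through it. The server process performs the download, not the `ollama pull` client, so if a pull fails, restart the server with the registry hosts excluded from the proxy and then retry `ollama pull <model>`:
```bash
# Exclude the registry hosts from the proxy for the server process
NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
```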
---
## Models Currently Installed
Verified 2026-02-19:
| Model | Size | Pull Command | Use Case |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `qwen2.5-coder:32b` | 19 GB | `ollama pull qwen2.5-coder:32b` | Best coding model — Swift, TypeScript, Python |
| `llama3.1:8b` | 4.9 GB | `ollama pull llama3.1:8b` | Default for evals, fast inference |
### Useful Commands
```bash
# List all downloaded models (disk)
ollama list
# Show what's currently loaded in RAM
ollama ps
# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>
# Run interactively
ollama run <model>
# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"
# Remove a model from disk
ollama rm <model>
# Show model details (size, parameters, template)
ollama show <model>
```
---
## Memory Management
Ollama loads **one model at a time** into RAM by default. This is critical for a 48 GB machine.
### Key Behaviors
1. **Models are stored on disk** — you can download as many as disk allows
2. **Only the active model loads into RAM** — previous model is evicted when switching
3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately
### Controlling Idle Timeout
```bash
# Default: 5 minutes
ollama serve
# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve
# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve
# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve
```
### Manual Load/Unload
```bash
# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'
# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
```
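The same load/unload requests can be built from Python. This sketch only mirrors the two `curl` commands above (empty prompt plus `keep_alive`); `keep_alive_request` is a hypothetical helper name, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def keep_alive_request(model: str, keep_alive: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST to /api/generate with an empty prompt, as in the curl commands."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": keep_alive})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load = keep_alive_request("qwen2.5-coder:32b", "10m")   # load and keep for 10 minutes
unload = keep_alive_request("qwen2.5-coder:32b", "0")   # unload immediately
# Send with urllib.request.urlopen(load) while the server is running.
```

After sending either request, `ollama ps` shows whether the model is resident.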
### How Many Models Can You Have Downloaded?
As many as your disk allows; only the loaded model consumes RAM. Approximate disk usage by model count:
| Count | Approx Disk |
| --------------- | ----------- |
| 2 (current) | ~24 GB |
| 5 (moderate) | ~55 GB |
| 10 (full stack) | ~115 GB |
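The arithmetic behind these figures is easy to check. The sketch below sums the two installed model sizes from the table earlier in this document and derives the average size per model that the planning rows imply; the variable names are illustrative only.

```python
# Sizes in GB, taken from the "Models Currently Installed" table.
installed_gb = {"qwen2.5-coder:32b": 19.0, "llama3.1:8b": 4.9}

current_total = sum(installed_gb.values())
print(f"current: {current_total:.1f} GB")

# Average size per model implied by the planning rows (total GB / count).
for count, total_gb in [(5, 55), (10, 115)]:
    print(f"{count} models: {total_gb / count:.1f} GB per model on average")
```

The two installed models add up to about 23.9 GB, which the table rounds to ~24 GB; the larger rows assume roughly 11 GB per model.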
---
## OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API that existing OpenAI clients can use as a drop-in endpoint:
```
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
```
### Example: curl
```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
"response_format": {"type": "json_object"}
}'
```
### Example: Node.js (OpenAI SDK)
```typescript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama',
});
const res = await client.chat.completions.create({
model: 'llama3.1:8b',
messages: [{ role: 'user', content: 'Extract action items from: ...' }],
response_format: { type: 'json_object' },
});
```
### Example: Python
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Extract action items from: ..."}],
response_format={"type": "json_object"},
)
```
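Even with `response_format` set to `json_object`, the message content arrives as a string and still has to be decoded. A minimal sketch, assuming the standard chat-completions response shape (`choices[0].message.content`); `parse_json_reply` is a hypothetical helper, and the sample dict is hand-written rather than real model output.

```python
import json


def parse_json_reply(completion: dict) -> dict:
    """Decode the assistant message of a /v1/chat/completions response body."""
    content = completion["choices"][0]["message"]["content"]
    return json.loads(content)  # raises json.JSONDecodeError on malformed output


# Hand-written sample mirroring the response shape; not real model output.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "{\"hello\": \"world\"}"}}
    ]
}
print(parse_json_reply(sample))
```

With the SDK clients shown above, the equivalent string is `response.choices[0].message.content`.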
---
## Native Ollama API
Beyond the OpenAI-compatible endpoint, Ollama has its own API:
| Endpoint | Method | Purpose |
| ----------------- | ------ | ----------------------------------- |
| `/api/tags` | GET | List all downloaded models |
| `/api/ps` | GET | List models currently loaded in RAM |
| `/api/generate` | POST | Generate text (single-turn) |
| `/api/chat` | POST | Chat completion (multi-turn) |
| `/api/pull` | POST | Download a model |
| `/api/delete` | DELETE | Remove a model from disk |
| `/api/show` | POST | Show model metadata |
| `/api/embeddings` | POST | Generate embeddings |
Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
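To illustrate the difference between `/api/generate` and `/api/chat`, here is a sketch of a multi-turn request against the native chat endpoint. It assumes the documented request fields `model`, `messages`, and `stream`; `chat_request` and the sample conversation are hypothetical.

```python
import json
import urllib.request


def chat_request(model: str, messages: list,
                 base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/chat endpoint."""
    body = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Earlier turns are resent on every call; the server keeps no conversation state.
history = [
    {"role": "user", "content": "Name one Ollama endpoint."},
    {"role": "assistant", "content": "/api/tags"},
    {"role": "user", "content": "What does it return?"},
]
req = chat_request("llama3.1:8b", history)
# With the server running:
# json.load(urllib.request.urlopen(req))["message"]["content"]
```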
---
## Performance on M4 Pro 48 GB
- **MLX warning:** `MLX dynamic library not available` is **harmless**; Ollama falls back to Metal/CPU automatically
- **Metal backend:** Inference runs on the Apple Silicon GPU, which reads model weights directly from unified memory
- **Inference speed estimates:**
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
- **RAM usage (model loaded):**
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB
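These estimates translate into two practical numbers: how much RAM is left once a model is loaded, and how long a reply takes. The sketch below uses the upper end of each RAM range and the 48 GB total; the 500-token reply length is an arbitrary example, and the helper names are hypothetical.

```python
TOTAL_RAM_GB = 48

# Upper ends of the "RAM usage (model loaded)" ranges listed above.
MODEL_RAM_GB = {"7B": 6, "32B": 22, "70B (Q4)": 42}


def headroom_gb(model_ram_gb: float, total_gb: float = TOTAL_RAM_GB) -> float:
    """RAM left for the OS and other apps once the model is loaded."""
    return total_gb - model_ram_gb


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Generation time for a reply of `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


for name, ram in MODEL_RAM_GB.items():
    print(f"{name}: {headroom_gb(ram)} GB headroom")

# A 500-token reply at the low end of the 32B estimate (15 tok/s).
print(f"{seconds_for(500, 15):.0f} s")
```

A 70B (Q4) model leaves only about 6 GB for everything else, which is why it is the practical ceiling on this machine.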
### Performance Tuning
```bash
# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve
# KV cache quantization (smaller RAM footprint)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# Both together
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
```

@ -0,0 +1,113 @@
# 06 — Extraction Service Evals
> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
---
## Overview
The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
- **Zero-cost iteration** on extraction prompts
- **Side-by-side comparison** of local vs cloud model quality
- **Offline development** when cloud APIs are unavailable
---
## Files
| File | Purpose |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |
---
## Running Evals
```bash
cd services/extraction-service
# Ollama only (no extraction-service needed)
pnpm eval:ollama
# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```
### Prerequisites
- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
---
## Eval Coverage
| Task | Cases | Key Assertions |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
**Total: 19 cases, 50+ assertions**
---
## Important: Assertion Pattern
Ollama returns a raw JSON **string** — assertions must parse it inline:
```yaml
# ✅ Correct: parse the string first
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
# ❌ Wrong: output is a string, not an object, so property access fails
- type: javascript
  value: output.extractions.map(e=>e.extraction_class).includes('action')
```
```
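The parse-first rule can also be written as a standalone check, which makes the failure modes explicit. This is a sketch only: it assumes promptfoo's file-based Python assertion convention (a `get_assert(output, context)` function) and the `extractions[].extraction_class` shape used in the inline example above. Verify both against the eval configs before relying on it.

```python
import json


def get_assert(output: str, context: dict) -> bool:
    """Pass when the model output contains an extraction of class 'action'."""
    try:
        parsed = json.loads(output)  # output arrives as a string, so parse first
    except json.JSONDecodeError:
        return False  # malformed JSON is a failed case, not a crash
    if not isinstance(parsed, dict):
        return False
    classes = [e.get("extraction_class") for e in parsed.get("extractions", [])]
    return "action" in classes
```

Unlike the inline JavaScript version, this returns a failure on malformed JSON instead of throwing.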
---
## Pointing the Python Sidecar at Ollama
The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
> Check `services/extraction-service/python/` for exact env var names — the sidecar config may use different keys depending on LangExtract version.
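For orientation, this is roughly how a process could resolve those four variables with a Gemini default. It is a hypothetical sketch, not the sidecar's actual code: the variable names come from the exports above (which may differ by LangExtract version), and `SidecarConfig` and `load_sidecar_config` are invented for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SidecarConfig:
    provider: str
    base_url: Optional[str]
    api_key: Optional[str]
    model: Optional[str]


def load_sidecar_config(env: Optional[Mapping[str, str]] = None) -> SidecarConfig:
    """Read the LANGEXTRACT_* variables, defaulting the provider to Gemini."""
    env = os.environ if env is None else env
    return SidecarConfig(
        provider=env.get("LANGEXTRACT_PROVIDER", "gemini"),
        base_url=env.get("LANGEXTRACT_BASE_URL"),
        api_key=env.get("LANGEXTRACT_API_KEY"),
        model=env.get("LANGEXTRACT_MODEL"),
    )
```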
---
## Cost Comparison
| Provider | Cost per full run | Notes |
| ----------------------------------- | ----------------- | ---------------------------------- |
| **Gemini** (via extraction-service) | ~$0.003-0.005 | gemini-2.5-flash |
| **Ollama** (local) | $0.00 | Fully offline after model download |
---
## Recommended Models for Evals
| Model | JSON Quality | Speed | Notes |
| ------------------- | ------------ | -------- | ------------------------------- |
| `llama3.1:8b` | Good | Fast | Default, reliable JSON output |
| `qwen2.5:7b` | Excellent | Fast | Best JSON structure compliance |
| `qwen2.5-coder:32b` | Excellent | Moderate | Best quality, slower |
| `phi4` | Good | Fast | Good reasoning for triage tasks |

@ -0,0 +1,135 @@
# 09 — Environment Variables Reference
> All configuration variables for Ollama, Whisper, dashboard, and evals.
---
## Ollama Server
| Variable | Default | Purpose |
| -------------------------- | ------------------------ | ------------------------------------------------------ |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Server bind address |
| `OLLAMA_MODELS` | `~/.ollama/models` | Model storage path |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep model loaded after last request |
| `OLLAMA_FLASH_ATTENTION` | `false` | Enable flash attention (faster, less RAM) |
| `OLLAMA_KV_CACHE_TYPE` | _(none)_ | KV cache quantization (`q8_0` = smaller RAM footprint) |
| `OLLAMA_NUM_PARALLEL` | `1` | Number of concurrent requests |
| `OLLAMA_MAX_LOADED_MODELS` | `1` | Max models loaded in RAM simultaneously |
| `OLLAMA_GPU_OVERHEAD` | _(auto)_ | Reserved GPU memory (bytes) |
| `OLLAMA_ORIGINS` | _(local origins only)_ | Additional allowed CORS origins |
| `OLLAMA_DEBUG` | `false` | Enable debug logging |
| `HTTP_PROXY` | _(system)_ | HTTP proxy for model downloads |
| `HTTPS_PROXY` | _(system)_ | HTTPS proxy for model downloads |
| `NO_PROXY` | _(none)_ | Hosts to bypass proxy |
### Performance Tuning Combo
```bash
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_NUM_PARALLEL=2 \
OLLAMA_KEEP_ALIVE=10m \
ollama serve
```
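When the server is started from a script rather than a shell, the same combo can be passed as a process environment. A minimal sketch using the four values from the command above; `ollama_serve_env` is a hypothetical helper, and the launch line is left as a comment so nothing starts on import.

```python
import os
from typing import Mapping, Optional


def ollama_serve_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and add the four tuning variables from the combo above."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OLLAMA_FLASH_ATTENTION": "1",
        "OLLAMA_KV_CACHE_TYPE": "q8_0",
        "OLLAMA_NUM_PARALLEL": "2",
        "OLLAMA_KEEP_ALIVE": "10m",
    })
    return env


# To launch: subprocess.Popen(["ollama", "serve"], env=ollama_serve_env())
```

Copying the existing environment first keeps `PATH` and any proxy variables intact for the child process.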
---
## Extraction Service Evals (promptfoo)
| Variable | Default | Purpose |
| ----------------------- | --------------------------- | --------------------------------------- |
| `OLLAMA_MODEL` | `llama3.1:8b` | Model used by `pnpm eval:ollama` |
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | OpenAI-compat endpoint for promptfoo |
| `EXTRACTION_EVAL_TOKEN` | _(none)_ | Auth token for extraction-service evals |
### Usage
```bash
# Run evals with a different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama
EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
```
---
## Python Sidecar (LangExtract)
| Variable | Default | Purpose |
| ---------------------- | ---------------- | --------------------------------------------- |
| `LANGEXTRACT_PROVIDER` | `gemini` | Switch to `openai_compat` for Ollama |
| `LANGEXTRACT_BASE_URL` | _(Gemini)_ | Set to `http://localhost:11434/v1` for Ollama |
| `LANGEXTRACT_API_KEY` | _(Gemini key)_ | Set to `ollama` for local |
| `LANGEXTRACT_MODEL` | _(Gemini model)_ | Set to `llama3.1:8b` or preferred model |
### Switch to Ollama
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
---
## Mission Control Dashboard
| Variable | Default | Purpose |
| ------------ | ------------------------ | -------------------------------------- |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL (used by API routes) |
| `PORT` | `3100` | Dashboard dev server port |
### Start with Custom Ollama URL
```bash
OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
```
---
## Whisper.cpp
Whisper.cpp uses CLI flags rather than environment variables:
| Flag | Purpose | Example |
| ----------------- | ----------------------------- | -------------------------------------------------- |
| `--model` | Path to GGML model file | `--model ~/whisper-models/ggml-large-v3-turbo.bin` |
| `--language` | Input language | `--language en` |
| `--file` | Audio file path | `--file recording.wav` |
| `--output-json` | Output in JSON format | `--output-json` |
| `--output-srt` | Output as SRT subtitles | `--output-srt` |
| `--output-vtt` | Output as VTT subtitles | `--output-vtt` |
| `--translate` | Translate to English | `--translate` |
| `--threads` | Number of CPU threads | `--threads 8` |
| `--processors` | Number of processors | `--processors 1` |
| `--print-colors` | Colorize output by confidence | `--print-colors` |
| `--no-timestamps` | Omit timestamps | `--no-timestamps` |
| `--port` | Server port (whisper-server) | `--port 8080` |
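Because configuration is all flags, a wrapper script mostly has to assemble an argument list. The sketch below combines flags from the table into a command for `subprocess.run`; it assumes the `whisper-cli` binary listed under "All Paths" is on `PATH`, and `whisper_args` is a hypothetical helper name.

```python
import os


def whisper_args(audio_file: str, model_path: str, language: str = "en",
                 threads: int = 8, output_json: bool = True) -> list:
    """Assemble a whisper-cli command line from the flags in the table above."""
    args = [
        "whisper-cli",
        "--model", os.path.expanduser(model_path),  # subprocess does not expand "~"
        "--language", language,
        "--file", audio_file,
        "--threads", str(threads),
    ]
    if output_json:
        args.append("--output-json")
    return args


cmd = whisper_args("recording.wav", "~/whisper-models/ggml-large-v3-turbo.bin")
# To run: subprocess.run(cmd, check=True)
```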
---
## Proxy / Network (Corporate)
| Variable | Value on This Machine | Purpose |
| ------------------------------ | -------------------------------- | ------------------------------------------------- |
| `HTTP_PROXY` | `http://cso.proxy.att.com:8080/` | Corporate HTTP proxy |
| `HTTPS_PROXY` | `http://cso.proxy.att.com:8080/` | Corporate HTTPS proxy |
| `NODE_TLS_REJECT_UNAUTHORIZED` | `0` | Disables Node.js TLS certificate verification to work around Forcepoint SSL interception |
| `NO_PROXY` | _(not set by default)_ | Add `ollama.com,registry.ollama.ai` if pulls fail |
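To see why both hosts are listed in `NO_PROXY`, it helps to look at how such a list is typically matched. The function below is a simplified illustration of the common exact-or-subdomain rule, not Ollama's or any library's actual implementation; real clients also handle ports, IP ranges, and `*`. `bypasses_proxy` is a hypothetical name.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """True if `host` equals, or is a subdomain of, any NO_PROXY entry."""
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lstrip(".").lower()
        if entry and (host == entry or host.endswith("." + entry)):
            return True
    return False


NO_PROXY = "ollama.com,registry.ollama.ai"
for candidate in ["registry.ollama.ai", "ollama.com", "example.com"]:
    print(candidate, bypasses_proxy(candidate, NO_PROXY))
```

Under this rule `registry.ollama.ai` is not covered by the `ollama.com` entry because it lives under a different domain, so it needs its own entry.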
---
## All Paths
| Path | Content |
| ------------------------------------ | --------------------------- |
| `~/.ollama/models/` | Downloaded Ollama models |
| `~/whisper-models/` | Whisper GGML model files |
| `/opt/homebrew/bin/ollama` | Ollama binary |
| `/opt/homebrew/bin/whisper-cli` | Whisper CLI binary |
| `/opt/homebrew/bin/ffmpeg` | FFmpeg binary |
| `__LOCAL_LLMs/dashboard/` | Mission Control Next.js app |
| `__LOCAL_LLMs/docs/` | This documentation |
| `services/extraction-service/evals/` | Promptfoo eval configs |