diff --git a/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md b/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
new file mode 100644
index 00000000..6db7ce54
--- /dev/null
+++ b/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
@@ -0,0 +1,230 @@
+# 02 — Ollama Setup & Models
+
+> Installation, server configuration, model management, and memory behavior.
+
+---
+
+## Installation
+
+```bash
+brew install ollama
+```
+
+- **Version installed:** 0.16.2
+- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
+- **Models stored at:** `~/.ollama/models/`
+- **Config:** No config file — uses environment variables
+
+---
+
+## Starting the Server
+
+```bash
+# Option A: foreground (dev, see logs)
+ollama serve
+
+# Option B: background service (auto-start at login)
+brew services start ollama
+
+# Check if running
+curl http://localhost:11434/api/tags
+```
+
+**Server listens on:** `http://127.0.0.1:11434`
+
+### Corporate Proxy Note
+
+Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically. Model downloads go through it.
+If a pull fails, bypass the proxy for the registry hosts. The server process performs the download (the `ollama pull` client only sends the request), so set `NO_PROXY` where the server starts, then retry `ollama pull <model>`:
+
+```bash
+NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
+```
+
+---
+
+## Models Currently Installed
+
+Verified 2026-02-19:
+
+| Model               | Size   | Pull Command                    | Use Case                                      |
+| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
+| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | Best coding model — Swift, TypeScript, Python |
+| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | Default for evals, fast inference             |
+
+### Useful Commands
+
+```bash
+# List all downloaded models (disk)
+ollama list
+
+# Show what's currently loaded in RAM
+ollama ps
+
+# Pull a new model (downloads to ~/.ollama/models/)
+ollama pull <model>
+
+# Run interactively
+ollama run <model>
+
+# Run with a one-shot prompt
+ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"
+
+# Remove a model from disk
+ollama rm <model>
+
+# Show model details (size, parameters, template)
+ollama show <model>
+```
+
+---
+
+## Memory Management
+
+Ollama loads **one model at a time** into RAM by default. This is critical for a 48 GB machine.
+
+### Key Behaviors
+
+1. **Models are stored on disk** — you can download as many as disk allows
+2. **Only the active model loads into RAM** — previous model is evicted when switching
+3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
+4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately
+
+### Controlling Idle Timeout
+
+```bash
+# Default: 5 minutes
+ollama serve
+
+# Unload immediately after each request (saves RAM)
+OLLAMA_KEEP_ALIVE=0 ollama serve
+
+# Keep loaded for 30 minutes
+OLLAMA_KEEP_ALIVE=30m ollama serve
+
+# Keep loaded forever (until manual unload or server restart)
+OLLAMA_KEEP_ALIVE=-1 ollama serve
+```
+
+### Manual Load/Unload
+
+```bash
+# Load a model into RAM (empty prompt trick)
+curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'
+
+# Unload a model from RAM immediately
+curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
+```
+
+### How Many Models Can You Have Downloaded?
+
+As many as your disk allows. Only the loaded model consumes RAM. Rough disk planning for up to 10 models:
+
+| Count           | Approx Disk |
+| --------------- | ----------- |
+| 2 (current)     | ~24 GB      |
+| 5 (moderate)    | ~55 GB      |
+| 10 (full stack) | ~115 GB     |
+
+---
+
+## OpenAI-Compatible API
+
+Ollama exposes an OpenAI-compatible API at:
+
+```
+Base URL: http://localhost:11434/v1
+API Key: ollama (any non-empty string)
+```
+
+### Example: curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### Example: Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await client.chat.completions.create({
+  model: 'llama3.1:8b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+```
+
+### Example: Python
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+```
+
+---
+
+## Native Ollama API
+
+Beyond the OpenAI-compatible endpoint, Ollama has its own API:
+
+| Endpoint          | Method | Purpose                                                        |
+| ----------------- | ------ | -------------------------------------------------------------- |
+| `/api/tags`       | GET    | List all downloaded models                                     |
+| `/api/ps`         | GET    | List models currently loaded in RAM                            |
+| `/api/generate`   | POST   | Generate text (single-turn)                                    |
+| `/api/chat`       | POST   | Chat completion (multi-turn)                                   |
+| `/api/pull`       | POST   | Download a model                                               |
+| `/api/delete`     | DELETE | Remove a model from disk                                       |
+| `/api/show`       | POST   | Show model metadata                                            |
+| `/api/embeddings` | POST   | Generate embeddings (legacy; `/api/embed` is the newer endpoint) |
+
+Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
+
+---
+
+## Performance on M4 Pro 48 GB
+
+- **MLX warning:** `MLX dynamic library not available` — **harmless**, falls back to Metal/CPU automatically
+- **Metal backend:** Inference runs on the Apple Silicon GPU through Metal, using unified memory shared with the CPU
+- **Inference speed estimates:**
+  - 7B models: ~40-60 tok/s
+  - 32B models: ~15-25 tok/s
+  - 70B (Q4): ~5-10 tok/s
+- **RAM usage (model loaded):**
+  - 7B: ~5-6 GB
+  - 32B: ~20-22 GB
+  - 70B (Q4): ~40-42 GB
+
+### Performance Tuning
+
+```bash
+# Enable flash attention (faster, less RAM)
+OLLAMA_FLASH_ATTENTION=1 ollama serve
+
+# KV cache quantization (smaller RAM footprint; needs flash attention to take effect)
+OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+
+# Both together
+OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+
+# Allow concurrent requests (default: 1)
+OLLAMA_NUM_PARALLEL=2 ollama serve
+```
diff --git a/__LOCAL_LLMs/docs/06-extraction-service-evals.md b/__LOCAL_LLMs/docs/06-extraction-service-evals.md
new file mode 100644
index 00000000..f0221b41
--- /dev/null
+++ b/__LOCAL_LLMs/docs/06-extraction-service-evals.md
@@ -0,0 +1,113 @@
+# 06 — Extraction Service Evals
+
+> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.
+
+---
+
+## Overview
+
+The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:
+
+- **Zero-cost iteration** on extraction prompts
+- **Side-by-side comparison** of local vs cloud model quality
+- **Offline development** when cloud APIs are unavailable
+
+---
+
+## Files
+
+| File                                                      | Purpose                                            |
+| --------------------------------------------------------- | -------------------------------------------------- |
+| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
+| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
+| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
+| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
+| `services/extraction-service/evals/README.md`             | Full usage docs                                    |
+
+---
+
+## Running Evals
+
+```bash
+cd services/extraction-service
+
+# Ollama only (no extraction-service needed)
+pnpm eval:ollama
+
+# Different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
+pnpm eval:compare
+```
+
+### Prerequisites
+
+- Ollama must be running (`ollama serve`)
+- A model must be available (`ollama pull llama3.1:8b`)
+- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set
+
+---
+
+## Eval Coverage
+
+| Task                    | Cases | Key Assertions                                                  |
+| ----------------------- | ----- | --------------------------------------------------------------- |
+| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
+| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
+| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
+| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
+| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
+
+**Total: 19 cases, 50+ assertions**
+
+---
+
+## Important: Assertion Pattern
+
+Ollama returns a raw JSON **string** — assertions must parse it inline:
+
+```yaml
+# ✅ Correct — parse the string first (single expression, so it works as a one-line value)
+- type: javascript
+  value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"
+
+# ❌ Wrong — output is a string, not an object
+- type: javascript
+  value: output.classes.includes('action')
+```
+
+---
+
+## Pointing the Python Sidecar at Ollama
+
+The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+> Check `services/extraction-service/python/` for exact env var names — the sidecar config may use different keys depending on LangExtract version.
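The parse-first rule from the assertion pattern in this document applies to any consumer of raw model output, not only promptfoo. Below is a minimal Python sketch, assuming the response shape used in that assertion (an `extractions` list whose entries carry an `extraction_class`). The helper name `has_extraction_class` is hypothetical and not part of the eval suite.

```python
import json


def has_extraction_class(raw_output: str, wanted: str) -> bool:
    """Return True if the raw JSON string contains an extraction of class `wanted`."""
    try:
        # The model output arrives as a string, so it must be parsed first.
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # non-JSON output is treated as a failed check
    if not isinstance(parsed, dict):
        return False
    extractions = parsed.get("extractions", [])
    if not isinstance(extractions, list):
        return False
    return any(
        isinstance(item, dict) and item.get("extraction_class") == wanted
        for item in extractions
    )


sample = '{"extractions": [{"extraction_class": "action"}]}'
print(has_extraction_class(sample, "action"))
print(has_extraction_class("not json at all", "action"))
```

Treating unparseable output as a failure rather than raising keeps a single malformed response from aborting a whole run; whether that is the right policy for this suite is a design choice, not something the eval configs confirm.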
+
+---
+
+## Cost Comparison
+
+| Provider                            | Cost per full run | Notes                              |
+| ----------------------------------- | ----------------- | ---------------------------------- |
+| **Gemini** (via extraction-service) | ~$0.003–0.005     | gemini-2.5-flash                   |
+| **Ollama** (local)                  | $0.00             | Fully offline after model download |
+
+---
+
+## Recommended Models for Evals
+
+| Model               | JSON Quality | Speed    | Notes                           |
+| ------------------- | ------------ | -------- | ------------------------------- |
+| `llama3.1:8b`       | Good         | Fast     | Default, reliable JSON output   |
+| `qwen2.5:7b`        | Excellent    | Fast     | Best JSON structure compliance  |
+| `qwen2.5-coder:32b` | Excellent    | Moderate | Best quality, slower            |
+| `phi4`              | Good         | Fast     | Good reasoning for triage tasks |
diff --git a/__LOCAL_LLMs/docs/09-environment-variables.md b/__LOCAL_LLMs/docs/09-environment-variables.md
new file mode 100644
index 00000000..b2de3bbf
--- /dev/null
+++ b/__LOCAL_LLMs/docs/09-environment-variables.md
@@ -0,0 +1,135 @@
+# 09 — Environment Variables Reference
+
+> All configuration variables for Ollama, Whisper, dashboard, and evals.
+
+---
+
+## Ollama Server
+
+| Variable                   | Default                  | Purpose                                                                       |
+| -------------------------- | ------------------------ | ----------------------------------------------------------------------------- |
+| `OLLAMA_HOST`              | `http://127.0.0.1:11434` | Server bind address                                                           |
+| `OLLAMA_MODELS`            | `~/.ollama/models`       | Model storage path                                                            |
+| `OLLAMA_KEEP_ALIVE`        | `5m`                     | How long to keep model loaded after last request                              |
+| `OLLAMA_FLASH_ATTENTION`   | `false`                  | Enable flash attention (faster, less RAM)                                     |
+| `OLLAMA_KV_CACHE_TYPE`     | `f16`                    | KV cache quantization (`q8_0` = smaller RAM footprint; needs flash attention) |
+| `OLLAMA_NUM_PARALLEL`      | `1`                      | Number of concurrent requests                                                 |
+| `OLLAMA_MAX_LOADED_MODELS` | `1`                      | Max models loaded in RAM simultaneously                                       |
+| `OLLAMA_GPU_OVERHEAD`      | `0`                      | Reserved GPU memory (bytes)                                                   |
+| `OLLAMA_ORIGINS`           | _(localhost only)_       | Additional allowed CORS origins                                               |
+| `OLLAMA_DEBUG`             | `false`                  | Enable debug logging                                                          |
+| `HTTP_PROXY`               | _(system)_               | HTTP proxy for model downloads                                                |
+| `HTTPS_PROXY`              | _(system)_               | HTTPS proxy for model downloads                                               |
+| `NO_PROXY`                 | _(none)_                 | Hosts to bypass proxy                                                         |
+
+### Performance Tuning Combo
+
+```bash
+OLLAMA_FLASH_ATTENTION=1 \
+OLLAMA_KV_CACHE_TYPE=q8_0 \
+OLLAMA_NUM_PARALLEL=2 \
+OLLAMA_KEEP_ALIVE=10m \
+ollama serve
+```
+
+---
+
+## Extraction Service Evals (promptfoo)
+
+| Variable                | Default                     | Purpose                                 |
+| ----------------------- | --------------------------- | --------------------------------------- |
+| `OLLAMA_MODEL`          | `llama3.1:8b`               | Model used by `pnpm eval:ollama`        |
+| `OLLAMA_BASE_URL`       | `http://localhost:11434/v1` | OpenAI-compat endpoint for promptfoo    |
+| `EXTRACTION_EVAL_TOKEN` | _(none)_                    | Auth token for extraction-service evals |
+
+### Usage
+
+```bash
+# Run evals with a different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama
+EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
+```
+
+---
+
+## Python Sidecar (LangExtract)
+
+| Variable               | Default          | Purpose                                       |
+| ---------------------- | ---------------- | --------------------------------------------- |
+| `LANGEXTRACT_PROVIDER` | `gemini`         | Switch to `openai_compat` for Ollama          |
+| `LANGEXTRACT_BASE_URL` | _(Gemini)_       | Set to `http://localhost:11434/v1` for Ollama |
+| `LANGEXTRACT_API_KEY`  | _(Gemini key)_   | Set to `ollama` for local                     |
+| `LANGEXTRACT_MODEL`    | _(Gemini model)_ | Set to `llama3.1:8b` or preferred model       |
+
+### Switch to Ollama
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+---
+
+## Mission Control Dashboard
+
+| Variable     | Default                  | Purpose                                |
+| ------------ | ------------------------ | -------------------------------------- |
+| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL (used by API routes) |
+| `PORT`       | `3100`                   | Dashboard dev server port              |
+
+### Start with Custom Ollama URL
+
+```bash
+OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
+```
+
+---
+
+## Whisper.cpp
+
+Whisper.cpp uses CLI flags rather than environment variables:
+
+| Flag              | Purpose                       | Example                                            |
+| ----------------- | ----------------------------- | -------------------------------------------------- |
+| `--model`         | Path to GGML model file       | `--model ~/whisper-models/ggml-large-v3-turbo.bin` |
+| `--language`      | Input language                | `--language en`                                    |
+| `--file`          | Audio file path               | `--file recording.wav`                             |
+| `--output-json`   | Output in JSON format         | `--output-json`                                    |
+| `--output-srt`    | Output as SRT subtitles       | `--output-srt`                                     |
+| `--output-vtt`    | Output as VTT subtitles       | `--output-vtt`                                     |
+| `--translate`     | Translate to English          | `--translate`                                      |
+| `--threads`       | Number of CPU threads         | `--threads 8`                                      |
+| `--processors`    | Number of processors          | `--processors 1`                                   |
+| `--print-colors`  | Colorize output by confidence | `--print-colors`                                   |
+| `--no-timestamps` | Omit timestamps               | `--no-timestamps`                                  |
+| `--port`          | Server port (whisper-server)  | `--port 8080`                                      |
+
+---
+
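Because Whisper.cpp is configured through flags, a caller typically assembles an argument list rather than setting environment variables. The sketch below builds one from the flags in the Whisper.cpp table. It assumes the `whisper-cli` binary is on `PATH`; the function name `build_whisper_command` and its defaults are illustrative, not part of this repo.

```python
import os
import shlex


def build_whisper_command(audio_file, model_file, language="en", threads=8, output_json=True):
    """Assemble a whisper-cli argument list using flags from the Whisper.cpp table."""
    command = [
        "whisper-cli",  # assumed to be on PATH
        "--model", os.path.expanduser(model_file),  # expand a leading ~ if present
        "--language", language,
        "--file", str(audio_file),
        "--threads", str(threads),
    ]
    if output_json:
        command.append("--output-json")
    return command


cmd = build_whisper_command("recording.wav", "~/whisper-models/ggml-large-v3-turbo.bin")
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list instead of a single string lets the command go straight to `subprocess.run(cmd)` without shell quoting concerns.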
+## Proxy / Network (Corporate)
+
+| Variable                       | Value on This Machine            | Purpose                                                                                 |
+| ------------------------------ | -------------------------------- | --------------------------------------------------------------------------------------- |
+| `HTTP_PROXY`                   | `http://cso.proxy.att.com:8080/` | Corporate HTTP proxy                                                                    |
+| `HTTPS_PROXY`                  | `http://cso.proxy.att.com:8080/` | Corporate HTTPS proxy                                                                   |
+| `NODE_TLS_REJECT_UNAUTHORIZED` | `0`                              | Disables TLS certificate verification in Node.js to get past Forcepoint SSL interception |
+| `NO_PROXY`                     | _(not set by default)_           | Add `ollama.com,registry.ollama.ai` if pulls fail                                       |
+
+---
+
+## All Paths
+
+| Path                                 | Content                     |
+| ------------------------------------ | --------------------------- |
+| `~/.ollama/models/`                  | Downloaded Ollama models    |
+| `~/whisper-models/`                  | Whisper GGML model files    |
+| `/opt/homebrew/bin/ollama`           | Ollama binary               |
+| `/opt/homebrew/bin/whisper-cli`      | Whisper CLI binary          |
+| `/opt/homebrew/bin/ffmpeg`           | FFmpeg binary               |
+| `__LOCAL_LLMs/dashboard/`            | Mission Control Next.js app |
+| `__LOCAL_LLMs/docs/`                 | This documentation          |
+| `services/extraction-service/evals/` | Promptfoo eval configs      |
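The `NO_PROXY` row in the Proxy / Network table can also be applied from a launcher script instead of by hand. The following is a rough sketch, assuming the two registry hosts named in that table are the only ones that need to bypass the proxy; `with_registry_bypass` is a hypothetical helper, not an Ollama API.

```python
import os

# Hosts taken from the NO_PROXY row in the Proxy / Network table.
REGISTRY_HOSTS = ("ollama.com", "registry.ollama.ai")


def with_registry_bypass(env):
    """Return a copy of `env` whose NO_PROXY also lists the Ollama registry hosts."""
    updated = dict(env)  # never mutate the caller's mapping
    hosts = [h.strip() for h in updated.get("NO_PROXY", "").split(",") if h.strip()]
    for host in REGISTRY_HOSTS:
        if host not in hosts:
            hosts.append(host)
    updated["NO_PROXY"] = ",".join(hosts)
    return updated


# The result can be passed as `env=` when starting the server from a script.
print(with_registry_bypass(os.environ)["NO_PROXY"])
```

Existing `NO_PROXY` entries are kept and the registry hosts are appended only when missing, so calling the helper repeatedly does not duplicate entries.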