docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management, idle timeout, manual load/unload, OpenAI-compatible API, native API reference, performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server, evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings
parent 464ffb92ec
commit 80f794dee7
230
__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
Normal file
@@ -0,0 +1,230 @@
# 02 — Ollama Setup & Models

> Installation, server configuration, model management, and memory behavior.

---

## Installation

```bash
brew install ollama
```

- **Version installed:** 0.16.2
- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
- **Models stored at:** `~/.ollama/models/`
- **Config:** No config file; uses environment variables

---

## Starting the Server

```bash
# Option A: foreground (dev, see logs)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama

# Check if running
curl http://localhost:11434/api/tags
```

**Server listens on:** `http://127.0.0.1:11434`
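
The `curl` check can also be scripted, which is convenient in setup or eval scripts. The following is a minimal sketch using only the Python standard library; the helper name `is_ollama_up` and the one-second timeout are illustrative assumptions, not part of Ollama. It deliberately ignores any configured proxy, since the server is local.

```python
import http.client
import json
import urllib.error
import urllib.request


def is_ollama_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 1.0) -> bool:
    """Return True if GET /api/tags answers with a JSON object (hypothetical helper)."""
    # An empty ProxyHandler ignores HTTP_PROXY/HTTPS_PROXY so the request stays local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            return isinstance(json.load(resp), dict)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a non-JSON body all count as "not up".
        return False
```

Calling `is_ollama_up()` before starting an eval run gives a quick failure instead of a hang when the server has not been started.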

### Corporate Proxy Note

Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically, so model downloads go through it. If a pull fails, bypass the proxy for the Ollama hosts:

```bash
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
```

---

## Models Currently Installed

Verified 2026-02-19:

| Model               | Size   | Pull Command                    | Use Case                                      |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | Best coding model (Swift, TypeScript, Python) |
| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | Default for evals, fast inference             |

### Useful Commands

```bash
# List all downloaded models (disk)
ollama list

# Show what's currently loaded in RAM
ollama ps

# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>

# Run interactively
ollama run <model>

# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"

# Remove a model from disk
ollama rm <model>

# Show model details (size, parameters, template)
ollama show <model>
```

---

## Memory Management

Ollama keeps a model in RAM only while it is in use or inside its keep-alive window. More than one model can be resident if they all fit in memory (capped by `OLLAMA_MAX_LOADED_MODELS`), but on a 48 GB machine a 32B model leaves little headroom, so plan around **one large model at a time**.

### Key Behaviors

1. **Models are stored on disk.** You can download as many as disk allows
2. **Only models in use load into RAM.** When a new model does not fit, idle models are evicted to make room
3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately

### Controlling Idle Timeout

```bash
# Default: 5 minutes
ollama serve

# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve

# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve

# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve
```

### Manual Load/Unload

```bash
# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'

# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
```
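
The same two requests can be built from a script. This is a sketch under the assumption that only the standard library is available; `keep_alive_request` is a made-up helper name, and the request bodies mirror the `curl` calls above.

```python
import json
import urllib.request


def keep_alive_request(model: str, keep_alive: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the POST /api/generate request that loads or unloads a model."""
    payload = {"model": model, "prompt": "", "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same bodies as the two curl calls: "10m" loads, "0" unloads.
load_req = keep_alive_request("qwen2.5-coder:32b", "10m")
unload_req = keep_alive_request("qwen2.5-coder:32b", "0")
# To send one against a running server: urllib.request.urlopen(load_req).read()
```

Building the request separately from sending it keeps the helper usable in tests without a running server.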

### How Many Models Can You Have Downloaded?

As many as your disk allows. Only loaded models consume RAM. Rough disk budget when planning for up to 10 models:

| Model count     | Approx. disk |
| --------------- | ------------ |
| 2 (current)     | ~24 GB       |
| 5 (moderate)    | ~55 GB       |
| 10 (full stack) | ~115 GB      |
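
The first row of the budget is just the sum of the two installed models (19 GB + 4.9 GB). A small sketch of that arithmetic, plus a free-space check before pulling more; the helper names and the default path `/` are assumptions for illustration.

```python
import shutil

# Sizes in GB, taken from the "Models Currently Installed" table.
INSTALLED_GB = {"qwen2.5-coder:32b": 19.0, "llama3.1:8b": 4.9}


def total_gb(sizes: dict) -> float:
    """Sum model sizes, rounded to one decimal place."""
    return round(sum(sizes.values()), 1)


def fits_on_disk(planned_gb: float, path: str = "/") -> bool:
    """Compare a planned download budget against free space on the given volume."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return planned_gb <= free_gb


print(total_gb(INSTALLED_GB))  # 23.9, i.e. the "~24 GB" row
```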

---

## OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API (chat completions, completions, embeddings, model listing). Ollama ignores the API key, but the OpenAI SDKs require one to be set:

```
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
```

### Example: curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### Example: Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```

### Example: Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```
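
With `response_format` set to `json_object`, the reply still arrives as a JSON string in `choices[0].message.content`, so the caller has to parse it. A minimal sketch of that step follows; `parse_json_content` is a hypothetical helper, and it fails loudly because a local model can still return empty or truncated JSON.

```python
import json
from typing import Any, Dict, Optional


def parse_json_content(content: Optional[str]) -> Dict[str, Any]:
    """Parse the message content of a JSON-mode chat completion (hypothetical helper)."""
    if not content:
        raise ValueError("model returned an empty message")
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


# With the `response` object from the example above:
#   data = parse_json_content(response.choices[0].message.content)
```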

---

## Native Ollama API

Beyond the OpenAI-compatible endpoint, Ollama has its own API:

| Endpoint          | Method | Purpose                             |
| ----------------- | ------ | ----------------------------------- |
| `/api/tags`       | GET    | List all downloaded models          |
| `/api/ps`         | GET    | List models currently loaded in RAM |
| `/api/generate`   | POST   | Generate text (single-turn)         |
| `/api/chat`       | POST   | Chat completion (multi-turn)        |
| `/api/pull`       | POST   | Download a model                    |
| `/api/delete`     | DELETE | Remove a model from disk            |
| `/api/show`       | POST   | Show model metadata                 |
| `/api/embed`      | POST   | Generate embeddings                 |
| `/api/embeddings` | POST   | Generate embeddings (legacy)        |

Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
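
Unlike the OpenAI-compatible route, `/api/generate` and `/api/chat` stream by default: the body is newline-delimited JSON, each line carrying a `response` fragment, with `done: true` on the last one. A sketch of reassembling such a stream, assuming only the standard library; `collect_stream` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Against a running server, an HTTP response object iterates line by line:
#   body = json.dumps({"model": "llama3.1:8b", "prompt": "Hi"}).encode()
#   req = urllib.request.Request("http://localhost:11434/api/generate", data=body)
#   with urllib.request.urlopen(req) as resp:
#       text = collect_stream(resp)
sample = [b'{"response": "Hel", "done": false}\n', b'{"response": "lo", "done": true}\n']
print(collect_stream(sample))  # prints "Hello"
```

Sending `"stream": false` in the request body returns a single JSON object instead, which avoids this step when incremental output is not needed.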

---

## Performance on M4 Pro 48 GB

- **MLX warning:** `MLX dynamic library not available` is **harmless**; Ollama falls back to Metal/CPU automatically
- **Metal backend:** Used automatically on Apple Silicon; the GPU works directly out of unified memory
- **Inference speed estimates:**
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
- **RAM usage (model loaded):**
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB
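
The speed figures above are estimates; the actual rate can be read off any native API response. The final object from `/api/generate` or `/api/chat` includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so tokens per second is a single division. A small sketch, with a made-up helper name and made-up sample numbers:

```python
def tokens_per_second(final_chunk: dict) -> float:
    """Generation speed from the final /api/generate or /api/chat response object."""
    eval_count = final_chunk["eval_count"]            # tokens generated
    eval_duration_ns = final_chunk["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative numbers only: 120 tokens in 3 seconds.
print(tokens_per_second({"eval_count": 120, "eval_duration": 3_000_000_000}))  # 40.0
```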

### Performance Tuning

```bash
# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quantization (smaller RAM footprint; only takes effect with flash attention on)
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
```
113
__LOCAL_LLMs/docs/06-extraction-service-evals.md
Normal file
@@ -0,0 +1,113 @@
# 06 — Extraction Service Evals

> Running the promptfoo eval suite against Ollama for offline, zero-cost model evaluation.

---

## Overview

The extraction-service has a full promptfoo eval suite that can run against local Ollama models instead of (or alongside) cloud Gemini. This enables:

- **Zero-cost iteration** on extraction prompts
- **Side-by-side comparison** of local vs cloud model quality
- **Offline development** when cloud APIs are unavailable

---

## Files

| File                                                      | Purpose                                            |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
| `services/extraction-service/evals/README.md`             | Full usage docs                                    |

---

## Running Evals

```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```

### Prerequisites

- Ollama must be running (`ollama serve`)
- A model must be available (`ollama pull llama3.1:8b`)
- For comparison: extraction-service must be running with `EXTRACTION_EVAL_TOKEN` set

---

## Eval Coverage

| Task                    | Cases | Key Assertions                                                  |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |

**Total: 19 cases, 50+ assertions**

---

## Important: Assertion Pattern

Ollama returns the model output as a raw JSON **string**, so each assertion must parse it before inspecting fields:

```yaml
# ✅ Correct: parse the string first (single expression, no statements)
- type: javascript
  value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"

# ❌ Wrong: output is a string, not an object
- type: javascript
  value: output.classes.includes('action')
```
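
The same parse-then-check logic, sketched in Python for anyone post-processing eval output outside promptfoo. The `extractions` and `extraction_class` field names come from the assertion above; the function name, the `extraction_text` field, and the sample payload are made up for illustration.

```python
import json


def has_extraction_class(output: str, wanted: str) -> bool:
    """True if the raw JSON string contains an extraction of class `wanted`."""
    try:
        result = json.loads(output)  # output is a string, so parse it first
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a failed check
    if not isinstance(result, dict):
        return False
    extractions = result.get("extractions") or []
    return any(
        isinstance(item, dict) and item.get("extraction_class") == wanted
        for item in extractions
    )


raw = '{"extractions": [{"extraction_class": "action", "extraction_text": "email Sam"}]}'
print(has_extraction_class(raw, "action"))  # True
```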

---

## Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default. To switch to Ollama for local dev:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

> Check `services/extraction-service/python/` for exact env var names; the sidecar config may use different keys depending on the LangExtract version.
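
For orientation, this is roughly how a sidecar could consume those four variables. It is a sketch under stated assumptions, not the sidecar's actual code: the `SidecarConfig` class and `load_sidecar_config` function are hypothetical, and the variable names are the unverified ones listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SidecarConfig:
    provider: str
    base_url: Optional[str]
    api_key: Optional[str]
    model: Optional[str]


def load_sidecar_config(env: Mapping[str, str] = os.environ) -> SidecarConfig:
    """Read the LANGEXTRACT_* variables (names unverified against the real sidecar)."""
    cfg = SidecarConfig(
        provider=env.get("LANGEXTRACT_PROVIDER", "gemini"),
        base_url=env.get("LANGEXTRACT_BASE_URL"),
        api_key=env.get("LANGEXTRACT_API_KEY"),
        model=env.get("LANGEXTRACT_MODEL"),
    )
    # An OpenAI-compatible provider is useless without an endpoint to call.
    if cfg.provider == "openai_compat" and not cfg.base_url:
        raise ValueError("LANGEXTRACT_BASE_URL is required when provider is openai_compat")
    return cfg
```

Passing the environment in as a mapping keeps the loader testable with a plain dictionary.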

---

## Cost Comparison

| Provider                            | Cost per full run | Notes                              |
| ----------------------------------- | ----------------- | ---------------------------------- |
| **Gemini** (via extraction-service) | ~$0.003–0.005     | gemini-2.5-flash                   |
| **Ollama** (local)                  | $0.00             | Fully offline after model download |

---

## Recommended Models for Evals

| Model               | JSON Quality | Speed    | Notes                           |
| ------------------- | ------------ | -------- | ------------------------------- |
| `llama3.1:8b`       | Good         | Fast     | Default, reliable JSON output   |
| `qwen2.5:7b`        | Excellent    | Fast     | Best JSON structure compliance  |
| `qwen2.5-coder:32b` | Excellent    | Moderate | Best quality, slower            |
| `phi4`              | Good         | Fast     | Good reasoning for triage tasks |
135
__LOCAL_LLMs/docs/09-environment-variables.md
Normal file
@@ -0,0 +1,135 @@
# 09 — Environment Variables Reference

> All configuration variables for Ollama, Whisper, dashboard, and evals.

---

## Ollama Server

| Variable                   | Default                  | Purpose                                                |
| -------------------------- | ------------------------ | ------------------------------------------------------ |
| `OLLAMA_HOST`              | `http://127.0.0.1:11434` | Server bind address                                    |
| `OLLAMA_MODELS`            | `~/.ollama/models`       | Model storage path                                     |
| `OLLAMA_KEEP_ALIVE`        | `5m`                     | How long to keep model loaded after last request       |
| `OLLAMA_FLASH_ATTENTION`   | `false`                  | Enable flash attention (faster, less RAM)              |
| `OLLAMA_KV_CACHE_TYPE`     | `f16`                    | KV cache quantization (`q8_0`; needs flash attention)  |
| `OLLAMA_NUM_PARALLEL`      | `1`                      | Number of concurrent requests                          |
| `OLLAMA_MAX_LOADED_MODELS` | _(auto)_                 | Max models in RAM at once (auto: 3 per GPU, 3 on CPU)  |
| `OLLAMA_GPU_OVERHEAD`      | `0`                      | Reserved GPU memory (bytes)                            |
| `OLLAMA_ORIGINS`           | _(local only)_           | Allowed CORS origins, in addition to local ones        |
| `OLLAMA_DEBUG`             | `false`                  | Enable debug logging                                   |
| `HTTP_PROXY`               | _(system)_               | HTTP proxy for model downloads                         |
| `HTTPS_PROXY`              | _(system)_               | HTTPS proxy for model downloads                        |
| `NO_PROXY`                 | _(none)_                 | Hosts to bypass proxy                                  |

### Performance Tuning Combo

```bash
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_NUM_PARALLEL=2 \
OLLAMA_KEEP_ALIVE=10m \
ollama serve
```

---

## Extraction Service Evals (promptfoo)

| Variable                | Default                     | Purpose                                 |
| ----------------------- | --------------------------- | --------------------------------------- |
| `OLLAMA_MODEL`          | `llama3.1:8b`               | Model used by `pnpm eval:ollama`        |
| `OLLAMA_BASE_URL`       | `http://localhost:11434/v1` | OpenAI-compat endpoint for promptfoo    |
| `EXTRACTION_EVAL_TOKEN` | _(none)_                    | Auth token for extraction-service evals |

### Usage

```bash
# Run evals with a different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama
EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
```

---

## Python Sidecar (LangExtract)

| Variable               | Default          | Purpose                                       |
| ---------------------- | ---------------- | --------------------------------------------- |
| `LANGEXTRACT_PROVIDER` | `gemini`         | Switch to `openai_compat` for Ollama          |
| `LANGEXTRACT_BASE_URL` | _(Gemini)_       | Set to `http://localhost:11434/v1` for Ollama |
| `LANGEXTRACT_API_KEY`  | _(Gemini key)_   | Set to `ollama` for local                     |
| `LANGEXTRACT_MODEL`    | _(Gemini model)_ | Set to `llama3.1:8b` or preferred model       |

### Switch to Ollama

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

---

## Mission Control Dashboard

| Variable     | Default                  | Purpose                                |
| ------------ | ------------------------ | -------------------------------------- |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL (used by API routes) |
| `PORT`       | `3100`                   | Dashboard dev server port              |

### Start with Custom Ollama URL

```bash
OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
```

---

## Whisper.cpp

Whisper.cpp uses CLI flags rather than environment variables:

| Flag              | Purpose                       | Example                                            |
| ----------------- | ----------------------------- | -------------------------------------------------- |
| `--model`         | Path to GGML model file       | `--model ~/whisper-models/ggml-large-v3-turbo.bin` |
| `--language`      | Input language                | `--language en`                                    |
| `--file`          | Audio file path               | `--file recording.wav`                             |
| `--output-json`   | Output in JSON format         | `--output-json`                                    |
| `--output-srt`    | Output as SRT subtitles       | `--output-srt`                                     |
| `--output-vtt`    | Output as VTT subtitles       | `--output-vtt`                                     |
| `--translate`     | Translate to English          | `--translate`                                      |
| `--threads`       | Number of CPU threads         | `--threads 8`                                      |
| `--processors`    | Number of processors          | `--processors 1`                                   |
| `--print-colors`  | Colorize output by confidence | `--print-colors`                                   |
| `--no-timestamps` | Omit timestamps               | `--no-timestamps`                                  |
| `--port`          | Server port (whisper-server)  | `--port 8080`                                      |
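
Because configuration lives in flags, a wrapper script typically assembles the argument list itself. A minimal sketch using flags from the table; the function name `whisper_command` and its defaults are assumptions, and the binary is expected to be on `PATH` as `whisper-cli`.

```python
import os
from typing import List


def whisper_command(model: str, audio: str, language: str = "en",
                    threads: int = 8, output_json: bool = True) -> List[str]:
    """Assemble a whisper-cli argument list from the flags in the table above."""
    cmd = [
        "whisper-cli",
        "--model", os.path.expanduser(model),  # expand a leading ~ when possible
        "--file", audio,
        "--language", language,
        "--threads", str(threads),
    ]
    if output_json:
        cmd.append("--output-json")
    return cmd


cmd = whisper_command("~/whisper-models/ggml-large-v3-turbo.bin", "recording.wav")
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.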

---

## Proxy / Network (Corporate)

| Variable                       | Value on This Machine            | Purpose                                           |
| ------------------------------ | -------------------------------- | ------------------------------------------------- |
| `HTTP_PROXY`                   | `http://cso.proxy.att.com:8080/` | Corporate HTTP proxy                              |
| `HTTPS_PROXY`                  | `http://cso.proxy.att.com:8080/` | Corporate HTTPS proxy                             |
| `NODE_TLS_REJECT_UNAUTHORIZED` | `0`                              | Bypass Forcepoint SSL interception for Node.js    |
| `NO_PROXY`                     | _(not set by default)_           | Add `ollama.com,registry.ollama.ai` if pulls fail |

---

## All Paths

| Path                                 | Content                     |
| ------------------------------------ | --------------------------- |
| `~/.ollama/models/`                  | Downloaded Ollama models    |
| `~/whisper-models/`                  | Whisper GGML model files    |
| `/opt/homebrew/bin/ollama`           | Ollama binary               |
| `/opt/homebrew/bin/whisper-cli`      | Whisper CLI binary          |
| `/opt/homebrew/bin/ffmpeg`           | FFmpeg binary               |
| `__LOCAL_LLMs/dashboard/`            | Mission Control Next.js app |
| `__LOCAL_LLMs/docs/`                 | This documentation          |
| `services/extraction-service/evals/` | Promptfoo eval configs      |