saravanakumardb1 80f794dee7 docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management,
  idle timeout, manual load/unload, OpenAI-compatible API, native API reference,
  performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases
  across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server,
  evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings

2026-02-19 13:01:05 -08:00


02 — Ollama Setup & Models

Installation, server configuration, model management, and memory behavior.

Installation

brew install ollama

Version installed: 0.16.2
Binary: /opt/homebrew/opt/ollama/bin/ollama
Models stored at: ~/.ollama/models/
Config: No config file — uses environment variables

Starting the Server

# Option A: foreground (dev, see logs)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama

# Check if running
curl http://localhost:11434/api/tags

Server listens on: http://127.0.0.1:11434
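
The same check can be scripted. This is a minimal sketch, assuming the `/api/tags` response is a JSON object of the form `{"models": [{"name": ...}, ...]}`. The helper names `model_names` and `list_installed` are illustrative and not part of Ollama.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://127.0.0.1:11434"  # default listen address noted above


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_installed(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: treat the server as not running.
        return None
```

A `None` result means the server did not answer; an empty list means it is running but no models have been pulled yet.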

Corporate Proxy Note

Ollama picks up HTTP_PROXY / HTTPS_PROXY from the environment. On this machine, the AT&T Forcepoint proxy (http://cso.proxy.att.com:8080/) is detected automatically, and model downloads go through it. If a pull fails through the proxy, exclude the Ollama registry hosts with NO_PROXY so the download connects directly:

NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
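
When pulls are scripted, the same bypass can be applied per process instead of per shell. This is a rough sketch: `env_bypassing_proxy` and `pull_without_proxy` are hypothetical helper names, and the host list is copied from the command above.

```python
import os
import subprocess

REGISTRY_HOSTS = ["ollama.com", "registry.ollama.ai"]  # hosts from the command above


def env_bypassing_proxy(base_env: dict, hosts: list) -> dict:
    """Return a copy of base_env with hosts appended to NO_PROXY (no duplicates)."""
    env = dict(base_env)
    existing = [h for h in env.get("NO_PROXY", "").split(",") if h]
    merged = existing + [h for h in hosts if h not in existing]
    env["NO_PROXY"] = ",".join(merged)
    return env


def pull_without_proxy(model: str) -> int:
    """Run `ollama pull <model>` with the registry hosts excluded from the proxy."""
    env = env_bypassing_proxy(os.environ, REGISTRY_HOSTS)
    return subprocess.run(["ollama", "pull", model], env=env).returncode
```

Existing NO_PROXY entries are preserved, so hosts that already bypass the proxy keep doing so.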

Models Currently Installed

Verified 2026-02-19:

Model	Size	Pull Command	Use Case
`qwen2.5-coder:32b`	19 GB	`ollama pull qwen2.5-coder:32b`	Best coding model — Swift, TypeScript, Python
`llama3.1:8b`	4.9 GB	`ollama pull llama3.1:8b`	Default for evals, fast inference

Useful Commands

# List all downloaded models (disk)
ollama list

# Show what's currently loaded in RAM
ollama ps

# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>

# Run interactively
ollama run <model>

# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"

# Remove a model from disk
ollama rm <model>

# Show model details (size, parameters, template)
ollama show <model>

Memory Management

Ollama loads a model into RAM only when a request needs it. More than one model can stay resident if memory allows (capped by OLLAMA_MAX_LOADED_MODELS); otherwise idle models are evicted to make room. This is critical for a 48 GB machine.

Key Behaviors

Models are stored on disk, so you can download as many as disk space allows
Only requested models load into RAM; when a new model does not fit, idle models are evicted first
Idle timeout: models auto-unload after 5 minutes of inactivity (configurable via OLLAMA_KEEP_ALIVE)
Manual unload: send a request with keep_alive: "0" to unload immediately

Controlling Idle Timeout

# Default: 5 minutes
ollama serve

# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve

# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve

# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve

Manual Load/Unload

# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'

# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
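
The two curl calls differ only in the `keep_alive` value, so they can share one helper. This is a minimal sketch using only the standard library; the function names are illustrative, and the request body mirrors the curl examples above.

```python
import json
import urllib.request


def keep_alive_request(model: str, keep_alive: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST to /api/generate with an empty prompt and the given keep_alive."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": keep_alive})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load(model: str, keep_alive: str = "10m") -> None:
    """Load a model into RAM; the response body is read and discarded."""
    with urllib.request.urlopen(keep_alive_request(model, keep_alive)) as resp:
        resp.read()


def unload(model: str) -> None:
    """Unload a model immediately by sending keep_alive "0"."""
    load(model, keep_alive="0")
```

For example, calling `unload("qwen2.5-coder:32b")` before `load("llama3.1:8b")` frees the larger model's RAM before an eval run.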

How Many Models Can You Have Downloaded?

As many as your disk allows. Only loaded models consume RAM. Rough disk budget for growing to 10 models, assuming an average of about 11-12 GB per model:

Count	Approx Disk
2 (current)	~24 GB
5 (moderate)	~55 GB
10 (full stack)	~115 GB
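
The "2 (current)" row is just the sum of the installed model sizes. The sketch below reproduces that arithmetic and adds a simple free-space check; the 10 GB headroom default is an arbitrary assumption, not a figure from this document.

```python
import os
import shutil

# Sizes in GB, copied from the "Models Currently Installed" table.
INSTALLED_GB = {"qwen2.5-coder:32b": 19.0, "llama3.1:8b": 4.9}


def total_disk_gb(sizes_gb: dict) -> float:
    """Sum model sizes, rounded to one decimal place."""
    return round(sum(sizes_gb.values()), 1)


def fits_on_disk(sizes_gb: dict, free_gb: float, headroom_gb: float = 10.0) -> bool:
    """True if the models plus a safety margin fit in the given free space."""
    return total_disk_gb(sizes_gb) + headroom_gb <= free_gb


def free_gb(path: str = "~") -> float:
    """Free space in GB on the volume holding `path` (home directory by default)."""
    return shutil.disk_usage(os.path.expanduser(path)).free / 1e9
```

Here `total_disk_gb(INSTALLED_GB)` gives 23.9, which the table rounds to ~24 GB.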

OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API, so existing OpenAI SDK clients work after changing only the base URL and API key:

Base URL:  http://localhost:11434/v1
API Key:   ollama  (any non-empty string)

Example: curl

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'

Example: Node.js (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

Example: Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
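
With `response_format={"type": "json_object"}`, the reply still arrives as a string in `response.choices[0].message.content` and has to be decoded by the caller. This is a small sketch of that step; `parse_json_content` is a hypothetical helper name.

```python
import json


def parse_json_content(content: str) -> dict:
    """Decode a JSON-object string such as response.choices[0].message.content."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        # json_object mode should yield an object; reject arrays and scalars.
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

Raising `ValueError` in both failure cases gives callers a single exception type to handle when a smaller model produces malformed output.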

Native Ollama API

Beyond the OpenAI-compatible endpoint, Ollama has its own API:

Endpoint	Method	Purpose
`/api/tags`	GET	List all downloaded models
`/api/ps`	GET	List models currently loaded in RAM
`/api/generate`	POST	Generate text (single-turn)
`/api/chat`	POST	Chat completion (multi-turn)
`/api/pull`	POST	Download a model
`/api/delete`	DELETE	Remove a model from disk
`/api/show`	POST	Show model metadata
`/api/embed`	POST	Generate embeddings (supersedes the legacy `/api/embeddings`)

Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
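
For comparison with the OpenAI-style examples, here is a rough sketch of a non-streaming call to `/api/chat` using only the standard library. It assumes the non-streaming response carries the reply under `message.content`; the function names are illustrative.

```python
import json
import urllib.request


def chat_payload(model: str, messages: list) -> dict:
    """Request body for /api/chat; stream=False asks for one JSON response."""
    return {"model": model, "messages": messages, "stream": False}


def reply_text(response: dict) -> str:
    """Pull the assistant text out of a decoded /api/chat response."""
    return response["message"]["content"]


def chat(model: str, messages: list, base_url: str = "http://localhost:11434") -> str:
    """Send a chat request to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(chat_payload(model, messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return reply_text(json.load(resp))
```

Usage would look like `chat("llama3.1:8b", [{"role": "user", "content": "Hello"}])`, which requires the server to be running and the model to be pulled.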

Performance on M4 Pro 48 GB

MLX warning: MLX dynamic library not available — harmless, falls back to Metal/CPU automatically
Metal backend: inference runs on the Apple Silicon GPU through Metal; unified memory lets the GPU read model weights directly from system RAM
Inference speed estimates:
- 7B models: ~40-60 tok/s
- 32B models: ~15-25 tok/s
- 70B (Q4): ~5-10 tok/s
RAM usage (model loaded):
- 7B: ~5-6 GB
- 32B: ~20-22 GB
- 70B (Q4): ~40-42 GB
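
These estimates translate into wait times and memory headroom with simple arithmetic. The sketch below uses only the figures listed above, which are estimates rather than measurements; the function names are illustrative.

```python
# (low, high) estimates copied from the lists above.
SPEED_TOK_S = {"7B": (40, 60), "32B": (15, 25), "70B (Q4)": (5, 10)}
RAM_GB = {"7B": (5, 6), "32B": (20, 22), "70B (Q4)": (40, 42)}
TOTAL_RAM_GB = 48


def generation_seconds(size: str, tokens: int) -> tuple:
    """Return (best, worst) seconds to generate `tokens` output tokens."""
    slow, fast = SPEED_TOK_S[size]
    return (tokens / fast, tokens / slow)


def ram_headroom_gb(size: str) -> int:
    """RAM left for the OS and other apps, using the high end of the estimate."""
    return TOTAL_RAM_GB - RAM_GB[size][1]
```

By these numbers a 300-token reply from a 32B model takes about 12 to 20 seconds, and a loaded 70B (Q4) model leaves roughly 6 GB for everything else on the machine.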

Performance Tuning

# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quantization (smaller RAM footprint; takes effect only when flash attention is enabled)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Both together
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
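
When the server is launched from a script rather than a shell, the same variables can be assembled in one place. This is a minimal sketch: `serve_env` and `start_server` are hypothetical helpers, and only the environment variables shown above are used.

```python
import os
import subprocess
from typing import Optional


def serve_env(flash_attention: bool = True,
              kv_cache_type: Optional[str] = None,
              num_parallel: Optional[int] = None,
              base_env: Optional[dict] = None) -> dict:
    """Build the environment for `ollama serve` from the tuning options above."""
    env = dict(os.environ if base_env is None else base_env)
    if flash_attention:
        env["OLLAMA_FLASH_ATTENTION"] = "1"
    if kv_cache_type is not None:
        env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type  # e.g. "q8_0"
    if num_parallel is not None:
        env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


def start_server(**options) -> subprocess.Popen:
    """Start `ollama serve` in the background with the chosen options."""
    return subprocess.Popen(["ollama", "serve"], env=serve_env(**options))
```

Options left as `None` are not set at all, so the server falls back to its own defaults for them.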
