learning_ai_common_plat/__LOCAL_LLMs/docs/02-ollama-setup-and-models.md
saravanakumardb1 80f794dee7 docs(local-llm): add Ollama setup, extraction evals, and env vars reference
- docs/02-ollama-setup-and-models.md: installation, server config, memory management,
  idle timeout, manual load/unload, OpenAI-compatible API, native API reference,
  performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases
  across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server,
  evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings
2026-02-19 13:01:05 -08:00


02 — Ollama Setup & Models

Installation, server configuration, model management, and memory behavior.


Installation

brew install ollama
  • Version installed: 0.16.2
  • Binary: /opt/homebrew/opt/ollama/bin/ollama
  • Models stored at: ~/.ollama/models/
  • Config: No config file — uses environment variables

Starting the Server

# Option A: foreground (dev, see logs)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama

# Check if running
curl http://localhost:11434/api/tags

Server listens on: http://127.0.0.1:11434
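The same check can be scripted. The following is a minimal sketch using only the Python standard library, assuming the default address above; the function name `ollama_is_up` is illustrative and not part of Ollama. It bypasses any configured proxy so that a corporate proxy does not intercept the localhost request.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default listen address from this doc


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /api/tags answers with HTTP 200 and a JSON body."""
    # An empty ProxyHandler disables HTTP_PROXY / HTTPS_PROXY for this call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError):
        # URLError, connection refused, and timeouts are all OSError subclasses.
        return False
```

Calling `ollama_is_up()` before running evals gives a quick yes/no answer without parsing curl output.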

Corporate Proxy Note

Ollama auto-detects HTTP_PROXY / HTTPS_PROXY from the environment. On this machine, the AT&T Forcepoint proxy (http://cso.proxy.att.com:8080/) is picked up automatically, so model downloads go through it. If a pull fails, bypass the proxy for the Ollama registry hosts:

NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>

Models Currently Installed

Verified 2026-02-19:

| Model | Size | Pull Command | Use Case |
|---|---|---|---|
| qwen2.5-coder:32b | 19 GB | ollama pull qwen2.5-coder:32b | Best coding model: Swift, TypeScript, Python |
| llama3.1:8b | 4.9 GB | ollama pull llama3.1:8b | Default for evals, fast inference |

Useful Commands

# List all downloaded models (disk)
ollama list

# Show what's currently loaded in RAM
ollama ps

# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>

# Run interactively
ollama run <model>

# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"

# Remove a model from disk
ollama rm <model>

# Show model details (size, parameters, template)
ollama show <model>

Memory Management

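Ollama keeps a model in RAM only while it is in use or inside its keep-alive window. More than one model can stay loaded when they all fit in memory (the cap is OLLAMA_MAX_LOADED_MODELS), but on a 48 GB machine a 32B model (~20-22 GB) leaves room for little else, so plan on one large model at a time.

Key Behaviors

  1. Models are stored on disk, so you can download as many as disk space allows
  2. Only requested models load into RAM; when a new model does not fit, idle models are evicted to make room
  3. Idle timeout: models auto-unload after 5 minutes of inactivity (configurable)
  4. Manual unload: send a request with keep_alive: "0" to unload immediately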

Controlling Idle Timeout

# Default: 5 minutes
ollama serve

# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve

# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve

# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve

Manual Load/Unload

# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'

# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
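The two curl calls above differ only in the keep_alive value, so they fold into one helper. This is a sketch under assumptions: the helper names are hypothetical, and `"stream": False` is added (it is not in the curl examples) so the reply arrives as a single JSON object.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def keep_alive_payload(model: str, keep_alive: str) -> bytes:
    """Build the request body: an empty prompt plus the desired keep_alive."""
    body = {"model": model, "prompt": "", "keep_alive": keep_alive, "stream": False}
    return json.dumps(body).encode("utf-8")


def set_keep_alive(model: str, keep_alive: str, url: str = GENERATE_URL) -> dict:
    """POST the payload; "10m" loads the model, "0" unloads it immediately."""
    req = urllib.request.Request(
        url,
        data=keep_alive_payload(model, keep_alive),
        headers={"Content-Type": "application/json"},
    )
    # Skip any configured proxy for the localhost call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=300) as resp:  # loading a 32B model can be slow
        return json.load(resp)
```

For example, `set_keep_alive("qwen2.5-coder:32b", "10m")` preloads the coding model and `set_keep_alive("qwen2.5-coder:32b", "0")` frees its RAM.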

How Many Models Can You Have Downloaded?

As many as your disk allows. Only the loaded model consumes RAM. Plan for 10 models:

| Count | Approx Disk |
|---|---|
| 2 (current) | ~24 GB |
| 5 (moderate) | ~55 GB |
| 10 (full stack) | ~115 GB |
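The "~24 GB" figure for the current setup is the sum of the two installed model sizes (19 GB + 4.9 GB). The sketch below repeats that arithmetic and compares a planned total against free disk space. The function names are illustrative, and checking the home directory's volume assumes ~/.ollama/models/ lives there.

```python
import shutil
from pathlib import Path

# Sizes in GB, copied from the installed-models table in this doc.
INSTALLED_GB = {"qwen2.5-coder:32b": 19.0, "llama3.1:8b": 4.9}


def total_gb(sizes: dict) -> float:
    """Sum model sizes, rounded to one decimal place."""
    return round(sum(sizes.values()), 1)


def free_gb(path: Path = Path.home()) -> float:
    """Free space in GB (1 GB = 1e9 bytes) on the volume holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def fits_on_disk(planned_gb: float, path: Path = Path.home()) -> bool:
    """True if a planned model collection fits in the remaining free space."""
    return planned_gb <= free_gb(path)
```

`total_gb(INSTALLED_GB)` gives 23.9, and `fits_on_disk(115)` answers whether the 10-model plan would fit today.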

OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API at:

Base URL:  http://localhost:11434/v1
API Key:   ollama  (any non-empty string)

Example: curl

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'

Example: Node.js (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

Example: Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
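With response_format set to json_object, the message content still arrives as a string, so the caller has to decode it. This is a small sketch of that step; `parse_json_content` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json
from typing import Optional


def parse_json_content(content: Optional[str]) -> dict:
    """Decode message.content into a dict, failing with a clear error."""
    if not content:
        raise ValueError("model returned empty content")
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Usage with the `response` object from the example above:
# items = parse_json_content(response.choices[0].message.content)
```

Raising one exception type for empty, malformed, and non-object output keeps the calling code to a single except clause.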

Native Ollama API

Beyond the OpenAI-compatible endpoint, Ollama has its own API:

| Endpoint | Method | Purpose |
|---|---|---|
| /api/tags | GET | List all downloaded models |
| /api/ps | GET | List models currently loaded in RAM |
| /api/generate | POST | Generate text (single-turn) |
| /api/chat | POST | Chat completion (multi-turn) |
| /api/pull | POST | Download a model |
| /api/delete | DELETE | Remove a model from disk |
| /api/show | POST | Show model metadata |
| /api/embeddings | POST | Generate embeddings |

Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
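As an illustration of the two GET endpoints, the sketch below lists model names on disk and in RAM. It assumes both responses carry a top-level "models" list whose entries have a "name" field, as described in the API docs linked above; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def model_names(payload: dict) -> list:
    """Extract model names from an /api/tags or /api/ps response body."""
    return [m["name"] for m in payload.get("models", [])]


def get_json(path: str, base_url: str = BASE_URL) -> dict:
    """GET a native API endpoint and decode the JSON body, skipping proxies."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}{path}", timeout=5) as resp:
        return json.load(resp)


# Usage (requires a running server):
# print("on disk:", model_names(get_json("/api/tags")))
# print("in RAM: ", model_names(get_json("/api/ps")))
```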


Performance on M4 Pro 48 GB

  • MLX warning: "MLX dynamic library not available" is harmless; Ollama falls back to Metal/CPU automatically
  • Metal backend: inference runs on the Apple Silicon GPU, which shares unified memory with the CPU, so model weights need no separate VRAM copy
  • Inference speed estimates:
    • 7B models: ~40-60 tok/s
    • 32B models: ~15-25 tok/s
    • 70B (Q4): ~5-10 tok/s
  • RAM usage (model loaded):
    • 7B: ~5-6 GB
    • 32B: ~20-22 GB
    • 70B (Q4): ~40-42 GB
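The RAM figures above can be sanity-checked with a rough rule of thumb: weight size is about parameter count times bits per weight divided by 8. The default of 4.85 bits per weight is an assumption for a typical 4-bit quantization, not a value from Ollama. The result covers weights only; the KV cache and runtime overhead come on top, which is why loaded sizes run higher.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.85 is an assumed average for 4-bit quantization.
    """
    return params_billion * bits_per_weight / 8


# Rough estimates for the model classes listed above (weights only).
for size in (7, 32, 70):
    print(f"{size}B: ~{weights_gb(size):.1f} GB of weights")
```

For a 32B model this gives about 19.4 GB, in line with the 19 GB download size of qwen2.5-coder:32b.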

Performance Tuning

# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quantization (smaller RAM footprint); it only takes effect
# when flash attention is enabled, so set both together
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
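When the server is started from a script rather than a shell, the same flags can be passed through the process environment. This is a sketch assuming `ollama` is on PATH; `tuned_env` and `start_server` are hypothetical helper names.

```python
import os
import subprocess
from typing import Optional


def tuned_env(flash_attention: bool = True,
              kv_cache_type: Optional[str] = "q8_0",
              num_parallel: int = 1,
              base: Optional[dict] = None) -> dict:
    """Copy the base environment and add the tuning variables from this doc."""
    env = dict(os.environ if base is None else base)
    if flash_attention:
        env["OLLAMA_FLASH_ATTENTION"] = "1"
    if kv_cache_type:
        env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


def start_server(**options) -> subprocess.Popen:
    """Launch `ollama serve` in the background with the tuned environment."""
    return subprocess.Popen(["ollama", "serve"], env=tuned_env(**options))
```

`start_server(num_parallel=2)` mirrors the last shell example; the returned Popen handle can be used to stop the server with `.terminate()`.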