02 — Ollama Setup & Models
Installation, server configuration, model management, and memory behavior.
Installation
brew install ollama
- Version installed: 0.16.2
- Binary: /opt/homebrew/opt/ollama/bin/ollama
- Models stored at: ~/.ollama/models/
- Config: no config file; uses environment variables
Starting the Server
# Option A: foreground (dev, see logs)
ollama serve
# Option B: background service (auto-start at login)
brew services start ollama
# Check if running
curl http://localhost:11434/api/tags
Server listens on: http://127.0.0.1:11434
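The same health check can be scripted. The sketch below uses only the Python standard library and assumes the /api/tags response has the shape {"models": [{"name": ...}, ...]}. The helper names are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # address the server listens on


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers URLError, connection refused, and timeouts
        return None


if __name__ == "__main__":
    names = installed_models()
    print("server not reachable" if names is None else names)
```

Returning None for an unreachable server keeps "not running" distinct from "running with no models", which would be an empty list.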
Corporate Proxy Note
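Ollama picks up HTTP_PROXY / HTTPS_PROXY from the environment. On this machine, the AT&T Forcepoint proxy (http://cso.proxy.att.com:8080/) is detected automatically, and model downloads go through it. If a pull fails, retry with the registry hosts excluded from the proxy (the download is performed by the server process, so the same variable may also need to be set where ollama serve runs):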
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
Models Currently Installed
Verified 2026-02-19:
| Model | Disk Size | Parameters | Quantization | Status | Use Case |
|---|---|---|---|---|---|
| qwen2.5-coder:32b | 18.5 GB | 32.8B | Q4_K_M | ✅ Loaded | Best coding model (Swift, TypeScript, Python) |
| llama3.1:8b | 4.9 GB | 8B | Q4_K_M | ✅ Loaded | Default for evals, fast inference |
Useful Commands
# List all downloaded models (disk)
ollama list
# Show what's currently loaded in RAM
ollama ps
# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>
# Run interactively
ollama run <model>
# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"
# Remove a model from disk
ollama rm <model>
# Show model details (size, parameters, template)
ollama show <model>
Memory Management
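Ollama loads a model into RAM only when it is requested and unloads it when idle, so downloaded models cost no memory until they are used. This is critical for a 48 GB machine, where a 32B model alone needs about 20-22 GB. Recent versions may keep more than one model resident when memory allows; setting OLLAMA_MAX_LOADED_MODELS=1 on the server enforces one at a time.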
Key Behaviors
- Models are stored on disk; you can download as many as disk space allows
- Only models in use occupy RAM; a previously loaded model is evicted when the next one needs its memory
- Idle timeout: models auto-unload after 5 minutes of inactivity (configurable)
- Manual unload: send a request with keep_alive: "0" to unload immediately
Controlling Idle Timeout
# Default: 5 minutes
ollama serve
# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve
# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve
# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve
Manual Load/Unload
# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'
# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
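The two curl calls above can be wrapped in one helper. This is a minimal sketch using the standard library; the function names are hypothetical, and "stream": false is added on the assumption that a single JSON object is easier to read back than a chunked stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def keep_alive_payload(model: str, keep_alive: str) -> dict:
    """Build an /api/generate body that only changes residency (empty prompt)."""
    return {"model": model, "prompt": "", "keep_alive": keep_alive, "stream": False}


def set_residency(model: str, keep_alive: str, base_url: str = OLLAMA_URL) -> dict:
    """POST the payload: keep_alive="10m" loads the model, keep_alive="0" unloads it."""
    body = json.dumps(keep_alive_payload(model, keep_alive)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    # Loading a large model can take a while, hence the generous timeout.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```

For example, set_residency("qwen2.5-coder:32b", "10m") would preload the coding model, and set_residency("qwen2.5-coder:32b", "0") would release its RAM.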
How Many Models Can You Have Downloaded?
As many as your disk allows. Only the loaded model consumes RAM. Plan for 10 models:
| Count | Approx Disk |
|---|---|
| 2 (current) | ~24 GB |
| 5 (moderate) | ~55 GB |
| 10 (full stack) | ~115 GB |
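The arithmetic behind these rows can be sketched as follows. The installed sizes come from the models table; the per-model averages are simply what the 5-model and 10-model rows imply, not measured values.

```python
# Disk sizes in GB, taken from the installed-models table.
installed_gb = {"qwen2.5-coder:32b": 18.5, "llama3.1:8b": 4.9}

# Sum of what is on disk today (the "2 (current)" row rounds this up).
current_total = round(sum(installed_gb.values()), 1)


def projected_disk_gb(model_count: int, avg_gb_per_model: float) -> float:
    """Rough disk estimate: number of models times an assumed average size."""
    return round(model_count * avg_gb_per_model, 1)


print(f"current: {current_total} GB")
print(f"5 models at 11.0 GB each: {projected_disk_gb(5, 11.0)} GB")
print(f"10 models at 11.5 GB each: {projected_disk_gb(10, 11.5)} GB")
```

An average of 11-12 GB per model corresponds to a mix of small (8B) and large (32B) models at Q4 quantization; a stack dominated by 32B or 70B models would need considerably more.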
OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API, so existing OpenAI clients work after changing the base URL:
Base URL: http://localhost:11434/v1
API Key: ollama (ignored by Ollama, but SDKs require a non-empty string)
Example: curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
Example: Node.js (OpenAI SDK)
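import OpenAI from 'openai';
// Point the OpenAI SDK at the local Ollama server
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});
// Top-level await requires an ES module (.mjs or "type": "module")
const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
// The reply text is a JSON string
console.log(res.choices[0].message.content);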
Example: Python
from openai import OpenAI
# Point the OpenAI SDK at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
# The reply text is a JSON string
print(response.choices[0].message.content)
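With response_format set to json_object, the reply still arrives as a string and has to be decoded by the caller. A small defensive sketch, where parse_json_reply is a hypothetical helper and the assumption is that truncated or empty replies should yield None rather than raise:

```python
import json


def parse_json_reply(content):
    """Decode the message content returned in JSON mode.

    Returns the decoded object, or None if the content is missing or is
    not valid JSON (for example, when generation was cut off early).
    """
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None


# With the Python client above, this would be called as:
#   data = parse_json_reply(response.choices[0].message.content)
print(parse_json_reply('{"hello": "world"}'))
print(parse_json_reply('{"hello": '))
```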
Native Ollama API
Beyond the OpenAI-compatible endpoint, Ollama has its own API:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/tags | GET | List all downloaded models |
| /api/ps | GET | List models currently loaded in RAM |
| /api/generate | POST | Generate text (single-turn) |
| /api/chat | POST | Chat completion (multi-turn) |
| /api/pull | POST | Download a model |
| /api/delete | DELETE | Remove a model from disk |
| /api/show | POST | Show model metadata |
| /api/embeddings | POST | Generate embeddings |
Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
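As an illustration of the native API, here is a minimal sketch of a non-streaming /api/chat call using only the standard library. It assumes the response carries the reply under message.content; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def chat_payload(model: str, user_text: str) -> dict:
    """Build a non-streaming /api/chat request body with one user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }


def chat(model: str, user_text: str, base_url: str = OLLAMA_URL) -> str:
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(chat_payload(model, user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

A call such as chat("llama3.1:8b", "Summarize: ...") needs no SDK, which can be convenient in scripts where installing the openai package is not an option.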
Performance on M4 Pro 48 GB
- MLX warning: the message "MLX dynamic library not available" is harmless; Ollama falls back to Metal/CPU automatically
- Metal backend: fully utilized on Apple Silicon; inference runs on the GPU, which shares unified memory with the CPU
- Inference speed estimates:
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
- RAM usage (model loaded):
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB
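The RAM figures above translate into a quick fit check for this machine. In the sketch below, the 8 GB headroom for macOS and other apps is an assumed figure, not a measurement, and the model sizes are the upper ends of the estimates listed.

```python
TOTAL_RAM_GB = 48

# Upper ends of the loaded-RAM estimates listed above.
MODEL_RAM_GB = {"7B": 6, "32B": 22, "70B (Q4)": 42}


def fits_in_ram(model: str, headroom_gb: float = 8.0) -> bool:
    """True if the model fits while leaving headroom_gb free for the OS and apps.

    headroom_gb is an assumption; adjust it to what the machine actually needs.
    """
    return MODEL_RAM_GB[model] + headroom_gb <= TOTAL_RAM_GB


for name in MODEL_RAM_GB:
    print(name, fits_in_ram(name))
```

Under this assumption a 70B Q4 model is borderline: it fits only if less than about 6 GB is reserved for everything else, and long contexts add KV cache on top of the weights.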
Performance Tuning
# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve
# KV cache quantization (smaller RAM footprint; only takes effect with flash attention enabled)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# Both together
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
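When the server is launched from a script rather than a shell, the same variables can be assembled programmatically. A minimal sketch, assuming the ollama binary is on PATH; tuning_env and start_server are hypothetical helper names.

```python
import os
import subprocess


def tuning_env(flash_attention=True, kv_cache_type="q8_0", num_parallel=1, base_env=None):
    """Return a copy of the environment with the tuning variables above applied."""
    env = dict(os.environ if base_env is None else base_env)
    if flash_attention:
        env["OLLAMA_FLASH_ATTENTION"] = "1"
    if kv_cache_type:
        env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


def start_server(**options) -> subprocess.Popen:
    """Start `ollama serve` as a child process with the chosen settings."""
    return subprocess.Popen(["ollama", "serve"], env=tuning_env(**options))
```

For example, start_server(num_parallel=2) would correspond to running all three variables together on the command line, with two concurrent requests allowed.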