# 02 — Ollama Setup & Models

> Installation, server configuration, model management, and memory behavior.

---

## Installation

```bash
brew install ollama
```

- **Version installed:** 0.16.2
- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
- **Models stored at:** `~/.ollama/models/`
- **Config:** No config file — uses environment variables

---

## Starting the Server

```bash
# Option A: foreground (dev, see logs)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama

# Check if running
curl http://localhost:11434/api/tags
```

**Server listens on:** `http://127.0.0.1:11434`
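
The same check can be scripted. The following is a minimal sketch, assuming the default address above; the function name `ollama_is_up` is illustrative and not part of Ollama. It bypasses any configured proxy for this call, since a corporate `HTTP_PROXY` would otherwise be applied to the localhost request by `urllib`.

```python
import json
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if GET /api/tags answers with HTTP 200 and a JSON body."""
    # An empty ProxyHandler disables proxy lookup from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a body that is not valid JSON.
        return False


if __name__ == "__main__":
    print("server up" if ollama_is_up() else "server not reachable")
```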

### Corporate Proxy Note

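Ollama picks up `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is detected automatically and model downloads go through it. Downloads are performed by the server process, not by the `ollama pull` client, so proxy variables only matter where `ollama serve` runs. If a pull fails, restart the server with the registry hosts excluded from the proxy, then pull again:

```bash
# Restart the server so registry traffic bypasses the proxy
NO_PROXY="ollama.com,registry.ollama.ai" ollama serve

# In another terminal
ollama pull <model>
```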

---

## Models Currently Installed

Verified 2026-02-19:

| Model               | Disk Size | Parameters | Quantization | Status    | Use Case                                      |
| ------------------- | --------- | ---------- | ------------ | --------- | --------------------------------------------- |
| `qwen2.5-coder:32b` | 18.5 GB   | 32.8B      | Q4_K_M       | ✅ Loaded | Best coding model — Swift, TypeScript, Python |
| `llama3.1:8b`       | 4.9 GB    | 8B         | Q4_K_M       | ✅ Loaded | Default for evals, fast inference             |

### Useful Commands

```bash
# List all downloaded models (disk)
ollama list

# Show what's currently loaded in RAM
ollama ps

# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>

# Run interactively
ollama run <model>

# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"

# Remove a model from disk
ollama rm <model>

# Show model details (size, parameters, template)
ollama show <model>
```

---

## Memory Management

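Ollama keeps a model in RAM only while it is in use or within its keep-alive window. Several models can be loaded at once if they fit (capped by `OLLAMA_MAX_LOADED_MODELS`, documented default 3); when a new model does not fit, idle models are unloaded to make room. On a 48 GB machine this matters: a 32B model takes roughly 20-22 GB and a 70B Q4 model roughly 40-42 GB, so large models effectively run **one at a time**.

### Key Behaviors

1. **Models are stored on disk:** you can download as many as disk allows
2. **Only requested models load into RAM:** if a newly requested model does not fit alongside what is already loaded, idle models are evicted first
3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately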
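
To make the RAM constraint concrete, here is a rough check, assuming the approximate loaded sizes listed under "Performance on M4 Pro 48 GB" in this document. The 8 GB reserve for macOS and other apps is an assumption, and the keys in `APPROX_RAM_GB` are illustrative labels, not Ollama model tags.

```python
# Upper-bound loaded sizes in GB, taken from this document's own estimates.
APPROX_RAM_GB = {"7b": 6, "32b": 22, "70b-q4": 42}


def fits_in_ram(model_classes, total_gb=48, reserve_gb=8):
    """Return True if the listed model classes fit within total_gb minus reserve_gb."""
    needed = sum(APPROX_RAM_GB[m] for m in model_classes)
    return needed <= total_gb - reserve_gb


print(fits_in_ram(["32b", "7b"]))  # 28 GB needed vs 40 GB budget
print(fits_in_ram(["70b-q4"]))     # 42 GB needed vs 40 GB budget
```

With these numbers a 32B model plus a 7B model fits, while a 70B Q4 model only fits if the reserve is cut below 6 GB.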

### Controlling Idle Timeout

```bash
# Default: 5 minutes
ollama serve

# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve

# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve

# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve
```

### Manual Load/Unload

```bash
# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'

# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
```
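
The two curl calls above differ only in `keep_alive`, so they fold into one helper. This is a sketch using only the standard library; `keep_alive_request` and `send` are hypothetical names, and `"stream": false` is added so the server replies with a single JSON object.

```python
import json
import urllib.request


def keep_alive_request(model, keep_alive, base_url="http://localhost:11434"):
    """Build a POST /api/generate request with an empty prompt.

    keep_alive="10m" loads the model and keeps it for 10 minutes;
    keep_alive="0" unloads it immediately.
    """
    body = json.dumps(
        {"model": model, "prompt": "", "keep_alive": keep_alive, "stream": False}
    ).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=300):
    """Send the request directly to the local server, ignoring proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (needs a running server):
#   send(keep_alive_request("qwen2.5-coder:32b", "10m"))  # load
#   send(keep_alive_request("qwen2.5-coder:32b", "0"))    # unload
```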

### How Many Models Can You Have Downloaded?

As many as your disk allows; only models that are currently loaded consume RAM. Rough disk budget when planning for up to 10 models:

| Count           | Approx Disk |
| --------------- | ----------- |
| 2 (current)     | ~24 GB      |
| 5 (moderate)    | ~55 GB      |
| 10 (full stack) | ~115 GB     |
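
The table works out to roughly 11.5 GB per model at the 10-model mark (115 / 10). A quick way to compare a planned budget against free disk space, sketched with the standard library; the helper names are illustrative, and the home directory is assumed to sit on the same volume as `~/.ollama/models/`.

```python
import os
import shutil


def headroom_gb(free_bytes, planned_gb):
    """Return the GB left over after reserving planned_gb out of free_bytes."""
    return free_bytes / 1e9 - planned_gb


def model_disk_headroom_gb(planned_gb, path=None):
    """Headroom on the volume holding the models directory (defaults to $HOME)."""
    path = path or os.path.expanduser("~")
    return headroom_gb(shutil.disk_usage(path).free, planned_gb)


if __name__ == "__main__":
    # A negative result means the 10-model budget from the table does not fit.
    print(f"{model_disk_headroom_gb(115):.1f} GB left after a 115 GB model budget")
```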

---

## OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API, so existing OpenAI SDK clients work once pointed at the local base URL:

```
Base URL:  http://localhost:11434/v1
API Key:   ollama  (any non-empty string)
```

### Example: curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### Example: Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```

### Example: Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```
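
The examples above stop at the request. Reading the result might look like the sketch below; `parse_json_reply` is a hypothetical helper. JSON mode is expected to return valid JSON, but a reply cut off by a token limit can still fail to parse, so the helper returns `None` rather than raising.

```python
import json


def parse_json_reply(content):
    """Parse a JSON-mode reply; return None if it is empty or not valid JSON."""
    if not content:
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


# With the `response` object from the Python example above:
#   data = parse_json_reply(response.choices[0].message.content)
#   if data is None:
#       ...  # retry or log the raw content
```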

---

## Native Ollama API

Beyond the OpenAI-compatible endpoint, Ollama has its own API:

| Endpoint          | Method | Purpose                             |
| ----------------- | ------ | ----------------------------------- |
| `/api/tags`       | GET    | List all downloaded models          |
| `/api/ps`         | GET    | List models currently loaded in RAM |
| `/api/generate`   | POST   | Generate text (single-turn)         |
| `/api/chat`       | POST   | Chat completion (multi-turn)        |
| `/api/pull`       | POST   | Download a model                    |
| `/api/delete`     | DELETE | Remove a model from disk            |
| `/api/show`       | POST   | Show model metadata                 |
| `/api/embed`      | POST   | Generate embeddings                 |

Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md
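
As a small illustration of the two GET endpoints, the sketch below lists downloaded and loaded models with their sizes. It assumes the response shape described in the API docs (a top-level `models` array whose entries carry `name` and `size` in bytes); `get_json` and `summarize` are illustrative names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def get_json(path, base_url=BASE_URL, timeout=5):
    """GET a native API path from the local server, ignoring proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}{path}", timeout=timeout) as resp:
        return json.load(resp)


def summarize(payload):
    """Return (name, size in GB) pairs from a /api/tags or /api/ps response."""
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in payload.get("models", [])]


# Usage (needs a running server):
#   print("on disk:", summarize(get_json("/api/tags")))
#   print("in RAM: ", summarize(get_json("/api/ps")))
```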

---

## Performance on M4 Pro 48 GB

- **MLX warning:** `MLX dynamic library not available` — **harmless**, falls back to Metal/CPU automatically
- **Metal backend:** inference runs on the Apple Silicon GPU through Metal; unified memory lets the GPU read model weights directly, with no separate VRAM copy
- **Inference speed estimates:**
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
- **RAM usage (model loaded):**
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB
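
The speed estimates above can be checked against a real run. Non-streaming `/api/generate` responses include `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so tokens per second follows directly. This is a sketch, not a benchmark harness; `measure` is a hypothetical helper, and a single prompt gives only a rough figure.

```python
import json
import urllib.request


def tokens_per_second(result):
    """Generation speed from a non-streaming /api/generate response."""
    # eval_duration is in nanoseconds, hence the 1e9 factor.
    return result["eval_count"] * 1e9 / result["eval_duration"]


def measure(model, prompt, base_url="http://localhost:11434"):
    """Run one prompt against the local server and return tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=600) as resp:
        return tokens_per_second(json.load(resp))


# Usage (needs a running server):
#   print(measure("llama3.1:8b", "Explain unified memory in two sentences."))
```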

### Performance Tuning

```bash
# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quantization (smaller RAM footprint).
# Only takes effect when flash attention is enabled, so set both together.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
```