# 02 — Ollama Setup & Models

> Installation, server configuration, model management, and memory behavior.

---

## Installation

```bash
brew install ollama
```

- **Version installed:** 0.16.2
- **Binary:** `/opt/homebrew/opt/ollama/bin/ollama`
- **Models stored at:** `~/.ollama/models/`
- **Config:** No config file; configured through environment variables

---

## Starting the Server

```bash
# Option A: foreground (dev, see logs)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama

# Check if running
curl http://localhost:11434/api/tags
```

**Server listens on:** `http://127.0.0.1:11434`

### Corporate Proxy Note

Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment. On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically, and model downloads go through it. Model pulls use HTTPS, so `HTTPS_PROXY` is the variable that matters for downloads.

The server process performs the download, not the `ollama pull` client, so proxy variables must be set in the environment where `ollama serve` runs. If a pull fails:

```bash
# Restart the server with the registry hosts excluded from the proxy
NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
```

Then retry `ollama pull <model>` from another terminal.

---

## Models Currently Installed

Verified 2026-02-19:

| Model               | Disk Size | Parameters | Quantization | Status    | Use Case                                      |
| ------------------- | --------- | ---------- | ------------ | --------- | --------------------------------------------- |
| `qwen2.5-coder:32b` | 18.5 GB   | 32.8B      | Q4_K_M       | ✅ Loaded | Best coding model: Swift, TypeScript, Python  |
| `llama3.1:8b`       | 4.9 GB    | 8B         | Q4_K_M       | ✅ Loaded | Default for evals, fast inference             |

### Useful Commands

```bash
# List all downloaded models (disk)
ollama list

# Show what's currently loaded in RAM
ollama ps

# Pull a new model (downloads to ~/.ollama/models/)
ollama pull <model>

# Run interactively
ollama run <model>

# Run with a one-shot prompt
ollama run qwen2.5-coder:32b "Write a Swift function for audio conversion"

# Remove a model from disk
ollama rm <model>

# Show model details (size, parameters, template)
ollama show <model>
```

---

## Memory Management

Ollama loads models into RAM on demand and keeps a model resident only while it is in use or within its idle timeout. On a 48 GB machine, plan around **one large model at a time**: a loaded 32B model takes roughly 20-22 GB.

### Key Behaviors

1. **Models are stored on disk:** you can download as many as disk allows
2. **Only models in use load into RAM:** when a newly requested model does not fit alongside the loaded ones, Ollama unloads idle models to make room (the cap on concurrently loaded models is `OLLAMA_MAX_LOADED_MODELS`)
3. **Idle timeout:** Models auto-unload after **5 minutes** of inactivity (configurable)
4. **Manual unload:** Send a request with `keep_alive: "0"` to unload immediately

### Controlling Idle Timeout

```bash
# Default: 5 minutes
ollama serve

# Unload immediately after each request (saves RAM)
OLLAMA_KEEP_ALIVE=0 ollama serve

# Keep loaded for 30 minutes
OLLAMA_KEEP_ALIVE=30m ollama serve

# Keep loaded forever (until manual unload or server restart)
OLLAMA_KEEP_ALIVE=-1 ollama serve
```

### Manual Load/Unload

```bash
# Load a model into RAM (empty prompt trick)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "10m"}'

# Unload a model from RAM immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
```

### How Many Models Can You Have Downloaded?

As many as your disk allows. Only loaded models consume RAM. Rough disk budget when planning for up to 10 models:

| Count           | Approx Disk |
| --------------- | ----------- |
| 2 (current)     | ~24 GB      |
| 5 (moderate)    | ~55 GB      |
| 10 (full stack) | ~115 GB     |

---

## OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API at:

```
Base URL: http://localhost:11434/v1
API Key:  ollama (any non-empty string)
```

### Example: curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### Example: Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```

### Example: Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```

---

## Native Ollama API

Beyond the OpenAI-compatible endpoint, Ollama has its own API:

| Endpoint          | Method | Purpose                                                |
| ----------------- | ------ | ------------------------------------------------------ |
| `/api/tags`       | GET    | List all downloaded models                             |
| `/api/ps`         | GET    | List models currently loaded in RAM                    |
| `/api/generate`   | POST   | Generate text (single-turn)                            |
| `/api/chat`       | POST   | Chat completion (multi-turn)                           |
| `/api/pull`       | POST   | Download a model                                       |
| `/api/delete`     | DELETE | Remove a model from disk                               |
| `/api/show`       | POST   | Show model metadata                                    |
| `/api/embeddings` | POST   | Generate embeddings (legacy; superseded by `/api/embed`) |

Full docs: https://github.com/ollama/ollama/blob/main/docs/api.md

---

## Performance on M4 Pro 48 GB

- **MLX warning:** `MLX dynamic library not available` is **harmless**; Ollama falls back to Metal/CPU automatically
- **Metal backend:** Inference runs on the Apple Silicon GPU through Metal, and unified memory lets the GPU address model weights without a separate VRAM copy
- **Inference speed estimates:**
  - 7B models: ~40-60 tok/s
  - 32B models: ~15-25 tok/s
  - 70B (Q4): ~5-10 tok/s
- **RAM usage (model loaded):**
  - 7B: ~5-6 GB
  - 32B: ~20-22 GB
  - 70B (Q4): ~40-42 GB

### Performance Tuning

```bash
# Enable flash attention (faster, less RAM)
OLLAMA_FLASH_ATTENTION=1 ollama serve

# KV cache quantization (smaller RAM footprint; takes effect only with flash attention enabled)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Both together
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# Allow concurrent requests (default: 1)
OLLAMA_NUM_PARALLEL=2 ollama serve
```
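
### Sketch: Will a Model Combination Fit in RAM?

The RAM estimates in this section can be turned into a quick fit check before loading models side by side. The following is a minimal sketch, not part of Ollama: `fits_in_ram` is a hypothetical helper, the per-size figures are the upper bounds of the estimates listed above, and the 8 GB reserve for macOS and other apps is an assumption rather than a measured value. Actual usage also grows with context length, so treat the result as a rough guide.

```python
# Upper-bound RAM estimates (GB) per loaded model, taken from the list above.
RAM_ESTIMATE_GB = {"7B": 6, "32B": 22, "70B": 42}

TOTAL_RAM_GB = 48
# Assumed headroom for macOS and other applications (hypothetical value).
SYSTEM_RESERVE_GB = 8


def fits_in_ram(size_classes, total_gb=TOTAL_RAM_GB, reserve_gb=SYSTEM_RESERVE_GB):
    """Return (fits, needed_gb) for loading the given size classes together."""
    needed_gb = sum(RAM_ESTIMATE_GB[size] for size in size_classes)
    return needed_gb <= total_gb - reserve_gb, needed_gb


if __name__ == "__main__":
    for combo in (["32B"], ["32B", "7B"], ["70B"], ["32B", "32B"]):
        fits, needed = fits_in_ram(combo)
        print(f"{' + '.join(combo)}: ~{needed} GB needed, fits={fits}")
```

Under these assumptions a 32B model plus a 7B model fits with room to spare, while a 70B (Q4) model only fits if the reserve is cut to about 6 GB or less.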