docs: add local LLM setup guide for Apple Silicon Mac (48GB)
- Add __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md: comprehensive reference for running Ollama on the dev Mac covering installation (v0.16.2 via brew), corp proxy handling (AT&T Forcepoint), OpenAI-compat API usage examples (curl/Node/Python), extraction-service eval integration, Python sidecar wiring, model recommendations by use case, troubleshooting, and env var reference
- Models documented: llama3.1:8b (4.9GB, default evals), qwen2.5-coder:32b (19GB, code gen / Swift / TS)

Commit dd23f6cf96 (parent f0accc0946)
__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md (new file, +278 lines)

# Local LLM Setup — ByteLyst / LysnrAI / MindLyst

> Everything needed to run local OSS models for development, evals, and offline experimentation.
>
> Last updated: 2026-02-19

---

## Overview

We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:

- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
- **Python sidecar** (LangExtract), which can be pointed at Ollama instead of Gemini
- Any OpenAI SDK client: set `baseURL` to the URL above and `apiKey` to any non-empty string (for example `"ollama"`)

---

## Installation

### 1. Install Ollama

```bash
brew install ollama
```

- Version installed: **0.16.2**
- Binary: `/opt/homebrew/opt/ollama/bin/ollama`
- Models stored at: `~/.ollama/models/`

### 2. Start the server

```bash
# Option A: foreground (dev)
ollama serve

# Option B: background service (auto-start at login)
brew services start ollama
```

Server listens on: `http://127.0.0.1:11434`
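
To confirm from a script that the server is listening, a minimal sketch like the following can help. It assumes the native `GET /api/tags` endpoint (the same one used under Troubleshooting) and deliberately bypasses any `HTTP_PROXY` setting so the request to localhost is not routed through the corporate proxy. The function name `ollama_is_up` is illustrative, not part of any Ollama SDK.

```python
import json
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers GET /api/tags at base_url."""
    # An empty ProxyHandler disables proxy lookup, so localhost traffic
    # never goes through a proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a healthy server returns a JSON document
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a non-JSON body from some other service on that port.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

This is handy as a pre-flight check in eval scripts, so a missing server fails fast instead of surfacing as a timeout inside the first test case.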
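
> **Corporate proxy note:** Ollama reads the standard proxy variables (`HTTP_PROXY` / `HTTPS_PROXY`) from the environment. Model pulls use HTTPS, so `HTTPS_PROXY` is the one that matters for downloads.
> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
> Model downloads go through it. The download is performed by the server process, not by the `ollama pull` client, so if a pull fails with SSL errors, try restarting the server with the registry hosts excluded and pull again:
>
> ```bash
> NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
> ollama pull <model>   # from a second terminal
> ```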
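
If `NO_PROXY` is already set on the machine, overwriting it would drop the existing entries. The sketch below shows one way to append the two registry hosts without losing what is there. It assumes the server is launched from Python and that the server process is the one performing the download; `merge_no_proxy` is a hypothetical helper name.

```python
import os
import subprocess


def merge_no_proxy(existing: str, extra_hosts: list[str]) -> str:
    """Append extra_hosts to a comma-separated NO_PROXY value, skipping duplicates."""
    hosts = [h.strip() for h in existing.split(",") if h.strip()]
    for host in extra_hosts:
        if host not in hosts:
            hosts.append(host)
    return ",".join(hosts)


if __name__ == "__main__":
    env = dict(os.environ)
    env["NO_PROXY"] = merge_no_proxy(
        env.get("NO_PROXY", ""), ["ollama.com", "registry.ollama.ai"]
    )
    # Runs the server in the foreground with the extended bypass list.
    subprocess.run(["ollama", "serve"], env=env, check=False)
```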

### 3. Pull a model

```bash
ollama pull llama3.1:8b   # recommended default (4.9 GB)
ollama pull qwen2.5:7b    # strong JSON output (4.7 GB)
ollama pull phi4          # good reasoning (8.5 GB)
```

---

## Models Installed

| Model               | Size   | Pull command                    | Notes                                         |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | ✅ Installed; default for evals               |
| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | ✅ Installed; best for code gen / Swift / TS  |

Check installed models:

```bash
ollama list
```
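
The same information is available programmatically from `GET /api/tags`. The sketch below summarizes such a payload, assuming it has the shape `{"models": [{"name": ..., "size": <bytes>}, ...]}`. The sample sizes are rounded illustrations, not measured values.

```python
def summarize_models(tags_payload: dict) -> list[tuple[str, float]]:
    """Return (model name, size in decimal GB) pairs, sorted by name."""
    rows = []
    for model in tags_payload.get("models", []):
        rows.append((model["name"], round(model["size"] / 1e9, 1)))
    return sorted(rows)


if __name__ == "__main__":
    sample = {
        "models": [
            {"name": "qwen2.5-coder:32b", "size": 19_000_000_000},
            {"name": "llama3.1:8b", "size": 4_900_000_000},
        ]
    }
    for name, size_gb in summarize_models(sample):
        print(f"{name:<20} {size_gb} GB")
```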

---

## Performance on This Machine

- **Hardware:** Apple Silicon Mac (Metal GPU backend)
- **MLX warning:** `MLX dynamic library not available` is harmless; Ollama falls back to Metal/CPU automatically
- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
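
To measure tokens per second on a given machine rather than rely on the ballpark above, the native `/api/generate` response can be used. This sketch assumes the non-streaming response carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds); the sample numbers are made up for illustration.

```python
def tokens_per_second(generate_response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response dict."""
    eval_count = generate_response["eval_count"]           # tokens generated
    eval_duration_ns = generate_response["eval_duration"]  # time spent generating, in ns
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tok/s")
```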

---

## OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API (a subset of the OpenAI endpoints, including `/v1/chat/completions`):

```
Base URL:  http://localhost:11434/v1
API Key:   ollama   (any non-empty string)
```

### Example: curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### Example: Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```

### Example: Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```
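
The examples above stop at the response object. In the chat-completions shape, the model's JSON arrives as a string in `choices[0].message.content` and still has to be decoded. The following standard-library sketch builds the same request without the SDK and decodes the content. The helper names `build_chat_request` and `parse_chat_content` are illustrative.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )


def parse_chat_content(response_body: dict) -> dict:
    """Decode the JSON object the model returned as a string in message.content."""
    return json.loads(response_body["choices"][0]["message"]["content"])


if __name__ == "__main__":
    req = build_chat_request("llama3.1:8b", 'Return JSON: {"hello": "world"}')
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(parse_chat_content(json.load(resp)))
```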

---

## Extraction Service Evals

The extraction-service has a full promptfoo eval suite that can run against Ollama.

### Files

| File                                                      | Purpose                                            |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
| `services/extraction-service/evals/README.md`             | Full usage docs                                    |

### Running

```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```

### Eval Coverage

| Task                    | Cases | Key assertions                                                  |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |

**Total: 19 cases, 50+ assertions**

### Important: Assertion Pattern

In the Ollama config, `output` is the model's raw JSON **string**, so assertions must parse it inline:

```yaml
# ✅ Correct: parse the string first (a single expression, so it works as a one-line value)
- type: javascript
  value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"

# ❌ Wrong: output is a string, not an object
- type: javascript
  value: output.classes.includes('action')
```
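
For debugging a failing case outside promptfoo, the same check can be reproduced in a few lines of Python. This is only a sketch of the assertion logic, assuming the output shape used above (`extractions[].extraction_class`). It is not wired into the eval suite, and `has_extraction_class` is a hypothetical name.

```python
import json


def has_extraction_class(output: str, wanted: str) -> bool:
    """Parse the model's raw JSON string and check for an extraction class."""
    try:
        result = json.loads(output)
    except json.JSONDecodeError:
        return False  # unparseable output fails the check instead of raising
    if not isinstance(result, dict):
        return False
    classes = [item.get("extraction_class") for item in result.get("extractions", [])]
    return wanted in classes


if __name__ == "__main__":
    raw = '{"extractions": [{"extraction_class": "action", "text": "send the report"}]}'
    print(has_extraction_class(raw, "action"))
```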

### Cost

- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
- **Ollama (local):** $0.00, fully offline after model download
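
As a quick worked example of what the Gemini figure means over many iterations, the per-run range can simply be scaled. The run count of 100 is an arbitrary illustration.

```python
GEMINI_COST_PER_RUN_USD = (0.003, 0.005)  # low and high estimate from the list above


def gemini_cost_range(runs: int) -> tuple[float, float]:
    """Scale the per-run Gemini cost estimate to a number of full eval runs."""
    low, high = GEMINI_COST_PER_RUN_USD
    return (round(low * runs, 4), round(high * runs, 4))


if __name__ == "__main__":
    low, high = gemini_cost_range(100)
    print(f"100 full runs on Gemini: ${low:.2f} to ${high:.2f}; on Ollama: $0.00")
```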

---

## Pointing the Python Sidecar at Ollama

The extraction-service Python sidecar (LangExtract) uses Gemini by default.
To switch to Ollama for local dev, set these env vars before starting the sidecar:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

> Check `services/extraction-service/python/` for the exact env var names; the sidecar
> config may use different keys depending on the LangExtract version.
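
A sidecar could validate these variables at startup along the following lines. This is a hypothetical sketch, not the sidecar's actual code: the variable names are the ones listed above (which, as noted, may differ in the real config), and `SidecarLLMConfig` and `load_ollama_sidecar_config` are invented for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "LANGEXTRACT_PROVIDER",
    "LANGEXTRACT_BASE_URL",
    "LANGEXTRACT_API_KEY",
    "LANGEXTRACT_MODEL",
)


@dataclass(frozen=True)
class SidecarLLMConfig:
    provider: str
    base_url: str
    api_key: str
    model: str


def load_ollama_sidecar_config(env: Mapping[str, str]) -> SidecarLLMConfig:
    """Read the four variables, failing loudly if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing sidecar env vars: {', '.join(missing)}")
    return SidecarLLMConfig(
        provider=env["LANGEXTRACT_PROVIDER"],
        base_url=env["LANGEXTRACT_BASE_URL"],
        api_key=env["LANGEXTRACT_API_KEY"],
        model=env["LANGEXTRACT_MODEL"],
    )


if __name__ == "__main__":
    print(load_ollama_sidecar_config(os.environ))
```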

---

## Recommended Models by Use Case

| Use case                        | Recommended model | Why                                  |
| ------------------------------- | ----------------- | ------------------------------------ |
| **Extraction evals (default)**  | `llama3.1:8b`     | Good JSON compliance, fast           |
| **Better JSON structure**       | `qwen2.5:7b`      | Trained heavily on structured output |
| **Reasoning / complex triage**  | `phi4`            | Strong reasoning, fits in 9GB        |
| **Best quality (M2 Max+ only)** | `llama3.3:70b`    | Needs ~40GB RAM                      |

---

## Troubleshooting

**`MLX dynamic library not available`**
→ Harmless warning. Ollama falls back to Metal. No action needed.

**Model pull fails (SSL / proxy)**
→ The server process performs the download, so set `NO_PROXY` where `ollama serve` starts, then pull again:

```bash
NO_PROXY="ollama.com,registry.ollama.ai" ollama serve
ollama pull llama3.1:8b   # from a second terminal
```

**Ollama not responding**

```bash
# Check if running
curl http://localhost:11434/api/tags

# Restart
brew services restart ollama
# or
pkill ollama && ollama serve
```

**JSON parse errors in evals**
→ Model returned markdown-wrapped JSON (a ```` ```json ... ``` ```` fence). Add to prompt:
`Return ONLY a valid JSON object: no markdown, no backticks, no explanation.`
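
When the prompt change is not enough, a defensive parser on the consuming side can unwrap the fence before decoding. The sketch below is one possible approach, not something the eval suite is known to do. It only handles a single fence wrapping the whole reply, and `parse_model_json` is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # a literal triple backtick, built indirectly to keep this block readable
_WRAPPED = re.compile(
    rf"^\s*{FENCE}(?:json)?\s*(.*?)\s*{FENCE}\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_model_json(raw: str):
    """Decode model output as JSON, unwrapping one surrounding markdown fence if present."""
    match = _WRAPPED.match(raw)
    text = match.group(1) if match else raw
    return json.loads(text)


if __name__ == "__main__":
    wrapped = f'{FENCE}json\n{{"hello": "world"}}\n{FENCE}'
    print(parse_model_json(wrapped))
```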

**Slow inference**
→ Check Activity Monitor (or the `PROCESSOR` column of `ollama ps`): Ollama should be using the GPU (Metal). If memory pressure is the bottleneck, restart with flash attention and a quantized KV cache:

```bash
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```

(These flags were shown in the Homebrew install output. They reduce memory use; they do not by themselves switch inference from CPU to GPU.)

---

## Environment Variables Reference

| Variable                 | Default                     | Purpose                                                                          |
| ------------------------ | --------------------------- | -------------------------------------------------------------------------------- |
| `OLLAMA_HOST`            | `http://127.0.0.1:11434`    | Server bind address                                                              |
| `OLLAMA_MODELS`          | `~/.ollama/models`          | Model storage path                                                               |
| `OLLAMA_KEEP_ALIVE`      | `5m`                        | How long to keep model loaded after last request                                 |
| `OLLAMA_FLASH_ATTENTION` | `false`                     | Enable flash attention (faster, less RAM)                                        |
| `OLLAMA_KV_CACHE_TYPE`   | `f16`                       | KV cache quantization (`q8_0` = smaller RAM; requires flash attention enabled)   |
| `OLLAMA_NUM_PARALLEL`    | `1`                         | Concurrent requests                                                              |
| `OLLAMA_MODEL`           | `llama3.1:8b`               | Model used by `pnpm eval:ollama`                                                 |
| `OLLAMA_BASE_URL`        | `http://localhost:11434/v1` | Used by promptfoo ollama config                                                  |
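
When the server is started from a script rather than a shell, the server-side variables above can be layered onto the current environment. A minimal sketch, assuming the variable names in the table; the function `ollama_server_env` and its defaults are illustrative choices, not recommendations from Ollama.

```python
import os
import subprocess
from typing import Mapping


def ollama_server_env(base: Mapping[str, str], *, flash_attention: bool = True,
                      kv_cache_type: str = "q8_0", keep_alive: str = "5m") -> dict[str, str]:
    """Return a copy of base with the Ollama server tuning variables applied."""
    env = dict(base)  # copy, so the caller's mapping is left untouched
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    env["OLLAMA_KEEP_ALIVE"] = keep_alive
    return env


if __name__ == "__main__":
    # Runs the server in the foreground with the tuned environment.
    subprocess.run(["ollama", "serve"], env=ollama_server_env(os.environ), check=False)
```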