docs: add local LLM setup guide for Apple Silicon Mac (48GB)

- Add __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md: comprehensive reference
  for running Ollama on the dev Mac covering installation (v0.16.2 via brew),
  corp proxy handling (AT&T Forcepoint), OpenAI-compat API usage examples
  (curl/Node/Python), extraction-service eval integration, Python sidecar
  wiring, model recommendations by use case, troubleshooting, and env var
  reference
- Models documented: llama3.1:8b (4.9GB, default evals), qwen2.5-coder:32b
  (19GB, code gen / Swift / TS)
saravanakumardb1 2026-02-19 12:19:44 -08:00
parent f0accc0946
commit dd23f6cf96

@ -0,0 +1,278 @@
# Local LLM Setup — ByteLyst / LysnrAI / MindLyst
> Everything needed to run local OSS models for development, evals, and offline experimentation.
> Last updated: 2026-02-19
---
## Overview
We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
- Any OpenAI SDK client: set `baseURL` to the URL above and `apiKey` to any non-empty string, e.g. `"ollama"`
---
## Installation
### 1. Install Ollama
```bash
brew install ollama
```
Version installed: **0.16.2**
Binary: `/opt/homebrew/opt/ollama/bin/ollama`
Models stored at: `~/.ollama/models/`
### 2. Start the server
```bash
# Option A: foreground (dev)
ollama serve
# Option B: background service (auto-start at login)
brew services start ollama
```
Server listens on: `http://127.0.0.1:11434`
> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
> Model downloads go through it — if a pull fails with SSL errors, try:
>
> ```bash
> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
> ```
### 3. Pull a model
```bash
ollama pull llama3.1:8b # recommended default (4.9 GB)
ollama pull qwen2.5:7b # strong JSON output (4.7 GB)
ollama pull phi4 # good reasoning (8.5 GB)
```
---
## Models Installed
| Model | Size | Pull command | Notes |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `llama3.1:8b` | 4.9 GB | `ollama pull llama3.1:8b` | ✅ Installed — default for evals |
| `qwen2.5-coder:32b` | 19 GB | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
Check installed models:
```bash
ollama list
```
---
## Performance on This Machine
- **Hardware:** Apple Silicon Mac (Metal GPU backend)
- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
- **Inference speed:** ~30-50 tok/s on M2/M3, ~10-15 tok/s on M1
- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
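
To check the throughput figures above on a given machine, one option is to read the timing fields that Ollama's native `/api/generate` and `/api/chat` responses carry: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below assumes a response already parsed into a dict; `tokens_per_second` is a hypothetical helper name, and the sample payload is illustrative, not a measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation throughput from an Ollama native API response.

    Assumes the dict carries `eval_count` (generated tokens) and
    `eval_duration` (nanoseconds), as the final non-streaming payload does.
    """
    eval_count = response.get("eval_count", 0)
    eval_duration_ns = response.get("eval_duration", 0)
    if eval_duration_ns <= 0:
        return 0.0  # no timing data, e.g. an intermediate streaming chunk
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative payload only; not a real measurement from this machine.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

For a quick manual check, `ollama run <model> --verbose` prints similar timing statistics after each reply.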
---
## OpenAI-Compatible API
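Ollama exposes an OpenAI-compatible API that works as a drop-in for the chat-completions calls used here (not every OpenAI feature is guaranteed to be supported):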
```
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
```
### Example: curl
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```
### Example: Node.js (OpenAI SDK)
```typescript
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // any non-empty string
});
const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```
### Example: Python
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```
---
## Extraction Service Evals
The extraction-service has a full promptfoo eval suite that can run against Ollama.
### Files
| File | Purpose |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |
### Running
```bash
cd services/extraction-service
# Ollama only (no extraction-service needed)
pnpm eval:ollama
# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```
### Eval Coverage
| Task | Cases | Key assertions |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
**Total: 19 cases, 50+ assertions**
### Important: Assertion Pattern
Ollama returns a raw JSON **string** — assertions must parse it inline:
```yaml
# ✅ Correct
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
# ❌ Wrong — output is a string, not an object
- type: javascript
  value: output.classes.includes('action')
```
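
The same parse-then-check logic can be sketched in Python, for example when debugging a failing case outside promptfoo. This is a standalone illustration, not part of the eval suite: `has_extraction_class` is a hypothetical helper, and the `extractions` / `extraction_class` shape is taken from the JavaScript assertion above.

```python
import json


def has_extraction_class(output: str, wanted: str) -> bool:
    """Return True if the raw model output contains an extraction of class `wanted`.

    `output` is the unparsed JSON string, so it is decoded first. Malformed
    JSON or an unexpected shape counts as a failed check rather than a crash.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    extractions = parsed.get("extractions", [])
    if not isinstance(extractions, list):
        return False
    return any(
        isinstance(item, dict) and item.get("extraction_class") == wanted
        for item in extractions
    )


raw = '{"extractions": [{"extraction_class": "action"}]}'
print(has_extraction_class(raw, "action"))
```

Returning `False` on malformed output keeps a bad model reply from surfacing as an exception instead of a failed assertion.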
### Cost
- **Gemini (via extraction-service):** ~$0.003-0.005 per full run (gemini-2.5-flash)
- **Ollama (local):** $0.00 — fully offline after model download
---
## Pointing the Python Sidecar at Ollama
The extraction-service Python sidecar (LangExtract) uses Gemini by default.
To switch to Ollama for local dev, set these env vars before starting the sidecar:
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
> Check `services/extraction-service/python/` for the exact env var names — the sidecar
> config may use different keys depending on the LangExtract version.
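
Before starting the sidecar, a small pre-flight check can catch a missing variable or a base URL without the `/v1` suffix. The sketch below assumes the four variable names from the export block above; as noted, the real sidecar may read different keys, and `check_sidecar_env` is a hypothetical helper.

```python
import os
from typing import Mapping

# Names and values copied from the export block above. They are assumptions:
# the actual sidecar config may use different keys.
DOCUMENTED_VALUES = {
    "LANGEXTRACT_PROVIDER": "openai_compat",
    "LANGEXTRACT_BASE_URL": "http://localhost:11434/v1",
    "LANGEXTRACT_API_KEY": "ollama",
    "LANGEXTRACT_MODEL": "llama3.1:8b",
}


def check_sidecar_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing looks wrong."""
    problems = []
    for name in DOCUMENTED_VALUES:
        if not env.get(name):
            problems.append(f"{name} is not set")
    base_url = env.get("LANGEXTRACT_BASE_URL", "")
    if base_url and not base_url.rstrip("/").endswith("/v1"):
        problems.append(
            "LANGEXTRACT_BASE_URL should end in /v1 for the OpenAI-compatible API"
        )
    return problems


# Report anything missing from the current shell environment.
for problem in check_sidecar_env(os.environ):
    print(problem)
```

Run it in the same shell that will launch the sidecar; no output means all four variables are set and the base URL ends in `/v1`.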
---
## Recommended Models by Use Case
| Use case | Recommended model | Why |
| ------------------------------- | ----------------- | ------------------------------------ |
| **Extraction evals (default)** | `llama3.1:8b` | Good JSON compliance, fast |
| **Better JSON structure** | `qwen2.5:7b` | Trained heavily on structured output |
| **Reasoning / complex triage** | `phi4` | Strong reasoning, fits in 9GB |
| **Best quality (M2 Max+ only)** | `llama3.3:70b` | Needs ~40GB RAM |
---
## Troubleshooting
**`MLX dynamic library not available`**
→ Harmless warning. Ollama falls back to Metal. No action needed.
**Model pull fails (SSL / proxy)**
```bash
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
```
**Ollama not responding**
```bash
# Check if running
curl http://localhost:11434/api/tags
# Restart
brew services restart ollama
# or
pkill ollama && ollama serve
```
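
The same `/api/tags` check can be scripted, for instance as a guard at the top of an eval runner. This is a minimal sketch using only the standard library; the helper names are hypothetical. It builds an opener with proxies disabled so that a corporate `HTTP_PROXY` setting does not reroute a localhost request.

```python
import json
import urllib.request
from typing import Optional

# An empty ProxyHandler mapping disables proxy auto-detection for this opener,
# so requests to localhost are never sent through the corporate proxy.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def parse_model_names(tags_body: bytes) -> list[str]:
    """Extract model names from an /api/tags response body."""
    payload = json.loads(tags_body)
    return [model["name"] for model in payload.get("models", [])]


def installed_models(
    base: str = "http://localhost:11434", timeout: float = 3.0
) -> Optional[list[str]]:
    """Return installed model names, or None if the server is unreachable."""
    try:
        with _direct.open(f"{base}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read())
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError is a subclass);
        # ValueError covers a body that is not valid JSON.
        return None
```

`installed_models()` returning `None` means the server did not answer, so the restart commands above apply. A returned list that lacks the expected model means the server is up but the model still needs an `ollama pull`.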
**JSON parse errors in evals**
→ Model returned JSON wrapped in a Markdown code fence (three backticks plus `json`, then three closing backticks). Add to prompt:
`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
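
If the prompt change alone is not enough, a tolerant parser on the consuming side is another option. The sketch below strips a fence that wraps the entire reply before decoding. It is an assumption about how one might handle this, not something the eval suite does today, and `parse_model_json` is a hypothetical helper.

```python
import json
import re

# Three backticks, built at runtime so this snippet stays safe to embed in Markdown.
TICKS = "`" * 3

# Matches an optional fence (with or without a "json" tag) around the whole reply.
_FENCE = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*\n(.*?)\n?\s*" + TICKS + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_model_json(text: str):
    """Parse model output as JSON, tolerating a surrounding Markdown fence."""
    match = _FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)  # still raises json.JSONDecodeError on bad JSON


wrapped = TICKS + 'json\n{"hello": "world"}\n' + TICKS
print(parse_model_json(wrapped))
```

It only handles a fence around the whole reply. Prose before or after the fence still raises, which is usually the right signal that the prompt needs tightening.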
**Slow inference**
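→ Check Activity Monitor: Ollama should be using the GPU (Metal). The flags below do not by themselves move work from CPU to GPU, but flash attention and a quantized KV cache reduce memory pressure, which can help when a model barely fits in unified memory. Restart with: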
```bash
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```
(These flags were shown in the Homebrew install output.)
---
## Environment Variables Reference
| Variable | Default | Purpose |
| ------------------------ | --------------------------- | ------------------------------------------------ |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Server bind address |
| `OLLAMA_MODELS` | `~/.ollama/models` | Model storage path |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep model loaded after last request |
| `OLLAMA_FLASH_ATTENTION` | `false` | Enable flash attention (faster, less RAM) |
| `OLLAMA_KV_CACHE_TYPE` | `f16` | KV cache quantization (`q8_0` = smaller RAM; requires flash attention) |
| `OLLAMA_NUM_PARALLEL` | `1` | Concurrent requests |
| `OLLAMA_MODEL` | `llama3.1:8b` | Model used by `pnpm eval:ollama` |
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Used by promptfoo ollama config |
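
Client code that needs the OpenAI-compatible URL could derive it from these variables. This is a sketch under stated assumptions: the precedence (explicit `OLLAMA_BASE_URL` first, then `OLLAMA_HOST`, then the default) is a choice made here, not documented Ollama behavior, and `openai_base_url` is a hypothetical helper.

```python
import os
from typing import Mapping


def openai_base_url(env: Mapping[str, str]) -> str:
    """Pick the OpenAI-compatible base URL from the environment.

    Assumed precedence: OLLAMA_BASE_URL, then OLLAMA_HOST with /v1 appended,
    then the default local address from the table above.
    """
    explicit = env.get("OLLAMA_BASE_URL")
    if explicit:
        return explicit.rstrip("/")
    host = env.get("OLLAMA_HOST", "http://127.0.0.1:11434")
    if "://" not in host:
        host = f"http://{host}"  # OLLAMA_HOST is often set without a scheme
    return host.rstrip("/") + "/v1"


print(openai_base_url(os.environ))
```

If `OLLAMA_HOST` is set to a bind-all address such as `0.0.0.0:11434`, set `OLLAMA_BASE_URL` explicitly rather than relying on the derived value.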