docs: add local LLM setup guide for Apple Silicon Mac (48GB)

- Add __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md: comprehensive reference for running Ollama on the dev Mac, covering installation (v0.16.2 via brew), corp proxy handling (AT&T Forcepoint), OpenAI-compat API usage examples (curl/Node/Python), extraction-service eval integration, Python sidecar wiring, model recommendations by use case, troubleshooting, and env var reference
- Models documented: llama3.1:8b (4.9GB, default evals), qwen2.5-coder:32b (19GB, code gen / Swift / TS)
2026-02-19 12:19:44 -08:00 · dd23f6cf96
commit dd23f6cf96
parent f0accc0946
1 changed files with 278 additions and 0 deletions
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -0,0 +1,278 @@
+# Local LLM Setup — ByteLyst / LysnrAI / MindLyst
+
+> Everything needed to run local OSS models for development, evals, and offline experimentation.
+> Last updated: 2026-02-19
+
+---
+
+## Overview
+
+We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
+OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
+
+- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
+- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
+- Any OpenAI SDK client: set `baseURL` to the URL above and `apiKey` to any non-empty string (for example `"ollama"`)
+
+---
+
+## Installation
+
+### 1. Install Ollama
+
+```bash
+brew install ollama
+```
+
+Version installed: **0.16.2**
+Binary: `/opt/homebrew/opt/ollama/bin/ollama`
+Models stored at: `~/.ollama/models/`
+
+### 2. Start the server
+
+```bash
+# Option A: foreground (dev)
+ollama serve
+
+# Option B: background service (auto-start at login)
+brew services start ollama
+```
+
+Server listens on: `http://127.0.0.1:11434`
+
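+> **Corporate proxy note:** Ollama picks up `HTTPS_PROXY` (model pulls use HTTPS only) from the environment of the server process (`ollama serve`), which is the process that performs downloads.
+> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
+> Model downloads go through it. If a pull fails with SSL errors, try restarting the server with the registry hosts excluded from the proxy, then pull again:
+>
+> ```bash
+> NO_PROXY="ollama.com,registry.ollama.ai" ollama serve   # the server performs the download
+> ollama pull <model>                                     # from a second terminal
+> ```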
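+
+As a quick check that the server from step 2 is up, the sketch below queries the `/api/tags` endpoint (the same one used under Troubleshooting) and lists the installed models. The helper names `model_names` and `installed_models` are illustrative, not part of any project code. It bypasses proxy settings for the loopback call, on the assumption that proxy variables may be set in the shell.
+
+```python
+import json
+import urllib.request
+from typing import List, Optional
+
+OLLAMA_URL = "http://127.0.0.1:11434"  # server address from this guide
+
+# An empty ProxyHandler makes urllib ignore proxy settings, so the
+# loopback request is never routed through the corporate proxy.
+_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
+
+
+def model_names(tags_payload: dict) -> List[str]:
+    """Extract model names from a parsed /api/tags response body."""
+    return [m["name"] for m in tags_payload.get("models", [])]
+
+
+def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[List[str]]:
+    """Return installed model names, or None if the server is unreachable."""
+    try:
+        with _direct.open(f"{base_url}/api/tags", timeout=timeout) as resp:
+            return model_names(json.load(resp))
+    except OSError:  # covers URLError, refused connections, and timeouts
+        return None
+```
+
+`installed_models()` returns `None` when nothing answers on the port, which is a reasonable signal to run `ollama serve`.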
+
+### 3. Pull a model
+
+```bash
+ollama pull llama3.1:8b       # recommended default (4.9 GB)
+ollama pull qwen2.5:7b        # strong JSON output (4.7 GB)
+ollama pull phi4               # good reasoning (8.5 GB)
+```
+
+---
+
+## Models Installed
+
+| Model               | Size   | Pull command                    | Notes                                         |
+| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
+| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | ✅ Installed — default for evals              |
+| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
+
+Check installed models:
+
+```bash
+ollama list
+```
+
+---
+
+## Performance on This Machine
+
+- **Hardware:** Apple Silicon Mac (Metal GPU backend)
+- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
+- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
+- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
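+
+The speed figures above can be checked locally. A non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so tokens per second follows directly. A minimal sketch, with hypothetical function names and an arbitrary prompt:
+
+```python
+import json
+import urllib.request
+
+OLLAMA_URL = "http://127.0.0.1:11434"
+# Skip any configured proxy for the loopback call.
+_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
+
+
+def tokens_per_second(response: dict) -> float:
+    """Generation speed: tokens produced divided by generation time in seconds."""
+    seconds = response["eval_duration"] / 1e9  # Ollama reports nanoseconds
+    return response["eval_count"] / seconds
+
+
+def measure(model: str = "llama3.1:8b", prompt: str = "Count from 1 to 50.") -> float:
+    """Run one non-streaming generation and return its tokens per second."""
+    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
+    req = urllib.request.Request(
+        f"{OLLAMA_URL}/api/generate",
+        data=body,
+        headers={"Content-Type": "application/json"},
+    )
+    with _direct.open(req, timeout=120) as resp:
+        return tokens_per_second(json.load(resp))
+```
+
+Model load time is reported separately in `load_duration`, so the result reflects generation speed only.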
+
+---
+
+## OpenAI-Compatible API
+
+Ollama exposes an OpenAI-compatible API that works as a drop-in endpoint for the chat completion calls shown here:
+
+```
+Base URL:  http://localhost:11434/v1
+API Key:   ollama  (any non-empty string)
+```
+
+### Example: curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### Example: Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await client.chat.completions.create({
+  model: 'llama3.1:8b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+```
+
+### Example: Python
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+```
+
+---
+
+## Extraction Service Evals
+
+The extraction-service has a full promptfoo eval suite that can run against Ollama.
+
+### Files
+
+| File                                                      | Purpose                                            |
+| --------------------------------------------------------- | -------------------------------------------------- |
+| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
+| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
+| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
+| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
+| `services/extraction-service/evals/README.md`             | Full usage docs                                    |
+
+### Running
+
+```bash
+cd services/extraction-service
+
+# Ollama only (no extraction-service needed)
+pnpm eval:ollama
+
+# Different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
+pnpm eval:compare
+```
+
+### Eval Coverage
+
+| Task                    | Cases | Key assertions                                                  |
+| ----------------------- | ----- | --------------------------------------------------------------- |
+| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
+| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
+| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
+| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
+| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
+
+**Total: 19 cases, 50+ assertions**
+
+### Important: Assertion Pattern
+
+Ollama returns a raw JSON **string** — assertions must parse it inline:
+
+```yaml
+# ✅ Correct: parse the string, then assert (a single expression, so it is safe as a one-line value)
+- type: javascript
+  value: "JSON.parse(output).extractions.map(e => e.extraction_class).includes('action')"
+
+# ❌ Wrong — output is a string, not an object
+- type: javascript
+  value: output.classes.includes('action')
+```
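+
+The same parse-then-assert logic can be mirrored outside promptfoo, which may help when debugging a failing case against a saved model response. This is an illustrative Python helper, not a promptfoo API; the `extractions` / `extraction_class` shape is taken from the assertion above, and the function name is hypothetical.
+
+```python
+import json
+
+
+def has_extraction_class(output: str, wanted: str) -> bool:
+    """Parse the raw model output, then look for an extraction of the wanted class."""
+    result = json.loads(output)  # output arrives as text, so parse it first
+    extractions = result.get("extractions", [])
+    return any(e.get("extraction_class") == wanted for e in extractions)
+
+
+# A saved response is a string, exactly as the assertion receives it.
+raw = '{"extractions": [{"extraction_class": "action"}]}'
+print(has_extraction_class(raw, "action"))
+```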
+
+### Cost
+
+- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
+- **Ollama (local):** $0.00 — fully offline after model download
+
+---
+
+## Pointing the Python Sidecar at Ollama
+
+The extraction-service Python sidecar (LangExtract) uses Gemini by default.
+To switch to Ollama for local dev, set these env vars before starting the sidecar:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+> Check `services/extraction-service/python/` for the exact env var names — the sidecar
+> config may use different keys depending on the LangExtract version.
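+
+For orientation, here is a sketch of how a sidecar could read these variables. It is hypothetical: the variable names come from the block above and may not match the real sidecar (as noted), and the `gemini` default label and the `SidecarLLMConfig` class are assumptions made for illustration.
+
+```python
+import os
+from dataclasses import dataclass
+from typing import Mapping
+
+
+@dataclass(frozen=True)
+class SidecarLLMConfig:
+    provider: str
+    base_url: str
+    api_key: str
+    model: str
+
+
+def load_llm_config(env: Mapping[str, str] = os.environ) -> SidecarLLMConfig:
+    """Build the config from env vars; names and defaults are assumptions."""
+    cfg = SidecarLLMConfig(
+        provider=env.get("LANGEXTRACT_PROVIDER", "gemini"),  # assumed default label
+        base_url=env.get("LANGEXTRACT_BASE_URL", ""),
+        api_key=env.get("LANGEXTRACT_API_KEY", ""),
+        model=env.get("LANGEXTRACT_MODEL", ""),
+    )
+    # An OpenAI-compatible provider is unusable without an endpoint and a model.
+    if cfg.provider == "openai_compat" and not (cfg.base_url and cfg.model):
+        raise ValueError("openai_compat needs LANGEXTRACT_BASE_URL and LANGEXTRACT_MODEL")
+    return cfg
+```
+
+Failing fast on a half-configured `openai_compat` setup avoids silently falling back to a remote provider during local-only work.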
+
+---
+
+## Recommended Models by Use Case
+
+| Use case                        | Recommended model | Why                                  |
+| ------------------------------- | ----------------- | ------------------------------------ |
+| **Extraction evals (default)**  | `llama3.1:8b`     | Good JSON compliance, fast           |
+| **Better JSON structure**       | `qwen2.5:7b`      | Trained heavily on structured output |
+| **Reasoning / complex triage**  | `phi4`            | Strong reasoning, fits in 9GB        |
+| **Best quality (M2 Max+ only)** | `llama3.3:70b`    | Needs ~40GB RAM                      |
+
+---
+
+## Troubleshooting
+
+**`MLX dynamic library not available`**
+→ Harmless warning. Ollama falls back to Metal. No action needed.
+
+**Model pull fails (SSL / proxy)**
+
+```bash
+NO_PROXY="ollama.com,registry.ollama.ai" ollama serve   # the server performs the download
+ollama pull llama3.1:8b                                 # from a second terminal
+```
+
+**Ollama not responding**
+
+```bash
+# Check if running
+curl http://localhost:11434/api/tags
+
+# Restart
+brew services restart ollama
+# or
+pkill ollama && ollama serve
+```
+
+**JSON parse errors in evals**
+→ Model returned markdown-wrapped JSON (`` ```json ... ``` ``). Add to prompt:
+`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
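+
+When the prompt cannot be changed, a client-side fallback is to strip a wrapping code fence before parsing. A minimal sketch, assuming the response is either bare JSON or JSON inside a single fence; `parse_model_json` is a hypothetical helper name:
+
+```python
+import json
+import re
+
+TICKS = "`" * 3  # the three-backtick fence marker, built without writing it literally
+
+# Matches an optional language tag after the opening fence and captures the body.
+_FENCED = re.compile(
+    r"^" + TICKS + r"[a-zA-Z]*\s*\n(.*?)\n?" + TICKS + r"\s*$",
+    re.DOTALL,
+)
+
+
+def parse_model_json(text: str):
+    """Parse model output as JSON, tolerating a single surrounding code fence."""
+    cleaned = text.strip()
+    match = _FENCED.match(cleaned)
+    if match:
+        cleaned = match.group(1)
+    return json.loads(cleaned)  # still raises on genuinely invalid JSON
+```
+
+Text before or after the fence is not handled here, so the prompt instruction above remains the more reliable fix.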
+
+**Slow inference**
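+→ Check Activity Monitor or `ollama ps` (its `PROCESSOR` column should read `100% GPU`). If part of the model has spilled to CPU, reduce memory use with the flags below. They enable flash attention and a quantized KV cache; they do not force GPU use by themselves. Restart with: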
+
+```bash
+OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+```
+
+(These flags were shown in the Homebrew install output.)
+
+---
+
+## Environment Variables Reference
+
+| Variable                 | Default                     | Purpose                                          |
+| ------------------------ | --------------------------- | ------------------------------------------------ |
+| `OLLAMA_HOST`            | `http://127.0.0.1:11434`    | Server bind address                              |
+| `OLLAMA_MODELS`          | `~/.ollama/models`          | Model storage path                               |
+| `OLLAMA_KEEP_ALIVE`      | `5m`                        | How long to keep model loaded after last request |
+| `OLLAMA_FLASH_ATTENTION` | `false`                     | Enable flash attention (faster, less RAM)        |
+| `OLLAMA_KV_CACHE_TYPE`   | `f16`                       | KV cache quantization (`q8_0` = smaller RAM)     |
+| `OLLAMA_NUM_PARALLEL`    | `1`                         | Concurrent requests                              |
+| `OLLAMA_MODEL`           | `llama3.1:8b`               | Model used by `pnpm eval:ollama`                 |
+| `OLLAMA_BASE_URL`        | `http://localhost:11434/v1` | Used by promptfoo ollama config                  |