From dd23f6cf96a72294f8526e7fdad0207e1aed3ba2 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Thu, 19 Feb 2026 12:19:44 -0800
Subject: [PATCH] docs: add local LLM setup guide for Apple Silicon Mac (48GB)

- Add __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md: comprehensive reference
  for running Ollama on the dev Mac covering installation (v0.16.2 via brew),
  corp proxy handling (AT&T Forcepoint), OpenAI-compat API usage examples
  (curl/Node/Python), extraction-service eval integration, Python sidecar
  wiring, model recommendations by use case, troubleshooting, and env var
  reference
- Models documented: llama3.1:8b (4.9GB, default evals), qwen2.5-coder:32b
  (19GB, code gen / Swift / TS)
---
 __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md | 278 +++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100644 __LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md

diff --git a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
new file mode 100644
index 00000000..803cf0ec
--- /dev/null
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -0,0 +1,278 @@
+# Local LLM Setup — ByteLyst / LysnrAI / MindLyst
+
+> Everything needed to run local OSS models for development, evals, and offline experimentation.
+> Last updated: 2026-02-19
+
+---
+
+## Overview
+
+We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
+OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
+
+- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
+- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
+- Any OpenAI SDK client — just set `baseURL` to the URL above and `apiKey` to `"ollama"`
+
+---
+
+## Installation
+
+### 1. Install Ollama
+
+```bash
+brew install ollama
+```
+
+Version installed: **0.16.2**
+Binary: `/opt/homebrew/opt/ollama/bin/ollama`
+Models stored at: `~/.ollama/models/`
+
+### 2. Start the server
+
+```bash
+# Option A: foreground (dev)
+ollama serve
+
+# Option B: background service (auto-start at login)
+brew services start ollama
+```
+
+Server listens on: `http://127.0.0.1:11434`
+
+> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
+> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
+> Model downloads go through it — if a pull fails with SSL errors, try:
+>
+> ```bash
+> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
+> ```
+
+### 3. Pull a model
+
+```bash
+ollama pull llama3.1:8b   # recommended default (4.9 GB)
+ollama pull qwen2.5:7b    # strong JSON output (4.7 GB)
+ollama pull phi4          # good reasoning (8.5 GB)
+```
+
+---
+
+## Models Installed
+
+| Model               | Size   | Pull command                    | Notes                                         |
+| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
+| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | ✅ Installed — default for evals              |
+| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
+
+Check installed models:
+
+```bash
+ollama list
+```
+
+---
+
+## Performance on This Machine
+
+- **Hardware:** Apple Silicon Mac (Metal GPU backend)
+- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
+- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
+- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
+
+---
+
+## OpenAI-Compatible API
+
+Ollama exposes a drop-in OpenAI-compatible API:
+
+```
+Base URL: http://localhost:11434/v1
+API Key:  ollama (any non-empty string)
+```
+
+### Example: curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### Example: Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await client.chat.completions.create({
+  model: 'llama3.1:8b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+```
+
+### Example: Python
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+```
+
+---
+
+## Extraction Service Evals
+
+The extraction-service has a full promptfoo eval suite that can run against Ollama.
+
+### Files
+
+| File                                                      | Purpose                                            |
+| --------------------------------------------------------- | -------------------------------------------------- |
+| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
+| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
+| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
+| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
+| `services/extraction-service/evals/README.md`             | Full usage docs                                    |
+
+### Running
+
+```bash
+cd services/extraction-service
+
+# Ollama only (no extraction-service needed)
+pnpm eval:ollama
+
+# Different model
+OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
+
+# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
+pnpm eval:compare
+```
+
+### Eval Coverage
+
+| Task                    | Cases | Key assertions                                                  |
+| ----------------------- | ----- | --------------------------------------------------------------- |
+| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
+| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
+| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
+| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
+| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
+
+**Total: 19 cases, 50+ assertions**
+
+### Important: Assertion Pattern
+
+Ollama returns a raw JSON **string** — assertions must parse it inline:
+
+```yaml
+# ✅ Correct
+- type: javascript
+  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
+
+# ❌ Wrong — output is a string, not an object
+- type: javascript
+  value: output.classes.includes('action')
+```
+
+### Cost
+
+- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
+- **Ollama (local):** $0.00 — fully offline after model download
+
+---
+
+## Pointing the Python Sidecar at Ollama
+
+The extraction-service Python sidecar (LangExtract) uses Gemini by default.
+To switch to Ollama for local dev, set these env vars before starting the sidecar:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+> Check `services/extraction-service/python/` for the exact env var names — the sidecar
+> config may use different keys depending on the LangExtract version.
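Before starting the sidecar, it can help to confirm that the values above describe a working endpoint. The sketch below is one possible check using only the Python standard library: it reads the `LANGEXTRACT_*` values (names copied from the export block above, which the note flags as possibly version-dependent), builds a single OpenAI-style chat request, and can send it. The helper names `build_chat_request` and `smoke_test` are hypothetical and are not part of the sidecar.

```python
import json
import os
import urllib.request


def build_chat_request(env):
    """Build one OpenAI-style chat request from LANGEXTRACT_* settings.

    The variable names mirror the export block above and are assumptions;
    the fallback values are the ones shown in this guide.
    """
    base_url = env.get("LANGEXTRACT_BASE_URL", "http://localhost:11434/v1").rstrip("/")
    api_key = env.get("LANGEXTRACT_API_KEY", "ollama")
    model = env.get("LANGEXTRACT_MODEL", "llama3.1:8b")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": 'Return JSON: {"ok": true}'}],
        "response_format": {"type": "json_object"},
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def smoke_test(env=None, timeout=60):
    """Send the request and return the reply text (needs a running server)."""
    request = build_chat_request(os.environ if env is None else env)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With `ollama serve` running and the model pulled, calling `smoke_test()` should return a short JSON string. If `HTTP_PROXY` is set, `urllib` may try to route even localhost traffic through the proxy; adding `localhost,127.0.0.1` to `NO_PROXY` is the usual way around that.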
+
+---
+
+## Recommended Models by Use Case
+
+| Use case                        | Recommended model | Why                                  |
+| ------------------------------- | ----------------- | ------------------------------------ |
+| **Extraction evals (default)**  | `llama3.1:8b`     | Good JSON compliance, fast           |
+| **Better JSON structure**       | `qwen2.5:7b`      | Trained heavily on structured output |
+| **Reasoning / complex triage**  | `phi4`            | Strong reasoning, fits in 9 GB       |
+| **Best quality (M2 Max+ only)** | `llama3.3:70b`    | Needs ~40 GB RAM                     |
+
+---
+
+## Troubleshooting
+
+**`MLX dynamic library not available`**
+→ Harmless warning. Ollama falls back to Metal. No action needed.
+
+**Model pull fails (SSL / proxy)**
+
+```bash
+NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
+```
+
+**Ollama not responding**
+
+```bash
+# Check if running
+curl http://localhost:11434/api/tags
+
+# Restart
+brew services restart ollama
+# or
+pkill ollama && ollama serve
+```
+
+**JSON parse errors in evals**
+→ Model returned JSON wrapped in a markdown code fence (three backticks plus `json`). Add to prompt:
+`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
+
+**Slow inference**
+→ Check Activity Monitor — Ollama should be using GPU (Metal). If CPU-only, restart with:
+
+```bash
+OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
+```
+
+(These flags were shown in the Homebrew install output.)
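For the fence-wrapped JSON case above, tightening the prompt is the first fix, but local scripts that consume model output can also be made tolerant of it. The following is a minimal sketch, assuming the reply is either bare JSON or JSON wrapped in exactly one surrounding markdown fence; `parse_model_json` is a hypothetical helper, not something the eval suite ships. The promptfoo assertions themselves are JavaScript, so the same idea would need porting there.

```python
import json
import re

# Matches one opening fence (optionally tagged, e.g. "json") and one closing
# fence wrapped around the entire reply; the captured group is the payload.
_FENCE = re.compile(r"^\s*```[a-zA-Z0-9_-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def parse_model_json(raw):
    """Parse a model reply as JSON, tolerating one surrounding markdown fence."""
    match = _FENCE.match(raw)
    text = match.group(1) if match else raw
    # json.loads still raises on anything that is not valid JSON, so replies
    # with extra prose around the object fail loudly instead of being guessed at.
    return json.loads(text)
```

Replies that mix prose with JSON are deliberately not handled; in that case the prompt change above is the better remedy.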
+
+---
+
+## Environment Variables Reference
+
+| Variable                 | Default                     | Purpose                                          |
+| ------------------------ | --------------------------- | ------------------------------------------------ |
+| `OLLAMA_HOST`            | `http://127.0.0.1:11434`    | Server bind address                              |
+| `OLLAMA_MODELS`          | `~/.ollama/models`          | Model storage path                               |
+| `OLLAMA_KEEP_ALIVE`      | `5m`                        | How long to keep model loaded after last request |
+| `OLLAMA_FLASH_ATTENTION` | `false`                     | Enable flash attention (faster, less RAM)        |
+| `OLLAMA_KV_CACHE_TYPE`   | _(none)_                    | KV cache quantization (`q8_0` = smaller RAM)     |
+| `OLLAMA_NUM_PARALLEL`    | `1`                         | Concurrent requests                              |
+| `OLLAMA_MODEL`           | `llama3.1:8b`               | Model used by `pnpm eval:ollama`                 |
+| `OLLAMA_BASE_URL`        | `http://localhost:11434/v1` | Used by promptfoo ollama config                  |
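Two of the variables above describe the same server from different sides: `OLLAMA_HOST` is the address the server binds to, and `OLLAMA_BASE_URL` is the OpenAI-compatible URL that clients use. A local script that wants to honour both could resolve them as sketched below. The precedence order and the `resolve_base_url` helper are assumptions of this example, not documented Ollama or promptfoo behaviour.

```python
import os

DEFAULT_BASE_URL = "http://localhost:11434/v1"


def resolve_base_url(env=None):
    """Pick the OpenAI-compatible base URL for a local client.

    Assumed precedence (hypothetical, for illustration only):
    OLLAMA_BASE_URL, then OLLAMA_HOST with /v1 appended, then the default.
    """
    env = os.environ if env is None else env
    explicit = env.get("OLLAMA_BASE_URL", "").strip()
    if explicit:
        return explicit.rstrip("/")
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return DEFAULT_BASE_URL
    if "://" not in host:
        host = "http://" + host  # accept bare host:port values
    return host.rstrip("/") + "/v1"
```

The result can be passed as `base_url` to the OpenAI client from the Python example earlier. Two limitations worth knowing: a host value without a port is passed through unchanged, and a wildcard bind address such as `0.0.0.0` is not rewritten to a connectable one.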