diff --git a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
index 803cf0ec..0cb8288f 100644
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -1,278 +1,51 @@
 # Local LLM Setup — ByteLyst / LysnrAI / MindLyst
 
-> Everything needed to run local OSS models for development, evals, and offline experimentation.
+> **This file is preserved for reference. Full documentation has moved to [`docs/`](docs/README.md).**
+>
 > Last updated: 2026-02-19
 
 ---
 
-## Overview
-
-We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
-OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
-
-- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
-- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
-- Any OpenAI SDK client — just change `baseURL` and `apiKey: "ollama"`
-
----
-
-## Installation
-
-### 1. Install Ollama
+## Quick Start
 
 ```bash
-brew install ollama
-```
-
-Version installed: **0.16.2**
-Binary: `/opt/homebrew/opt/ollama/bin/ollama`
-Models stored at: `~/.ollama/models/`
-
-### 2. Start the server
-
-```bash
-# Option A: foreground (dev)
+# 1. Start Ollama
 ollama serve
 
-# Option B: background service (auto-start at login)
-brew services start ollama
-```
+# 2. Run best coding model
+ollama run qwen2.5-coder:32b
 
-Server listens on: `http://127.0.0.1:11434`
-
-> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
-> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
-> Model downloads go through it — if a pull fails with SSL errors, try:
->
-> ```bash
-> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull
-> ```
-
-### 3. Pull a model
-
-```bash
-ollama pull llama3.1:8b # recommended default (4.9 GB)
-ollama pull qwen2.5:7b # strong JSON output (4.7 GB)
-ollama pull phi4 # good reasoning (8.5 GB)
+# 3. Launch Mission Control dashboard
+cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
+# Open http://localhost:3100
 ```
 
 ---
 
-## Models Installed
+## Documentation Index
 
-| Model | Size | Pull command | Notes |
-| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
-| `llama3.1:8b` | 4.9 GB | `ollama pull llama3.1:8b` | ✅ Installed — default for evals |
-| `qwen2.5-coder:32b` | 19 GB | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
+All documentation is now organized in [`docs/`](docs/README.md):
 
-Check installed models:
-
-```bash
-ollama list
-```
+| # | Document | Description |
+| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
+| 01 | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info |
+| 02 | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior, API |
+| 03 | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI, streaming, ffmpeg |
+| 04 | [Multimodal Local Stack](docs/04-multimodal-local-stack.md) | Vision models, audio pipeline, video status, Kimi alternatives |
+| 05 | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features |
+| 06 | [Extraction Service Evals](docs/06-extraction-service-evals.md) | promptfoo suite, Ollama vs Gemini, Python sidecar config |
+| 07 | [Model Recommendations](docs/07-model-recommendations.md) | Tiered guide: coding, reasoning, vision, embeddings |
+| 08 | [Troubleshooting](docs/08-troubleshooting.md) | Corporate proxy, MLX warnings, common fixes |
+| 09 | [Environment Variables](docs/09-environment-variables.md) | All config vars: Ollama, Whisper, dashboard, evals |
 
 ---
 
-## Performance on This Machine
+## Current Status (2026-02-19)
 
-- **Hardware:** Apple Silicon Mac (Metal GPU backend)
-- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
-- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
-- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
-
----
-
-## OpenAI-Compatible API
-
-Ollama exposes a drop-in OpenAI API:
-
-```
-Base URL: http://localhost:11434/v1
-API Key: ollama (any non-empty string)
-```
-
-### Example: curl
-
-```bash
-curl http://localhost:11434/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "llama3.1:8b",
-    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
-    "response_format": {"type": "json_object"}
-  }'
-```
-
-### Example: Node.js (OpenAI SDK)
-
-```typescript
-import OpenAI from 'openai';
-
-const client = new OpenAI({
-  baseURL: 'http://localhost:11434/v1',
-  apiKey: 'ollama',
-});
-
-const res = await client.chat.completions.create({
-  model: 'llama3.1:8b',
-  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
-  response_format: { type: 'json_object' },
-});
-```
-
-### Example: Python
-
-```python
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
-
-response = client.chat.completions.create(
-    model="llama3.1:8b",
-    messages=[{"role": "user", "content": "Extract action items from: ..."}],
-    response_format={"type": "json_object"},
-)
-```
-
----
-
-## Extraction Service Evals
-
-The extraction-service has a full promptfoo eval suite that can run against Ollama.
-
-### Files
-
-| File | Purpose |
-| --------------------------------------------------------- | -------------------------------------------------- |
-| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
-| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
-| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
-| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
-| `services/extraction-service/evals/README.md` | Full usage docs |
-
-### Running
-
-```bash
-cd services/extraction-service
-
-# Ollama only (no extraction-service needed)
-pnpm eval:ollama
-
-# Different model
-OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
-
-# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
-pnpm eval:compare
-```
-
-### Eval Coverage
-
-| Task | Cases | Key assertions |
-| ----------------------- | ----- | --------------------------------------------------------------- |
-| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
-| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
-| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
-| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
-| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
-
-**Total: 19 cases, 50+ assertions**
-
-### Important: Assertion Pattern
-
-Ollama returns a raw JSON **string** — assertions must parse it inline:
-
-```yaml
-# ✅ Correct
-- type: javascript
-  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
-
-# ❌ Wrong — output is a string, not an object
-- type: javascript
-  value: output.classes.includes('action')
-```
-
-### Cost
-
-- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
-- **Ollama (local):** $0.00 — fully offline after model download
-
----
-
-## Pointing the Python Sidecar at Ollama
-
-The extraction-service Python sidecar (LangExtract) uses Gemini by default.
-To switch to Ollama for local dev, set these env vars before starting the sidecar:
-
-```bash
-export LANGEXTRACT_PROVIDER=openai_compat
-export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
-export LANGEXTRACT_API_KEY=ollama
-export LANGEXTRACT_MODEL=llama3.1:8b
-```
-
-> Check `services/extraction-service/python/` for the exact env var names — the sidecar
-> config may use different keys depending on the LangExtract version.
-
----
-
-## Recommended Models by Use Case
-
-| Use case | Recommended model | Why |
-| ------------------------------- | ----------------- | ------------------------------------ |
-| **Extraction evals (default)** | `llama3.1:8b` | Good JSON compliance, fast |
-| **Better JSON structure** | `qwen2.5:7b` | Trained heavily on structured output |
-| **Reasoning / complex triage** | `phi4` | Strong reasoning, fits in 9GB |
-| **Best quality (M2 Max+ only)** | `llama3.3:70b` | Needs ~40GB RAM |
-
----
-
-## Troubleshooting
-
-**`MLX dynamic library not available`**
-→ Harmless warning. Ollama falls back to Metal. No action needed.
-
-**Model pull fails (SSL / proxy)**
-
-```bash
-NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
-```
-
-**Ollama not responding**
-
-```bash
-# Check if running
-curl http://localhost:11434/api/tags
-
-# Restart
-brew services restart ollama
-# or
-pkill ollama && ollama serve
-```
-
-**JSON parse errors in evals**
-→ Model returned markdown-wrapped JSON (`json ... `). Add to prompt:
-`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
-
-**Slow inference**
-→ Check Activity Monitor — Ollama should be using GPU (Metal). If CPU-only, restart with:
-
-```bash
-OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
-```
-
-(These flags were shown in the Homebrew install output.)
-
----
-
-## Environment Variables Reference
-
-| Variable | Default | Purpose |
-| ------------------------ | --------------------------- | ------------------------------------------------ |
-| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Server bind address |
-| `OLLAMA_MODELS` | `~/.ollama/models` | Model storage path |
-| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep model loaded after last request |
-| `OLLAMA_FLASH_ATTENTION` | `false` | Enable flash attention (faster, less RAM) |
-| `OLLAMA_KV_CACHE_TYPE` | _(none)_ | KV cache quantization (`q8_0` = smaller RAM) |
-| `OLLAMA_NUM_PARALLEL` | `1` | Concurrent requests |
-| `OLLAMA_MODEL` | `llama3.1:8b` | Model used by `pnpm eval:ollama` |
-| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Used by promptfoo ollama config |
+| Component | Version | Status |
+| --------------- | -------------- | ------------------------------------------------------- |
+| Ollama | 0.16.2 | ✅ Installed, 2 models (qwen2.5-coder:32b, llama3.1:8b) |
+| whisper-cpp | 1.8.3 | ✅ Installed, model download pending (proxy blocked) |
+| ffmpeg | 8.0.1 | ✅ Installed |
+| Mission Control | Next.js 16 | ✅ Built, runs on :3100 |
+| Hardware | M4 Pro / 48 GB | ✅ Verified |
diff --git a/services/extraction-service/evals/.gitignore b/services/extraction-service/evals/.gitignore
new file mode 100644
index 00000000..333c1e91
--- /dev/null
+++ b/services/extraction-service/evals/.gitignore
@@ -0,0 +1 @@
+logs/