docs(local-llm): update original setup doc to redirect to docs/ structure

- LOCAL_LLMs_setup_mac_m4_48gb.md: replace 279-line monolith with quick start + documentation index linking to 9 topic-specific docs in docs/
- Add .gitignore for extraction-service eval logs (generated artifacts)
2026-02-19 13:01:35 -08:00 · 0c4210f5ff
commit 0c4210f5ff
parent 3561deee52
2 changed files with 31 additions and 257 deletions
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -1,278 +1,51 @@
 # Local LLM Setup — ByteLyst / LysnrAI / MindLyst

-> Everything needed to run local OSS models for development, evals, and offline experimentation.
+> **This file is preserved for reference. Full documentation has moved to [`docs/`](docs/README.md).**
+>
 > Last updated: 2026-02-19

 ---

-## Overview
-
-We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
-OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
-
-- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
-- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
-- Any OpenAI SDK client — just change `baseURL` and `apiKey: "ollama"`
-
----
-
-## Installation
-
-### 1. Install Ollama
+## Quick Start

 ```bash
-brew install ollama
-```
-
-Version installed: **0.16.2**
-Binary: `/opt/homebrew/opt/ollama/bin/ollama`
-Models stored at: `~/.ollama/models/`
-
-### 2. Start the server
-
-```bash
-# Option A: foreground (dev)
+# 1. Start Ollama
 ollama serve

-# Option B: background service (auto-start at login)
-brew services start ollama
-```
+# 2. Run best coding model
+ollama run qwen2.5-coder:32b

-Server listens on: `http://127.0.0.1:11434`
-
-> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
-> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
-> Model downloads go through it — if a pull fails with SSL errors, try:
->
-> ```bash
-> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
-> ```
-
-### 3. Pull a model
-
-```bash
-ollama pull llama3.1:8b       # recommended default (4.9 GB)
-ollama pull qwen2.5:7b        # strong JSON output (4.7 GB)
-ollama pull phi4               # good reasoning (8.5 GB)
+# 3. Launch Mission Control dashboard
+cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
+# Open http://localhost:3100
 ```

 ---

-## Models Installed
+## Documentation Index

-| Model               | Size   | Pull command                    | Notes                                         |
-| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
-| `llama3.1:8b`       | 4.9 GB | `ollama pull llama3.1:8b`       | ✅ Installed — default for evals              |
-| `qwen2.5-coder:32b` | 19 GB  | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
+All documentation is now organized in [`docs/`](docs/README.md):

-Check installed models:
-
-```bash
-ollama list
-```
+| #   | Document                                                          | Description                                                         |
+| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
+| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info              |
+| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API |
+| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg        |
+| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives      |
+| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features               |
+| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config            |
+| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                 |
+| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                         |
+| 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                  |

 ---

-## Performance on This Machine
+## Current Status (2026-02-19)

-- **Hardware:** Apple Silicon Mac (Metal GPU backend)
-- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
-- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
-- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
-
----
-
-## OpenAI-Compatible API
-
-Ollama exposes a drop-in OpenAI API:
-
-```
-Base URL:  http://localhost:11434/v1
-API Key:   ollama  (any non-empty string)
-```
-
-### Example: curl
-
-```bash
-curl http://localhost:11434/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "llama3.1:8b",
-    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
-    "response_format": {"type": "json_object"}
-  }'
-```
-
-### Example: Node.js (OpenAI SDK)
-
-```typescript
-import OpenAI from 'openai';
-
-const client = new OpenAI({
-  baseURL: 'http://localhost:11434/v1',
-  apiKey: 'ollama',
-});
-
-const res = await client.chat.completions.create({
-  model: 'llama3.1:8b',
-  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
-  response_format: { type: 'json_object' },
-});
-```
-
-### Example: Python
-
-```python
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
-
-response = client.chat.completions.create(
-    model="llama3.1:8b",
-    messages=[{"role": "user", "content": "Extract action items from: ..."}],
-    response_format={"type": "json_object"},
-)
-```
-
----
-
-## Extraction Service Evals
-
-The extraction-service has a full promptfoo eval suite that can run against Ollama.
-
-### Files
-
-| File                                                      | Purpose                                            |
-| --------------------------------------------------------- | -------------------------------------------------- |
-| `services/extraction-service/evals/promptfoo.yaml`        | Gemini evals (via extraction-service HTTP API)     |
-| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly                |
-| `services/extraction-service/evals/compare-evals.sh`      | Side-by-side Gemini vs Ollama pass-rate comparison |
-| `services/extraction-service/evals/fixtures/golden.json`  | Machine-readable golden fixtures                   |
-| `services/extraction-service/evals/README.md`             | Full usage docs                                    |
-
-### Running
-
-```bash
-cd services/extraction-service
-
-# Ollama only (no extraction-service needed)
-pnpm eval:ollama
-
-# Different model
-OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
-
-# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
-pnpm eval:compare
-```
-
-### Eval Coverage
-
-| Task                    | Cases | Key assertions                                                  |
-| ----------------------- | ----- | --------------------------------------------------------------- |
-| `transcript-extraction` | 4     | action_item, deadline, person, decision, question               |
-| `triage`                | 5     | brain_signal routing (health/work/money), emotion valence       |
-| `memory-insight`        | 4     | pattern frequency, relationship, milestone, recurring_theme     |
-| `reflection-enrichment` | 4     | emotional_state valence, accomplishment, concern, goal_progress |
-| `bug-report-extraction` | 2     | all 5 fields, severity level attribute                          |
-
-**Total: 19 cases, 50+ assertions**
-
-### Important: Assertion Pattern
-
-Ollama returns a raw JSON **string** — assertions must parse it inline:
-
-```yaml
-# ✅ Correct
-- type: javascript
-  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
-
-# ❌ Wrong — output is a string, not an object
-- type: javascript
-  value: output.classes.includes('action')
-```
-
-### Cost
-
-- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
-- **Ollama (local):** $0.00 — fully offline after model download
-
----
-
-## Pointing the Python Sidecar at Ollama
-
-The extraction-service Python sidecar (LangExtract) uses Gemini by default.
-To switch to Ollama for local dev, set these env vars before starting the sidecar:
-
-```bash
-export LANGEXTRACT_PROVIDER=openai_compat
-export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
-export LANGEXTRACT_API_KEY=ollama
-export LANGEXTRACT_MODEL=llama3.1:8b
-```
-
-> Check `services/extraction-service/python/` for the exact env var names — the sidecar
-> config may use different keys depending on the LangExtract version.
-
----
-
-## Recommended Models by Use Case
-
-| Use case                        | Recommended model | Why                                  |
-| ------------------------------- | ----------------- | ------------------------------------ |
-| **Extraction evals (default)**  | `llama3.1:8b`     | Good JSON compliance, fast           |
-| **Better JSON structure**       | `qwen2.5:7b`      | Trained heavily on structured output |
-| **Reasoning / complex triage**  | `phi4`            | Strong reasoning, fits in 9GB        |
-| **Best quality (M2 Max+ only)** | `llama3.3:70b`    | Needs ~40GB RAM                      |
-
----
-
-## Troubleshooting
-
-**`MLX dynamic library not available`**
-→ Harmless warning. Ollama falls back to Metal. No action needed.
-
-**Model pull fails (SSL / proxy)**
-
-```bash
-NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
-```
-
-**Ollama not responding**
-
-```bash
-# Check if running
-curl http://localhost:11434/api/tags
-
-# Restart
-brew services restart ollama
-# or
-pkill ollama && ollama serve
-```
-
-**JSON parse errors in evals**
-→ Model returned markdown-wrapped JSON (`json ... `). Add to prompt:
-`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
-
-**Slow inference**
-→ Check Activity Monitor — Ollama should be using GPU (Metal). If CPU-only, restart with:
-
-```bash
-OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
-```
-
-(These flags were shown in the Homebrew install output.)
-
----
-
-## Environment Variables Reference
-
-| Variable                 | Default                     | Purpose                                          |
-| ------------------------ | --------------------------- | ------------------------------------------------ |
-| `OLLAMA_HOST`            | `http://127.0.0.1:11434`    | Server bind address                              |
-| `OLLAMA_MODELS`          | `~/.ollama/models`          | Model storage path                               |
-| `OLLAMA_KEEP_ALIVE`      | `5m`                        | How long to keep model loaded after last request |
-| `OLLAMA_FLASH_ATTENTION` | `false`                     | Enable flash attention (faster, less RAM)        |
-| `OLLAMA_KV_CACHE_TYPE`   | _(none)_                    | KV cache quantization (`q8_0` = smaller RAM)     |
-| `OLLAMA_NUM_PARALLEL`    | `1`                         | Concurrent requests                              |
-| `OLLAMA_MODEL`           | `llama3.1:8b`               | Model used by `pnpm eval:ollama`                 |
-| `OLLAMA_BASE_URL`        | `http://localhost:11434/v1` | Used by promptfoo ollama config                  |
+| Component       | Version        | Status                                                  |
+| --------------- | -------------- | ------------------------------------------------------- |
+| Ollama          | 0.16.2         | ✅ Installed, 2 models (qwen2.5-coder:32b, llama3.1:8b) |
+| whisper-cpp     | 1.8.3          | ✅ Installed, model download pending (proxy blocked)    |
+| ffmpeg          | 8.0.1          | ✅ Installed                                            |
+| Mission Control | Next.js 16     | ✅ Built, runs on :3100                                 |
+| Hardware        | M4 Pro / 48 GB | ✅ Verified                                             |
--- a/services/extraction-service/evals/.gitignore
+++ b/services/extraction-service/evals/.gitignore
@@ -0,0 +1 @@
+logs/
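
Review note: this commit replaces the monolith with an index that links to nine files under `docs/`, so a dead link would leave readers with no copy of the removed content. A minimal sketch of a link check follows. The helper name `find_broken_links` and the regular expression are illustrative assumptions, not part of this repository, and the check only covers inline Markdown links of the form `[text](target)`.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [Troubleshooting](docs/08-troubleshooting.md).
# Assumption: targets contain no spaces or parentheses, as in the index above.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(index_path):
    """Return the relative link targets in index_path that do not exist on disk."""
    index_path = Path(index_path)
    text = index_path.read_text(encoding="utf-8")
    broken = []
    for target in LINK_RE.findall(text):
        # Skip external URLs and in-page anchors; only local files are checked.
        if "://" in target or target.startswith(("#", "mailto:")):
            continue
        # Drop any "#section" fragment before resolving the path.
        local = target.split("#", 1)[0]
        # Links are resolved relative to the directory containing the index file.
        if not (index_path.parent / local).exists():
            broken.append(target)
    return broken
```

Calling `find_broken_links("__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md")` from the repository root would return an empty list if all nine `docs/` targets and `docs/README.md` are present in the tree.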