docs(local-llm): update original setup doc to redirect to docs/ structure

- LOCAL_LLMs_setup_mac_m4_48gb.md: replace 279-line monolith with quick start
  + documentation index linking to 9 topic-specific docs in docs/
- Add .gitignore for extraction-service eval logs (generated artifacts)
saravanakumardb1 2026-02-19 13:01:35 -08:00
parent 3561deee52
commit 0c4210f5ff
2 changed files with 31 additions and 257 deletions


@@ -1,278 +1,51 @@
# Local LLM Setup — ByteLyst / LysnrAI / MindLyst
> Everything needed to run local OSS models for development, evals, and offline experimentation.
> **This file is preserved for reference. Full documentation has moved to [`docs/`](docs/README.md).**
>
> Last updated: 2026-02-19
---
## Overview
We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
- Any OpenAI SDK client — just change `baseURL` and `apiKey: "ollama"`
---
## Installation
### 1. Install Ollama
## Quick Start
```bash
brew install ollama
```
Version installed: **0.16.2**
Binary: `/opt/homebrew/opt/ollama/bin/ollama`
Models stored at: `~/.ollama/models/`
### 2. Start the server
```bash
# Option A: foreground (dev)
# 1. Start Ollama
ollama serve
# Option B: background service (auto-start at login)
brew services start ollama
```
# 2. Run best coding model
ollama run qwen2.5-coder:32b
Server listens on: `http://127.0.0.1:11434`
> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
> Model downloads go through it — if a pull fails with SSL errors, try:
>
> ```bash
> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
> ```
### 3. Pull a model
```bash
ollama pull llama3.1:8b # recommended default (4.9 GB)
ollama pull qwen2.5:7b # strong JSON output (4.7 GB)
ollama pull phi4 # good reasoning (8.5 GB)
# 3. Launch Mission Control dashboard
cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
# Open http://localhost:3100
```
---
## Models Installed
## Documentation Index
| Model | Size | Pull command | Notes |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `llama3.1:8b` | 4.9 GB | `ollama pull llama3.1:8b` | ✅ Installed — default for evals |
| `qwen2.5-coder:32b` | 19 GB | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
All documentation is now organized in [`docs/`](docs/README.md):
Check installed models:
```bash
ollama list
```
| # | Document | Description |
| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
| 01 | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info |
| 02 | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior, API |
| 03 | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI, streaming, ffmpeg |
| 04 | [Multimodal Local Stack](docs/04-multimodal-local-stack.md) | Vision models, audio pipeline, video status, Kimi alternatives |
| 05 | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features |
| 06 | [Extraction Service Evals](docs/06-extraction-service-evals.md) | promptfoo suite, Ollama vs Gemini, Python sidecar config |
| 07 | [Model Recommendations](docs/07-model-recommendations.md) | Tiered guide: coding, reasoning, vision, embeddings |
| 08 | [Troubleshooting](docs/08-troubleshooting.md) | Corporate proxy, MLX warnings, common fixes |
| 09 | [Environment Variables](docs/09-environment-variables.md) | All config vars: Ollama, Whisper, dashboard, evals |
---
## Performance on This Machine
## Current Status (2026-02-19)
- **Hardware:** Apple Silicon Mac (Metal GPU backend)
- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
- **Inference speed:** ~30-50 tok/s on M2/M3, ~10-15 tok/s on M1
- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
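
To measure throughput on a particular machine instead of relying on rough figures, tokens per second can be derived from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's native `/api/generate` response reports. This is a minimal sketch: the helper name `tokens_per_second` and the sample numbers are illustrative assumptions, not measurements from this machine.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response body."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Made-up sample values, for illustration only.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(tokens_per_second(sample))  # 40.0
```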
---
## OpenAI-Compatible API
Ollama exposes a drop-in OpenAI API:
```
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
```
### Example: curl
```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
"response_format": {"type": "json_object"}
}'
```
### Example: Node.js (OpenAI SDK)
```typescript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama',
});
const res = await client.chat.completions.create({
model: 'llama3.1:8b',
messages: [{ role: 'user', content: 'Extract action items from: ...' }],
response_format: { type: 'json_object' },
});
```
### Example: Python
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Extract action items from: ..."}],
response_format={"type": "json_object"},
)
```
---
## Extraction Service Evals
The extraction-service has a full promptfoo eval suite that can run against Ollama.
### Files
| File | Purpose |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |
### Running
```bash
cd services/extraction-service
# Ollama only (no extraction-service needed)
pnpm eval:ollama
# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama
# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```
### Eval Coverage
| Task | Cases | Key assertions |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
**Total: 19 cases, 50+ assertions**
### Important: Assertion Pattern
Ollama returns a raw JSON **string** — assertions must parse it inline:
```yaml
# ✅ Correct
- type: javascript
value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"
# ❌ Wrong — output is a string, not an object
- type: javascript
value: output.classes.includes('action')
```
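The same parse-first idea can be sketched in Python for checking a raw response outside promptfoo. The `extractions` / `extraction_class` shape is taken from the assertion above; the helper name `has_class` and the sample payload are hypothetical.

```python
import json


def has_class(output: str, wanted: str) -> bool:
    """Return True if the raw JSON string contains an extraction of class `wanted`."""
    # The model output arrives as a string, so parse it before touching any fields.
    result = json.loads(output)
    classes = [e.get("extraction_class") for e in result.get("extractions", [])]
    return wanted in classes


# Hypothetical sample output, shaped the way the assertion above expects.
sample = '{"extractions": [{"extraction_class": "action", "extraction_text": "send report"}]}'
print(has_class(sample, "action"))
```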
### Cost
- **Gemini (via extraction-service):** ~$0.003-0.005 per full run (gemini-2.5-flash)
- **Ollama (local):** $0.00 — fully offline after model download
---
## Pointing the Python Sidecar at Ollama
The extraction-service Python sidecar (LangExtract) uses Gemini by default.
To switch to Ollama for local dev, set these env vars before starting the sidecar:
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
> Check `services/extraction-service/python/` for the exact env var names — the sidecar
> config may use different keys depending on the LangExtract version.
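
For a quick check of which values a process would actually pick up, the variables can be read into a small config object. This is only a sketch: the variable names are the ones listed above and carry the same caveat, while the `SidecarLLMConfig` class, the `load_llm_config` function, and the fallback defaults are hypothetical and do not reflect the sidecar's real code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SidecarLLMConfig:
    provider: str
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> SidecarLLMConfig:
    """Read the LANGEXTRACT_* variables, falling back to the local Ollama values."""
    # Fallbacks are assumptions for local dev, not the sidecar's actual defaults.
    return SidecarLLMConfig(
        provider=env.get("LANGEXTRACT_PROVIDER", "openai_compat"),
        base_url=env.get("LANGEXTRACT_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("LANGEXTRACT_API_KEY", "ollama"),
        model=env.get("LANGEXTRACT_MODEL", "llama3.1:8b"),
    )


# Example with an explicit mapping instead of the real environment.
config = load_llm_config({"LANGEXTRACT_MODEL": "qwen2.5:7b"})
print(config.model, config.base_url)
```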
---
## Recommended Models by Use Case
| Use case | Recommended model | Why |
| ------------------------------- | ----------------- | ------------------------------------ |
| **Extraction evals (default)** | `llama3.1:8b` | Good JSON compliance, fast |
| **Better JSON structure** | `qwen2.5:7b` | Trained heavily on structured output |
| **Reasoning / complex triage** | `phi4` | Strong reasoning, fits in 9GB |
| **Best quality (M2 Max+ only)** | `llama3.3:70b` | Needs ~40GB RAM |
---
## Troubleshooting
**`MLX dynamic library not available`**
→ Harmless warning. Ollama falls back to Metal. No action needed.
**Model pull fails (SSL / proxy)**
```bash
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
```
**Ollama not responding**
```bash
# Check if running
curl http://localhost:11434/api/tags
# Restart
brew services restart ollama
# or
pkill ollama && ollama serve
```
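The same reachability check can be scripted, for example as a pre-flight step before an eval run. This is a sketch using only the standard library; the function name `list_ollama_models` is an assumption, and it relies on `/api/tags` returning a JSON object with a `models` array whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as "not responding".
        return None
    return [m.get("name") for m in payload.get("models", [])]
```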
**JSON parse errors in evals**
→ Model returned JSON wrapped in a Markdown code fence (three backticks plus `json`). Add to prompt:
`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
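
When the prompt cannot be changed, a defensive parser is another option: strip an optional Markdown fence before decoding. This is a sketch rather than part of the eval suite; `parse_model_json` is a hypothetical helper name, and it only handles a single fence wrapping the whole reply.

```python
import json
import re

TICKS = "`" * 3  # a literal triple backtick
# Optional fence (with or without a "json" tag) around the whole payload.
_FENCE = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_model_json(raw: str):
    """Decode model output as JSON, tolerating a Markdown code fence around it."""
    match = _FENCE.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


wrapped = TICKS + 'json\n{"hello": "world"}\n' + TICKS
print(parse_model_json(wrapped))
```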
**Slow inference**
→ Check Activity Monitor — Ollama should be using GPU (Metal). If CPU-only, restart with:
```bash
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```
(These flags were shown in the Homebrew install output.)
---
## Environment Variables Reference
| Variable | Default | Purpose |
| ------------------------ | --------------------------- | ------------------------------------------------ |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Server bind address |
| `OLLAMA_MODELS` | `~/.ollama/models` | Model storage path |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep model loaded after last request |
| `OLLAMA_FLASH_ATTENTION` | `false` | Enable flash attention (faster, less RAM) |
| `OLLAMA_KV_CACHE_TYPE` | _(none)_ | KV cache quantization (`q8_0` = smaller RAM) |
| `OLLAMA_NUM_PARALLEL` | `1` | Concurrent requests |
| `OLLAMA_MODEL` | `llama3.1:8b` | Model used by `pnpm eval:ollama` |
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Used by promptfoo ollama config |
| Component | Version | Status |
| --------------- | -------------- | ------------------------------------------------------- |
| Ollama | 0.16.2 | ✅ Installed, 2 models (qwen2.5-coder:32b, llama3.1:8b) |
| whisper-cpp | 1.8.3 | ✅ Installed, model download pending (proxy blocked) |
| ffmpeg | 8.0.1 | ✅ Installed |
| Mission Control | Next.js 16 | ✅ Built, runs on :3100 |
| Hardware | M4 Pro / 48 GB | ✅ Verified |


@@ -0,0 +1 @@
logs/