docs(local-llm): update original setup doc to redirect to docs/ structure
- LOCAL_LLMs_setup_mac_m4_48gb.md: replace 279-line monolith with quick start + documentation index linking to 9 topic-specific docs in docs/
- Add .gitignore for extraction-service eval logs (generated artifacts)
parent 3561deee52
commit 0c4210f5ff
@@ -1,278 +1,51 @@
|
||||
# Local LLM Setup — ByteLyst / LysnrAI / MindLyst
|
||||
|
||||
> Everything needed to run local OSS models for development, evals, and offline experimentation.
|
||||
> **This file is preserved for reference. Full documentation has moved to [`docs/`](docs/README.md).**
|
||||
>
|
||||
> Last updated: 2026-02-19
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
We use **Ollama** to run local LLMs on the dev Mac (Apple Silicon). Ollama exposes an
|
||||
OpenAI-compatible API at `http://localhost:11434/v1`, which plugs directly into:
|
||||
|
||||
- **promptfoo** evals (`evals/promptfoo.ollama.yaml` in extraction-service)
|
||||
- **Python sidecar** (LangExtract) — can be pointed at Ollama instead of Gemini
|
||||
- Any OpenAI SDK client — just change `baseURL` and `apiKey: "ollama"`
|
||||
|
||||
---
|
||||
|
||||
## Installation
|
||||
|
||||
### 1. Install Ollama
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
brew install ollama
|
||||
```
|
||||
|
||||
Version installed: **0.16.2**
|
||||
Binary: `/opt/homebrew/opt/ollama/bin/ollama`
|
||||
Models stored at: `~/.ollama/models/`
|
||||
|
||||
### 2. Start the server
|
||||
|
||||
```bash
|
||||
# Option A: foreground (dev)
|
||||
# 1. Start Ollama
|
||||
ollama serve
|
||||
|
||||
# Option B: background service (auto-start at login)
|
||||
brew services start ollama
|
||||
```
|
||||
# 2. Run best coding model
|
||||
ollama run qwen2.5-coder:32b
|
||||
|
||||
Server listens on: `http://127.0.0.1:11434`
|
||||
|
||||
> **Corporate proxy note:** Ollama auto-detects `HTTP_PROXY` / `HTTPS_PROXY` from the environment.
|
||||
> On this machine, the AT&T Forcepoint proxy (`http://cso.proxy.att.com:8080/`) is picked up automatically.
|
||||
> Model downloads go through it — if a pull fails with SSL errors, try:
|
||||
>
|
||||
> ```bash
> NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>
> ```
|
||||
|
||||
### 3. Pull a model
|
||||
|
||||
```bash
|
||||
ollama pull llama3.1:8b # recommended default (4.9 GB)
|
||||
ollama pull qwen2.5:7b # strong JSON output (4.7 GB)
|
||||
ollama pull phi4 # good reasoning (8.5 GB)
|
||||
# 3. Launch Mission Control dashboard
|
||||
cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
|
||||
# Open http://localhost:3100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Models Installed
|
||||
## Documentation Index
|
||||
|
||||
| Model | Size | Pull command | Notes |
| ------------------- | ------ | ------------------------------- | --------------------------------------------- |
| `llama3.1:8b` | 4.9 GB | `ollama pull llama3.1:8b` | ✅ Installed — default for evals |
| `qwen2.5-coder:32b` | 19 GB | `ollama pull qwen2.5-coder:32b` | ✅ Installed — best for code gen / Swift / TS |
|
||||
All documentation is now organized in [`docs/`](docs/README.md):
|
||||
|
||||
Check installed models:
|
||||
|
||||
```bash
ollama list
```
|
||||
| # | Document | Description |
| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
| 01 | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info |
| 02 | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior, API |
| 03 | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI, streaming, ffmpeg |
| 04 | [Multimodal Local Stack](docs/04-multimodal-local-stack.md) | Vision models, audio pipeline, video status, Kimi alternatives |
| 05 | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features |
| 06 | [Extraction Service Evals](docs/06-extraction-service-evals.md) | promptfoo suite, Ollama vs Gemini, Python sidecar config |
| 07 | [Model Recommendations](docs/07-model-recommendations.md) | Tiered guide: coding, reasoning, vision, embeddings |
| 08 | [Troubleshooting](docs/08-troubleshooting.md) | Corporate proxy, MLX warnings, common fixes |
| 09 | [Environment Variables](docs/09-environment-variables.md) | All config vars: Ollama, Whisper, dashboard, evals |
|
||||
|
||||
---
|
||||
|
||||
## Performance on This Machine
|
||||
## Current Status (2026-02-19)
|
||||
|
||||
- **Hardware:** Apple Silicon Mac (Metal GPU backend)
|
||||
- **MLX warning:** `MLX dynamic library not available` — harmless, falls back to Metal/CPU automatically
|
||||
- **Inference speed:** ~30–50 tok/s on M2/M3, ~10–15 tok/s on M1
|
||||
- **RAM usage:** ~6 GB for llama3.1:8b (unified memory)
|
||||
|
||||
---
|
||||
|
||||
## OpenAI-Compatible API
|
||||
|
||||
Ollama exposes a drop-in OpenAI API:
|
||||
|
||||
```
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
```
|
||||
|
||||
### Example: curl
|
||||
|
||||
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"hello\": \"world\"}"}],
    "response_format": {"type": "json_object"}
  }'
```
|
||||
|
||||
### Example: Node.js (OpenAI SDK)
|
||||
|
||||
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
```
|
||||
|
||||
### Example: Python
|
||||
|
||||
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
```
|
||||
|
||||
---
|
||||
|
||||
## Extraction Service Evals
|
||||
|
||||
The extraction-service has a full promptfoo eval suite that can run against Ollama.
|
||||
|
||||
### Files
|
||||
|
||||
| File | Purpose |
| --------------------------------------------------------- | -------------------------------------------------- |
| `services/extraction-service/evals/promptfoo.yaml` | Gemini evals (via extraction-service HTTP API) |
| `services/extraction-service/evals/promptfoo.ollama.yaml` | Same 19 cases, hits Ollama directly |
| `services/extraction-service/evals/compare-evals.sh` | Side-by-side Gemini vs Ollama pass-rate comparison |
| `services/extraction-service/evals/fixtures/golden.json` | Machine-readable golden fixtures |
| `services/extraction-service/evals/README.md` | Full usage docs |
|
||||
|
||||
### Running
|
||||
|
||||
```bash
cd services/extraction-service

# Ollama only (no extraction-service needed)
pnpm eval:ollama

# Different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama (needs extraction-service running + EXTRACTION_EVAL_TOKEN)
pnpm eval:compare
```
|
||||
|
||||
### Eval Coverage
|
||||
|
||||
| Task | Cases | Key assertions |
| ----------------------- | ----- | --------------------------------------------------------------- |
| `transcript-extraction` | 4 | action_item, deadline, person, decision, question |
| `triage` | 5 | brain_signal routing (health/work/money), emotion valence |
| `memory-insight` | 4 | pattern frequency, relationship, milestone, recurring_theme |
| `reflection-enrichment` | 4 | emotional_state valence, accomplishment, concern, goal_progress |
| `bug-report-extraction` | 2 | all 5 fields, severity level attribute |
|
||||
|
||||
**Total: 19 cases, 50+ assertions**
|
||||
|
||||
### Important: Assertion Pattern
|
||||
|
||||
Ollama returns a raw JSON **string** — assertions must parse it inline:
|
||||
|
||||
```yaml
# ✅ Correct
- type: javascript
  value: "const r=JSON.parse(output); return r.extractions.map(e=>e.extraction_class).includes('action');"

# ❌ Wrong — output is a string, not an object
- type: javascript
  value: output.classes.includes('action')
```
|
||||
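The same parse-then-check logic can be sketched in Python for ad hoc inspection of a saved model response outside promptfoo. This is a minimal sketch, not part of the eval suite: the `extractions` and `extraction_class` keys come from the assertion above, while the function name and the sample payload are hypothetical.

```python
import json


def has_extraction_class(output: str, wanted: str) -> bool:
    """Return True if the raw JSON string contains an extraction of class `wanted`."""
    # The model output arrives as a string, so it has to be parsed first.
    result = json.loads(output)
    classes = [item.get("extraction_class") for item in result.get("extractions", [])]
    return wanted in classes


# Hypothetical sample payload, shaped like the output the assertion above expects.
sample = '{"extractions": [{"extraction_class": "action"}]}'
print(has_extraction_class(sample, "action"))
```

Calling `.includes` or `in` directly on the unparsed string would either fail or do a substring match, which is the mistake the "Wrong" example illustrates.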
|
||||
### Cost
|
||||
|
||||
- **Gemini (via extraction-service):** ~$0.003–0.005 per full run (gemini-2.5-flash)
|
||||
- **Ollama (local):** $0.00 — fully offline after model download
|
||||
|
||||
---
|
||||
|
||||
## Pointing the Python Sidecar at Ollama
|
||||
|
||||
The extraction-service Python sidecar (LangExtract) uses Gemini by default.
|
||||
To switch to Ollama for local dev, set these env vars before starting the sidecar:
|
||||
|
||||
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
|
||||
|
||||
> Check `services/extraction-service/python/` for the exact env var names — the sidecar
|
||||
> config may use different keys depending on the LangExtract version.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Models by Use Case
|
||||
|
||||
| Use case | Recommended model | Why |
| ------------------------------- | ----------------- | ------------------------------------ |
| **Extraction evals (default)** | `llama3.1:8b` | Good JSON compliance, fast |
| **Better JSON structure** | `qwen2.5:7b` | Trained heavily on structured output |
| **Reasoning / complex triage** | `phi4` | Strong reasoning, fits in 9GB |
| **Best quality (M2 Max+ only)** | `llama3.3:70b` | Needs ~40GB RAM |
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**`MLX dynamic library not available`**
|
||||
→ Harmless warning. Ollama falls back to Metal. No action needed.
|
||||
|
||||
**Model pull fails (SSL / proxy)**
|
||||
|
||||
```bash
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull llama3.1:8b
```
|
||||
|
||||
**Ollama not responding**
|
||||
|
||||
```bash
# Check if running
curl http://localhost:11434/api/tags

# Restart
brew services restart ollama
# or
pkill ollama && ollama serve
```
|
||||
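The `/api/tags` check can also be scripted. The sketch below assumes the server address from this document and the standard `/api/tags` response shape (a top-level `models` list whose entries have a `name` field); the function names are hypothetical, and nothing here touches the network until `fetch_model_names()` is called.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> list[str]:
    """Query a running Ollama server; raises urllib.error.URLError if it is down."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

If `fetch_model_names()` raises a connection error, the server is not listening and one of the restart commands above applies.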
|
||||
**JSON parse errors in evals**
|
||||
→ Model returned markdown-wrapped JSON (`` ```json ... ``` ``). Add to prompt:
|
||||
`Return ONLY a valid JSON object — no markdown, no backticks, no explanation.`
|
||||
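When the prompt change alone is not enough, a defensive parser can strip the wrapper before decoding. This is a minimal sketch under the assumption that the wrapper is a single leading and trailing code fence; `parse_model_json` is a hypothetical helper, not something the eval suite ships.

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence marker


def parse_model_json(raw: str):
    """Decode model output as JSON, tolerating one surrounding markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including an optional "json" language tag.
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return json.loads(text)
```

Output that is already bare JSON passes through unchanged, so the helper is safe to apply to every response.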
|
||||
**Slow inference**
|
||||
→ Check Activity Monitor — Ollama should be using GPU (Metal). If CPU-only, restart with:
|
||||
|
||||
```bash
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```
|
||||
|
||||
(These flags were shown in the Homebrew install output.)
|
||||
|
||||
---
|
||||
|
||||
## Environment Variables Reference
|
||||
|
||||
| Variable | Default | Purpose |
| ------------------------ | --------------------------- | ------------------------------------------------ |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | Server bind address |
| `OLLAMA_MODELS` | `~/.ollama/models` | Model storage path |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep model loaded after last request |
| `OLLAMA_FLASH_ATTENTION` | `false` | Enable flash attention (faster, less RAM) |
| `OLLAMA_KV_CACHE_TYPE` | _(none)_ | KV cache quantization (`q8_0` = smaller RAM) |
| `OLLAMA_NUM_PARALLEL` | `1` | Concurrent requests |
| `OLLAMA_MODEL` | `llama3.1:8b` | Model used by `pnpm eval:ollama` |
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Used by promptfoo ollama config |
|
||||
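For scripts that need the same eval settings, the two eval-related variables can be resolved with the defaults from the table. This is only a sketch: how the promptfoo config itself reads these variables is not shown here, and `resolve_eval_config` is a hypothetical name.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "OLLAMA_MODEL": "llama3.1:8b",
    "OLLAMA_BASE_URL": "http://localhost:11434/v1",
}


def resolve_eval_config(env=None) -> dict:
    """Return eval settings, falling back to defaults for unset or empty variables."""
    env = os.environ if env is None else env
    return {key: env.get(key) or default for key, default in DEFAULTS.items()}
```

Passing a plain dict as `env` makes the helper easy to exercise without changing the real environment.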
| Component | Version | Status |
| --------------- | -------------- | ------------------------------------------------------- |
| Ollama | 0.16.2 | ✅ Installed, 2 models (qwen2.5-coder:32b, llama3.1:8b) |
| whisper-cpp | 1.8.3 | ✅ Installed, model download pending (proxy blocked) |
| ffmpeg | 8.0.1 | ✅ Installed |
| Mission Control | Next.js 16 | ✅ Built, runs on :3100 |
| Hardware | M4 Pro / 48 GB | ✅ Verified |
|
||||
|
||||
1
services/extraction-service/evals/.gitignore
vendored
Normal file
@@ -0,0 +1 @@
|
||||
logs/
|
||||