docs(local-llms): add developer guide — API endpoint, code examples, model selection

- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00
2026-02-19 18:43:06 -08:00
commit 4090c8aa13
parent 5deb5efdcf
2 changed files with 275 additions and 11 deletions
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -27,7 +27,8 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
 All documentation is now organized in [`docs/`](docs/README.md):

 | #   | Document                                                          | Description                                                          |
-| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
+| --- | ----------------------------------------------------------------- | -------------------------------------------------------------------- |
+| 00  | [Developer Guide](docs/00-developer-guide.md)                     | **Start here** — API endpoint, code examples, model selection, evals |
 | 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info               |
 | 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API  |
 | 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg         |
--- a/__LOCAL_LLMs/docs/00-developer-guide.md
+++ b/__LOCAL_LLMs/docs/00-developer-guide.md
@@ -0,0 +1,263 @@
+# 00 — Developer Guide: Local LLM with Ollama
+
+> How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.
+
+---
+
+## What Is This?
+
+This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. It works as a drop-in replacement for OpenAI/Gemini/Azure in most code that uses the OpenAI SDK. Ollama implements only a subset of the OpenAI API, so check any endpoint beyond chat completions before relying on it.
+
+**Models installed:**
+
+| Model               | Size    | Best For                                  |
+| ------------------- | ------- | ----------------------------------------- |
+| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
+| `llama3.1:8b`       | 4.7 GB  | Fast evals, general tasks                 |
+
+---
+
+## Quick Start
+
+### 1. Check Ollama is running
+
+```bash
+curl http://localhost:11434/api/tags
+```
+
+If it returns a JSON list of models, Ollama is running. If the request fails, start the server:
+
+```bash
+ollama serve          # start in foreground
+# or
+brew services start ollama   # start as background service
+```
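The same health check can be scripted. The sketch below is a minimal example, assuming the `/api/tags` response has the shape `{"models": [{"name": ...}, ...]}`. The helper names `model_names` and `fetch_models` are illustrative and not part of this repo.

```python
import json
import urllib.request
from typing import List, Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint from this guide


def model_names(tags_payload: dict) -> List[str]:
    """Return model names from a parsed /api/tags response (assumed shape)."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_models(url: str = OLLAMA_TAGS_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection refused and timeouts; ValueError covers bad JSON.
        return None
```

Calling `fetch_models()` returns `None` when nothing is listening on port 11434, which is the cue to run `ollama serve`.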
+
+### 2. List available models
+
+```bash
+ollama list
+```
+
+### 3. Chat with a model (interactive)
+
+```bash
+ollama run llama3.1:8b
+ollama run qwen2.5-coder:32b
+```
+
+---
+
+## API Endpoint Reference
+
+| Property            | Value                                 |
+| ------------------- | ------------------------------------- |
+| **Base URL**        | `http://localhost:11434/v1`           |
+| **API Key**         | `ollama` (any non-empty string works) |
+| **Protocol**        | OpenAI-compatible REST                |
+| **Models endpoint** | `http://localhost:11434/api/tags`     |
+| **Loaded models**   | `http://localhost:11434/api/ps`       |
+
+---
+
+## Using in Code
+
+### curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### TypeScript / Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const ollama = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await ollama.chat.completions.create({
+  model: 'qwen2.5-coder:32b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+
+console.log(res.choices[0].message.content);
+```
+
+### Python (OpenAI SDK)
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:11434/v1",
+    api_key="ollama",
+)
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+
+print(response.choices[0].message.content)
+```
+
+### Environment variable pattern (recommended)
+
+Instead of hardcoding the URL, use env vars so code works with both local and cloud:
+
+```bash
+# .env.local (local dev)
+OPENAI_BASE_URL=http://localhost:11434/v1
+OPENAI_API_KEY=ollama
+LLM_MODEL=llama3.1:8b
+
+# .env.production
+OPENAI_BASE_URL=https://api.openai.com/v1
+OPENAI_API_KEY=sk-...
+LLM_MODEL=gpt-4o
+```
+
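+```typescript
+import OpenAI from 'openai';
+
+// Endpoint and key come from .env.local or .env.production
+const client = new OpenAI({
+  baseURL: process.env.OPENAI_BASE_URL,
+  apiKey: process.env.OPENAI_API_KEY,
+});
+
+// The model name is read from the environment as well
+const model = process.env.LLM_MODEL ?? 'llama3.1:8b';
+```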
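For Python callers, one possible way to apply the same pattern is sketched below. The helper `llm_settings` and the `DEFAULTS` table are hypothetical names, and the fallback to the local Ollama values is an assumption about what is convenient in development, not something the repo prescribes.

```python
import os
from typing import Dict, Mapping

# Fallbacks mirror the .env.local values from this guide (assumed defaults).
DEFAULTS = {
    "OPENAI_BASE_URL": "http://localhost:11434/v1",
    "OPENAI_API_KEY": "ollama",
    "LLM_MODEL": "llama3.1:8b",
}


def llm_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve base URL, API key, and model name, falling back to local Ollama."""
    return {
        "base_url": env.get("OPENAI_BASE_URL") or DEFAULTS["OPENAI_BASE_URL"],
        "api_key": env.get("OPENAI_API_KEY") or DEFAULTS["OPENAI_API_KEY"],
        "model": env.get("LLM_MODEL") or DEFAULTS["LLM_MODEL"],
    }
```

The returned `base_url` and `api_key` can be passed straight to `OpenAI(...)`, and `model` to `chat.completions.create(...)`. A production service may prefer to fail fast on a missing variable rather than fall back to localhost.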
+
+---
+
+## Running Extraction Service Evals Locally
+
+The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.
+
+```bash
+cd services/extraction-service
+
+# Run evals with default model (llama3.1:8b)
+pnpm eval:ollama
+
+# Run with a different model
+OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama
+
+# Run unattended with logging + macOS notification on completion
+OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
+```
+
+Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
+
+---
+
+## Pointing the Extraction Service Python Sidecar at Ollama
+
+By default the sidecar uses Gemini. Override with:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+---
+
+## Model Management
+
+```bash
+# Pull a new model
+NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b
+
+# See what's loaded in RAM right now
+ollama ps
+
+# Unload a model from RAM (free up memory)
+curl http://localhost:11434/api/generate \
+  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
+
+# Remove a model from disk
+ollama rm <model>
+```
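The unload call can also be issued from a script, for example between eval runs. This is a rough sketch that sends the same payload as the curl command in this section. The function names `unload_request` and `unload` are illustrative only.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def unload_request(model: str, url: str = OLLAMA_GENERATE_URL) -> urllib.request.Request:
    """Build the POST request that asks Ollama to evict `model` from RAM."""
    # Same payload as the curl example: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str) -> None:
    """Send the unload request; raises urllib.error.URLError if Ollama is down."""
    with urllib.request.urlopen(unload_request(model), timeout=10) as resp:
        resp.read()  # drain the response; its body is not needed here
```

After `unload("qwen2.5-coder:32b")`, `ollama ps` should no longer list the model.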
+
+---
+
+## Choosing the Right Model
+
+| Task                             | Recommended Model   | Why                           |
+| -------------------------------- | ------------------- | ----------------------------- |
+| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally     |
+| Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM          |
+| Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance   |
+| Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B |
+| Quick one-off questions          | `llama3.1:8b`       | Fastest response              |
+
+See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.
+
+---
+
+## Important: JSON Output
+
+Always request JSON mode explicitly; models follow the format more reliably with it:
+
+```typescript
+response_format: { type: 'json_object' },
+```
+
+In promptfoo assertions, `output` is a **raw string**, not an object, so parse it first:
+
+```javascript
+// ✅ Correct
+JSON.parse(output).extractions.map(function(e){ return e.extraction_class })
+
+// ❌ Wrong — output is not already an object
+output.extractions.map(...)
+```
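The parse-first rule applies to any consumer of the model's reply, not only promptfoo. As a minimal Python sketch, assuming the reply uses the `extractions` and `extraction_class` fields from the assertion above (the helper name `extraction_classes` is hypothetical):

```python
import json
from typing import List


def extraction_classes(output: str) -> List[str]:
    """Parse a raw JSON reply and return the class of each extraction."""
    data = json.loads(output)  # the reply arrives as text, not as a dict
    return [e["extraction_class"] for e in data.get("extractions", [])]
```

Passing an already-parsed dict here raises `TypeError`, which makes the string-versus-object mix-up visible early.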
+
+### DeepSeek R1 — strip `<think>` blocks
+
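+R1 models emit a `<think>...</think>` reasoning trace before the JSON. Strip it before parsing:
+
+```typescript
+// content is typed as string | null, so default to an empty string
+const raw = res.choices[0].message.content ?? '';
+// Drop every <think>...</think> block, then trim surrounding whitespace
+const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
+const result = JSON.parse(json);
+```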
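For Python callers, the same stripping step might look like the sketch below. The name `parse_r1_json` is illustrative, and the sketch assumes every `<think>` block is properly closed.

```python
import json
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str):
    """Remove <think> blocks from a model reply, then parse the rest as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)
```

If the reply is cut off before `</think>`, nothing is stripped and `json.loads` raises `json.JSONDecodeError`, so callers may want to catch that and retry.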
+
+---
+
+## Troubleshooting
+
+| Problem                                     | Fix                                                                                 |
+| ------------------------------------------- | ----------------------------------------------------------------------------------- |
+| `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                  |
+| Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                  |
+| `MLX dynamic library not available` warning | Harmless — Metal backend is used automatically                                      |
+| Slow responses                              | Check `ollama ps` — model may be loading cold from disk (first request takes 5–15s) |
+| Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"`                     |
+| JSON parse error with R1 models             | Strip `<think>...</think>` block before parsing                                     |
+
+See [08-troubleshooting.md](08-troubleshooting.md) for more.
+
+---
+
+## Further Reading
+
+| Doc                                                                  | Contents                                                     |
+| -------------------------------------------------------------------- | ------------------------------------------------------------ |
+| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
+| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
+| [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
+| [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
+| [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
+| [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |