docs(local-llms): add developer guide — API endpoint, code examples, model selection

- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00
2026-02-19 18:43:06 -08:00 · 4090c8aa13
commit 4090c8aa13
parent 5deb5efdcf
2 changed files with 275 additions and 11 deletions
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -26,17 +26,18 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
 All documentation is now organized in [`docs/`](docs/README.md):
-| #   | Document                                                          | Description                                                         |
+| #   | Document                                                          | Description                                                          |
-| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
+| --- | ----------------------------------------------------------------- | -------------------------------------------------------------------- |
-| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info              |
+| 00  | [Developer Guide](docs/00-developer-guide.md)                     | **Start here** — API endpoint, code examples, model selection, evals |
-| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API |
+| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info               |
-| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg        |
+| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API  |
-| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives      |
+| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg         |
-| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features               |
+| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives       |
-| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config            |
+| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features                |
-| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                 |
+| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config             |
-| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                         |
+| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                  |
-| 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                  |
+| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                          |
 | 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                   |
 ---
--- a/__LOCAL_LLMs/docs/00-developer-guide.md
+++ b/__LOCAL_LLMs/docs/00-developer-guide.md
@@ -0,0 +1,263 @@
 # 00 — Developer Guide: Local LLM with Ollama
 > How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.
 ---
 ## What Is This?
 This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.
 **Models installed:**
 | Model               | Size    | Best For                                  |
 | ------------------- | ------- | ----------------------------------------- |
 | `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
 | `llama3.1:8b`       | 4.7 GB  | Fast evals, general tasks                 |
 ---
 ## Quick Start
 ### 1. Check Ollama is running
 ```bash
 curl http://localhost:11434/api/tags
 ```
 If it returns a JSON list of models, Ollama is running. If the request fails, start the server:
 ```bash
 ollama serve          # start in foreground
 # or
 brew services start ollama   # start as background service
 ```
 ### 2. List available models
 ```bash
 ollama list
 ```
 ### 3. Chat with a model (interactive)
 ```bash
 ollama run llama3.1:8b
 ollama run qwen2.5-coder:32b
 ```
 ---
 ## API Endpoint Reference
 | Property            | Value                                 |
 | ------------------- | ------------------------------------- |
 | **Base URL**        | `http://localhost:11434/v1`           |
 | **API Key**         | `ollama` (any non-empty string works) |
 | **Protocol**        | OpenAI-compatible REST                |
 | **Models endpoint** | `http://localhost:11434/api/tags`     |
 | **Loaded models**   | `http://localhost:11434/api/ps`       |
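
The two native endpoints in the table can also be probed from a script. The following is a minimal sketch using only the Python standard library; the helper names (`model_names`, `fetch_json`, `report`) are illustrative and not part of this repo, and it assumes both endpoints return a top-level `models` array whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # host and port from the table above


def model_names(payload: dict) -> list:
    """Pull model names out of an /api/tags or /api/ps response body."""
    return [m["name"] for m in payload.get("models", [])]


def fetch_json(path: str, timeout: float = 5.0) -> dict:
    """GET a native Ollama endpoint such as /api/tags and decode the JSON body."""
    with urllib.request.urlopen(OLLAMA_HOST + path, timeout=timeout) as resp:
        return json.load(resp)


def report() -> None:
    """Print installed and loaded models, or a hint if the server is unreachable."""
    try:
        print("installed:", model_names(fetch_json("/api/tags")))
        print("loaded:   ", model_names(fetch_json("/api/ps")))
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        print(f"Ollama not reachable at {OLLAMA_HOST}: {exc}")
```

Calling `report()` performs the two requests; nothing touches the network at import time, so the helpers can be reused in tests.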
 ---
 ## Using in Code
 ### curl
 ```bash
 curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
 ```
 ### TypeScript / Node.js (OpenAI SDK)
 ```typescript
 import OpenAI from 'openai';
 const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
 });
 const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
 });
 console.log(res.choices[0].message.content);
 ```
 ### Python (OpenAI SDK)
 ```python
 from openai import OpenAI
 client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
 )
 response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
 )
 print(response.choices[0].message.content)
 ```
 ### Environment variable pattern (recommended)
 Instead of hardcoding the URL, read it from env vars so the same code runs against both the local server and a cloud provider:
 ```bash
 # .env.local (local dev)
 OPENAI_BASE_URL=http://localhost:11434/v1
 OPENAI_API_KEY=ollama
 LLM_MODEL=llama3.1:8b
 # .env.production
 OPENAI_BASE_URL=https://api.openai.com/v1
 OPENAI_API_KEY=sk-...
 LLM_MODEL=gpt-4o
 ```
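 ```typescript
 import OpenAI from 'openai';
 // Endpoint and key come from the environment, so the same code targets Ollama or a cloud API.
 const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
 });
 ```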
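
A Python counterpart of the same pattern might look like the sketch below. `LLMConfig` and `load_llm_config` are hypothetical names, and falling back to the local Ollama values when a variable is unset is an assumption of this example rather than something the repo prescribes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three variables from the .env files above.

    Assumption: unset variables fall back to the local Ollama defaults.
    """
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )
```

The result plugs into the SDK call shown earlier: `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)` with `model=cfg.model` on each request.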
 ---
 ## Running Extraction Service Evals Locally
 The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.
 ```bash
 cd services/extraction-service
 # Run evals with default model (llama3.1:8b)
 pnpm eval:ollama
 # Run with a different model
 OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama
 # Run unattended with logging + macOS notification on completion
 OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
 ```
 Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
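
To compare models, the suite can be run once per model from a small driver script. This is a sketch only: `eval_env` and `run_evals` are hypothetical helpers, and it assumes nothing about the eval script beyond the `pnpm eval:ollama` command and the `OLLAMA_MODEL` variable shown above.

```python
import os
import subprocess

EVAL_COMMAND = ["pnpm", "eval:ollama"]  # script name taken from the commands above


def eval_env(model, base_env=None):
    """Copy the environment and set OLLAMA_MODEL for one eval run."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_MODEL"] = model
    return env


def run_evals(models, cwd="services/extraction-service"):
    """Run the suite once per model and return {model: process exit code}."""
    results = {}
    for model in models:
        proc = subprocess.run(EVAL_COMMAND, cwd=cwd, env=eval_env(model))
        results[model] = proc.returncode
    return results
```

For example, `run_evals(["llama3.1:8b", "qwen2.5-coder:32b"])` runs the two installed models back to back from the repo root.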
 ---
 ## Pointing the Extraction Service Python Sidecar at Ollama
 By default the sidecar uses Gemini. Override with:
 ```bash
 export LANGEXTRACT_PROVIDER=openai_compat
 export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
 export LANGEXTRACT_API_KEY=ollama
 export LANGEXTRACT_MODEL=llama3.1:8b
 ```
 ---
 ## Model Management
 ```bash
 # Pull a new model
 NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b
 # See what's loaded in RAM right now
 ollama ps
 # Unload a model from RAM (free up memory)
 curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
 # Remove a model from disk
 ollama rm <model>
 ```
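
The unload call can be scripted as well. The sketch below mirrors the curl request above using only the standard library; `unload_request` and `unload` are illustrative names, not existing helpers in this repo.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def unload_request(model: str) -> urllib.request.Request:
    """Build the POST request that asks Ollama to drop a model from RAM."""
    # Same body as the curl example: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        GENERATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str, timeout: float = 10.0) -> None:
    """Send the unload request and discard the response body."""
    with urllib.request.urlopen(unload_request(model), timeout=timeout) as resp:
        resp.read()
```

Separating request construction from sending keeps the payload easy to inspect without a running server.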
 ---
 ## Choosing the Right Model
 | Task                             | Recommended Model   | Why                           |
 | -------------------------------- | ------------------- | ----------------------------- |
 | TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally     |
 | Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM          |
 | Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance   |
 | Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B |
 | Quick one-off questions          | `llama3.1:8b`       | Fastest response              |
 See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table. `deepseek-r1:32b` is not in the installed-models list above; pull it first (see Model Management).
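
If model choice needs to live in code, the table can be encoded as a lookup. This is one possible sketch: the task keys are illustrative labels invented for this example, and defaulting to `llama3.1:8b` for unknown tasks is an assumption.

```python
DEFAULT_MODEL = "llama3.1:8b"

# Task keys are illustrative labels, not identifiers used elsewhere in the repo.
MODEL_FOR_TASK = {
    "code": "qwen2.5-coder:32b",
    "fast_eval": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}


def pick_model(task: str) -> str:
    """Return the recommended model for a task, or the small default."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
```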
 ---
 ## Important: JSON Output
 Always request JSON mode explicitly — models are more reliable with it:
 ```typescript
 response_format: { type: 'json_object' },
 ```
 When parsing in promptfoo assertions, output is a **raw string** — parse it first:
 ```javascript
 // ✅ Correct
 JSON.parse(output).extractions.map(function(e){ return e.extraction_class })
 // ❌ Wrong — output is not already an object
 output.extractions.map(...)
 ```
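
The same parse-first rule applies wherever the raw model output is consumed. A Python sketch mirroring the correct JavaScript form above, assuming the `extractions` / `extraction_class` shape used in that snippet (`extraction_classes` is a hypothetical helper name):

```python
import json


def extraction_classes(output: str) -> list:
    """Parse the raw output string, then collect each extraction's class."""
    data = json.loads(output)  # output arrives as a string, not a dict
    return [e["extraction_class"] for e in data["extractions"]]
```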
 ### DeepSeek R1 — strip `<think>` blocks
 R1 models emit reasoning traces before JSON. Strip them:
 ```typescript
 const raw = res.choices[0].message.content ?? ''; // content is typed string | null
 const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
 const result = JSON.parse(json);
 ```
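
For Python callers, the same stripping step might be sketched as follows. `parse_r1_json` is an illustrative helper, and re-raising as `ValueError` with a short preview of the cleaned text is a design choice of this example to make failures easier to debug.

```python
import json
import re

# Non-greedy match across newlines, equivalent to /<think>[\s\S]*?<\/think>/g
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw):
    """Remove <think>...</think> traces, then parse what remains as JSON."""
    cleaned = THINK_BLOCK.sub("", raw or "").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        preview = cleaned[:80]
        raise ValueError(f"no valid JSON after stripping <think>: {preview!r}") from exc
```

Output without a `<think>` block passes through unchanged, so the helper is safe to use for every model.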
 ---
 ## Troubleshooting
 | Problem                                     | Fix                                                                                 |
 | ------------------------------------------- | ----------------------------------------------------------------------------------- |
 | `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                  |
 | Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                  |
 | `MLX dynamic library not available` warning | Harmless — Metal backend is used automatically                                      |
 | Slow responses                              | Check `ollama ps` — model may be loading cold from disk (first request takes 5–15s) |
 | Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"`                     |
 | JSON parse error with R1 models             | Strip `<think>...</think>` block before parsing                                     |
 See [08-troubleshooting.md](08-troubleshooting.md) for more.
 ---
 ## Further Reading
 | Doc                                                                  | Contents                                                     |
 | -------------------------------------------------------------------- | ------------------------------------------------------------ |
 | [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
 | [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
 | [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
 | [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
 | [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
 | [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |