docs(local-llms): add developer guide — API endpoint, code examples, model selection

- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00
saravanakumardb1 2026-02-19 18:43:06 -08:00
parent 5deb5efdcf
commit 4090c8aa13
2 changed files with 275 additions and 11 deletions


@@ -26,17 +26,18 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
All documentation is now organized in [`docs/`](docs/README.md):
| #   | Document                                                          | Description                                                          |
| --- | ----------------------------------------------------------------- | -------------------------------------------------------------------- |
| 00  | [Developer Guide](docs/00-developer-guide.md)                     | **Start here**: API endpoint, code examples, model selection, evals  |
| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info               |
| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API  |
| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg         |
| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives       |
| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features                |
| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config             |
| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                  |
| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                          |
| 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                   |
---


@@ -0,0 +1,263 @@
# 00 — Developer Guide: Local LLM with Ollama
> How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.
---
## What Is This?
This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.
**Models installed:**
| Model | Size | Best For |
| ------------------- | ------- | ----------------------------------------- |
| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
| `llama3.1:8b` | 4.7 GB | Fast evals, general tasks |
---
## Quick Start
### 1. Check Ollama is running
```bash
curl http://localhost:11434/api/tags
```
If it returns a JSON list of models — you're good. If it fails:
```bash
ollama serve # start in foreground
# or
brew services start ollama # start as background service
```
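The same check can be scripted, for example as a pre-flight step before an eval run. The following is a minimal sketch rather than part of the stack itself: `model_names` and `installed_models` are hypothetical helper names, and the sketch assumes the `/api/tags` response carries a top-level `models` list whose entries have a `name` field.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

TAGS_URL = "http://localhost:11434/api/tags"  # models endpoint from this guide


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/tags response body.

    Assumes the shape {"models": [{"name": "..."}, ...]}.
    """
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(url: str = TAGS_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None
```

Calling `installed_models()` returns `None` when nothing is listening on port 11434, which is the cue to start the server with one of the commands above.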
### 2. List available models
```bash
ollama list
```
### 3. Chat with a model (interactive)
```bash
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
```
---
## API Endpoint Reference
| Property | Value |
| ------------------- | ------------------------------------- |
| **Base URL** | `http://localhost:11434/v1` |
| **API Key** | `ollama` (any non-empty string works) |
| **Protocol** | OpenAI-compatible REST |
| **Models endpoint** | `http://localhost:11434/api/tags` |
| **Loaded models** | `http://localhost:11434/api/ps` |
---
## Using in Code
### curl
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
```
### TypeScript / Node.js (OpenAI SDK)
```typescript
import OpenAI from 'openai';
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});
const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});
console.log(res.choices[0].message.content);
```
### Python (OpenAI SDK)
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
```
### Environment variable pattern (recommended)
Instead of hardcoding the URL, use env vars so code works with both local and cloud:
```bash
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b
# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
```
```typescript
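import OpenAI from 'openai';

// Both values come from the active .env file, so the same code targets Ollama or the cloud.
const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});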
```
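A Python counterpart to the pattern above might look like the sketch below. `LLMConfig` and `load_llm_config` are hypothetical names, and falling back to the local Ollama values when a variable is unset is an assumption of this sketch, not documented behavior. It also assumes the variables are already exported into the process environment, since Python does not read `.env` files on its own.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three variables used in the .env files above.

    Unset variables fall back to the local Ollama values (an assumption
    made for this sketch).
    """
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )
```

The result can then be passed to the SDK as `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)`, with `cfg.model` used as the `model` argument of each request. Taking the environment as a parameter keeps the function easy to test with a plain dict.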
---
## Running Extraction Service Evals Locally
The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.
```bash
cd services/extraction-service
# Run evals with default model (llama3.1:8b)
pnpm eval:ollama
# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama
# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
```
Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
---
## Pointing the Extraction Service Python Sidecar at Ollama
By default the sidecar uses Gemini. Override with:
```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```
---
## Model Management
```bash
# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b
# See what's loaded in RAM right now
ollama ps
# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
# Remove a model from disk
ollama rm <model>
```
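If the unload step needs to run from a script (say, between two eval runs that use different models), the curl call above can be mirrored with the standard library. This is a sketch only: `build_unload_request` is a hypothetical helper, and it sends the same JSON body as the curl example.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call


def build_unload_request(model: str, url: str = GENERATE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Ollama to drop `model` from RAM."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)`. Running `ollama ps` afterwards should confirm that the model is no longer loaded.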
---
## Choosing the Right Model
| Task | Recommended Model | Why |
| -------------------------------- | ------------------- | ----------------------------- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally |
| Fast evals / iteration           | `llama3.1:8b`       | 40-60 tok/s, low RAM          |
| Structured JSON extraction | `qwen2.5-coder:32b` | Excellent format compliance |
| Complex reasoning / triage | `deepseek-r1:32b` | Chain-of-thought, ~80% of 70B |
| Quick one-off questions | `llama3.1:8b` | Fastest response |
See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.
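The table can also be encoded as a small lookup so that scripts pick a model by task. The sketch below is illustrative only: the task keys and the `pick_model` helper are hypothetical, and the fallback to `llama3.1:8b` is an assumption. The fallback matters because `deepseek-r1:32b` does not appear in the installed-models table at the top of this guide and may need to be pulled first.

```python
from typing import Optional, Set

DEFAULT_MODEL = "llama3.1:8b"

# Task keys are illustrative; the model choices mirror the table above.
MODEL_BY_TASK = {
    "code": "qwen2.5-coder:32b",
    "evals": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}


def pick_model(task: str, installed: Optional[Set[str]] = None) -> str:
    """Return the recommended model for `task`.

    Falls back to DEFAULT_MODEL for unknown tasks, or when `installed` is
    given and the recommended model is not in it.
    """
    model = MODEL_BY_TASK.get(task, DEFAULT_MODEL)
    if installed is not None and model not in installed:
        return DEFAULT_MODEL
    return model
```

For example, `pick_model("reasoning", installed={"llama3.1:8b", "qwen2.5-coder:32b"})` falls back to `llama3.1:8b` until `deepseek-r1:32b` has been pulled.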
---
## Important: JSON Output
Always request JSON mode explicitly — models are more reliable with it:
```typescript
// Pass this alongside `model` and `messages` in chat.completions.create()
response_format: { type: 'json_object' },
```
When parsing in promptfoo assertions, output is a **raw string** — parse it first:
```javascript
// ✅ Correct
JSON.parse(output).extractions.map(function(e){ return e.extraction_class })
// ❌ Wrong — output is not already an object
output.extractions.map(...)
```
### DeepSeek R1 — strip `<think>` blocks
R1 models emit reasoning traces before JSON. Strip them:
```typescript
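// `content` is typed as string | null in the OpenAI SDK, so default to an empty string
const raw = res.choices[0].message.content ?? '';
// Drop every <think>...</think> block, then parse what remains
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);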
```
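For Python callers, the same stripping step might be sketched as follows. `parse_model_json` is a hypothetical helper name, and the sketch assumes, as the TypeScript version does, that each reasoning trace is wrapped in a closed `<think>...</think>` pair.

```python
import json
import re

# Non-greedy match so several separate <think> blocks are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_model_json(raw: str):
    """Remove <think>...</think> traces, then parse the remainder as JSON.

    Raises json.JSONDecodeError if the remaining text is not valid JSON.
    """
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)
```

Output from models that emit no reasoning trace passes through unchanged, so the helper can wrap every response regardless of which model produced it.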
---
## Troubleshooting
| Problem | Fix |
| ------------------------------------------- | ----------------------------------------------------------------------------------- |
| `connection refused` on port 11434 | Run `ollama serve` or `brew services start ollama` |
| Model pull fails / hangs | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>` |
| `MLX dynamic library not available` warning | Harmless — Metal backend is used automatically |
| Slow responses                              | Check `ollama ps`; the model may be loading cold from disk (first request takes 5-15 s) |
| Out of memory | Run `ollama ps` and unload unused models with `keep_alive: "0"` |
| JSON parse error with R1 models | Strip `<think>...</think>` block before parsing |
See [08-troubleshooting.md](08-troubleshooting.md) for more.
---
## Further Reading
| Doc | Contents |
| -------------------------------------------------------------------- | ------------------------------------------------------------ |
| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget |
| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md) | Installation, server config, memory management |
| [06-extraction-service-evals.md](06-extraction-service-evals.md) | promptfoo eval suite, assertion patterns, latency comparison |
| [07-model-recommendations.md](07-model-recommendations.md) | Full model comparison table, gap analysis vs 70B |
| [08-troubleshooting.md](08-troubleshooting.md) | Common issues and fixes |
| [09-environment-variables.md](09-environment-variables.md) | All config env vars |