docs(local-llms): add developer guide — API endpoint, code examples, model selection
- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00


parent 5deb5efdcf
commit 4090c8aa13


@@ -26,17 +26,18 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100

All documentation is now organized in [`docs/`](docs/README.md):

| # | Document | Description |
| --- | ----------------------------------------------------------------- | -------------------------------------------------------------------- |
| 00 | [Developer Guide](docs/00-developer-guide.md) | **Start here** — API endpoint, code examples, model selection, evals |
| 01 | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info |
| 02 | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior, API |
| 03 | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI, streaming, ffmpeg |
| 04 | [Multimodal Local Stack](docs/04-multimodal-local-stack.md) | Vision models, audio pipeline, video status, Kimi alternatives |
| 05 | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features |
| 06 | [Extraction Service Evals](docs/06-extraction-service-evals.md) | promptfoo suite, Ollama vs Gemini, Python sidecar config |
| 07 | [Model Recommendations](docs/07-model-recommendations.md) | Tiered guide: coding, reasoning, vision, embeddings |
| 08 | [Troubleshooting](docs/08-troubleshooting.md) | Corporate proxy, MLX warnings, common fixes |
| 09 | [Environment Variables](docs/09-environment-variables.md) | All config vars: Ollama, Whisper, dashboard, evals |

---


__LOCAL_LLMs/docs/00-developer-guide.md (normal file, 263 lines added)

@@ -0,0 +1,263 @@

# 00 — Developer Guide: Local LLM with Ollama

> How to use the local LLM stack for development, evals, and AI-powered features without cloud API costs or proxy issues.

---

## What Is This?

This machine runs a local LLM server via [Ollama](https://ollama.com), which exposes an **OpenAI-compatible API** at `http://localhost:11434/v1`. Code that uses the OpenAI SDK can target it in place of OpenAI/Gemini/Azure by changing only the base URL, API key, and model name. Ollama implements only part of the OpenAI API, so features beyond the chat completions used in this guide may behave differently.

**Models installed:**

| Model | Size | Best For |
| ------------------- | ------- | ----------------------------------------- |
| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
| `llama3.1:8b` | 4.7 GB | Fast evals, general tasks |

---

## Quick Start

### 1. Check Ollama is running

```bash
curl http://localhost:11434/api/tags
```

If this returns a JSON list of models, the server is up. If the connection fails, start it:

```bash
ollama serve                 # start in foreground
# or
brew services start ollama   # start as background service
```

### 2. List available models

```bash
ollama list
```

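
The same list is also available over HTTP from the `/api/tags` endpoint used in step 1. A minimal Python sketch, using only the standard library; it assumes the response body has a top-level `models` array whose entries carry a `name` field, and the function names are illustrative rather than part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint from step 1


def model_names(tags_payload: dict) -> list[str]:
    """Return the model names found in an /api/tags response body."""
    # Assumption: the body looks like {"models": [{"name": "..."}, ...]}.
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_model_names(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> list[str]:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

With the server running, `fetch_model_names()` should return the same names that `ollama list` prints.
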
### 3. Chat with a model (interactive)

```bash
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
```

---

## API Endpoint Reference

| Property | Value |
| ------------------- | ------------------------------------- |
| **Base URL** | `http://localhost:11434/v1` |
| **API Key** | `ollama` (required by the SDK but ignored by Ollama; any non-empty string works) |
| **Protocol** | OpenAI-compatible REST |
| **Models endpoint** | `http://localhost:11434/api/tags` |
| **Loaded models** | `http://localhost:11434/api/ps` |

---

## Using in Code

### curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### TypeScript / Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

// Point the OpenAI SDK at the local Ollama server.
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by the SDK; Ollama ignores the value
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);
```

### Python (OpenAI SDK)

```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK; Ollama ignores the value
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```

### Environment variable pattern (recommended)

Instead of hardcoding the URL, read it from environment variables so the same code runs against both the local server and a cloud provider:

```bash
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
```

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
const model = process.env.LLM_MODEL; // pass as `model` in each request
```

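
The same pattern carries over to Python. The sketch below is one possible shape, not the project's actual config code: `LLMSettings` and `load_llm_settings` are hypothetical names, and the fallback values are an assumption that defaults to the local server when a variable is unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read the three variables defined in the .env files above.

    Assumption of this sketch: unset variables fall back to the local
    Ollama server so a machine without a .env file still works.
    """
    return LLMSettings(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )


# Usage with the OpenAI SDK (not imported here, to keep the sketch stdlib-only):
#   settings = load_llm_settings()
#   client = OpenAI(base_url=settings.base_url, api_key=settings.api_key)
#   client.chat.completions.create(model=settings.model, messages=[...])
```
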

---

## Running Extraction Service Evals Locally

The extraction-service has a 19-case promptfoo eval suite that runs directly against Ollama, so no cloud API is needed.

```bash
cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
```

Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.

---

## Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. To route it to the local Ollama server instead, set these variables before starting it:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

---

## Model Management

```bash
# Pull a new model (NO_PROXY keeps the registry traffic off the corporate proxy)
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
```

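
The unload call can also be issued from Python. A minimal standard-library sketch that sends the same JSON body as the curl command above; `unload_request` and `unload_model` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call


def unload_request(model: str, url: str = GENERATE_URL) -> urllib.request.Request:
    """Build the POST that asks Ollama to drop `model` from RAM."""
    # Mirrors the curl body above: empty prompt, keep_alive of "0".
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str, timeout: float = 30.0) -> int:
    """Send the request to a running server and return the HTTP status code."""
    with urllib.request.urlopen(unload_request(model), timeout=timeout) as resp:
        return resp.status
```
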

---

## Choosing the Right Model

| Task | Recommended Model | Why |
| -------------------------------- | ------------------- | ----------------------------- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally |
| Fast evals / iteration | `llama3.1:8b` | 40–60 tok/s, low RAM |
| Structured JSON extraction | `qwen2.5-coder:32b` | Excellent format compliance |
| Complex reasoning / triage | `deepseek-r1:32b` | Chain-of-thought, ~80% of 70B (not in the installed list above; pull it first) |
| Quick one-off questions | `llama3.1:8b` | Fastest response |

See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.

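
If the table is needed in code, it can be kept as a plain lookup. This is only a sketch: the task keys and the `pick_model` helper are hypothetical names chosen for illustration, and the fallback to `llama3.1:8b` is an assumption, not a project rule.

```python
# Task keys are hypothetical labels for this sketch; the model tags come from the table.
MODEL_BY_TASK = {
    "code": "qwen2.5-coder:32b",
    "fast_eval": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}

DEFAULT_MODEL = "llama3.1:8b"  # assumption: fall back to the small, fast model


def pick_model(task: str) -> str:
    """Return the recommended model tag for a task, or the default if unknown."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```
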

---

## Important: JSON Output

Always request JSON mode explicitly by passing `response_format` in the request; models follow the format more reliably with it:

```typescript
// Option passed to chat.completions.create(...)
response_format: { type: 'json_object' },
```

In promptfoo assertions, `output` is a **raw string**, so parse it before accessing fields:

```javascript
// ✅ Correct: parse the string first
JSON.parse(output).extractions.map(function (e) { return e.extraction_class; });

// ❌ Wrong: output is still a string, so output.extractions is undefined
output.extractions.map(...)
```

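
The parse-first rule can be wrapped in a small helper. A minimal Python sketch, assuming the model output has the shape used in the JavaScript assertion above (an `extractions` list whose items carry an `extraction_class` field); the function name is hypothetical.

```python
import json


def extraction_classes(output: str) -> list[str]:
    """Parse a raw JSON string and return each item's extraction_class."""
    data = json.loads(output)  # output is a string, so decode it first
    return [item["extraction_class"] for item in data.get("extractions", [])]
```

Passing an already-parsed dict instead of a string raises `TypeError` from `json.loads`, which makes the string-versus-object mix-up visible immediately.
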
### DeepSeek R1 — strip `<think>` blocks

R1 models emit a reasoning trace wrapped in `<think>...</think>` before the JSON. Remove it before parsing:

```typescript
const raw = res.choices[0].message.content ?? ''; // content may be null
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);
```

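
For Python callers, the same cleanup might look like the following sketch. `parse_r1_json` is a hypothetical helper name; it assumes the text left after the `<think>` block is a single JSON object.

```python
import json
import re

# Non-greedy match across newlines, equivalent to the TypeScript regex above.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(content: str) -> dict:
    """Drop <think>...</think> traces, then decode the remaining JSON."""
    cleaned = THINK_BLOCK.sub("", content).strip()
    return json.loads(cleaned)
```
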

---

## Troubleshooting

| Problem | Fix |
| ------------------------------------------- | ----------------------------------------------------------------------------------- |
| `connection refused` on port 11434 | Run `ollama serve` or `brew services start ollama` |
| Model pull fails / hangs | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>` |
| `MLX dynamic library not available` warning | Harmless; the Metal backend is used automatically |
| Slow responses | Check `ollama ps`; the model may be loading cold from disk (first request takes 5–15 s) |
| Out of memory | Run `ollama ps` and unload unused models with `keep_alive: "0"` |
| JSON parse error with R1 models | Strip the `<think>...</think>` block before parsing |

See [08-troubleshooting.md](08-troubleshooting.md) for more.

---

## Further Reading

| Doc | Contents |
| -------------------------------------------------------------------- | ------------------------------------------------------------ |
| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget |
| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md) | Installation, server config, memory management |
| [06-extraction-service-evals.md](06-extraction-service-evals.md) | promptfoo eval suite, assertion patterns, latency comparison |
| [07-model-recommendations.md](07-model-recommendations.md) | Full model comparison table, gap analysis vs 70B |
| [08-troubleshooting.md](08-troubleshooting.md) | Common issues and fixes |
| [09-environment-variables.md](09-environment-variables.md) | All config env vars |