docs(local-llms): add developer guide — API endpoint, code examples, model selection
- New 00-developer-guide.md: start-here doc for developers covering:
  - Ollama endpoint (http://localhost:11434/v1) and API key
  - curl, TypeScript, Python code examples with env var pattern
  - Model selection table by task
  - Running extraction service evals locally
  - JSON output gotchas (parse from string, <think> strip for R1)
  - Model management commands
  - Troubleshooting quick reference
  - Links to all other docs
- Updated index in LOCAL_LLMs_setup_mac_m4_48gb.md to include doc 00
parent 5deb5efdcf
commit 4090c8aa13
@@ -27,7 +27,8 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100

All documentation is now organized in [`docs/`](docs/README.md):

| # | Document | Description |
| --- | --- | --- |
| 00 | [Developer Guide](docs/00-developer-guide.md) | **Start here**: API endpoint, code examples, model selection, evals |
| 01 | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info |
| 02 | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md) | Installation, server config, model management, memory behavior, API |
| 03 | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md) | Speech-to-text: installation, models, CLI, streaming, ffmpeg |

263
__LOCAL_LLMs/docs/00-developer-guide.md
Normal file
@@ -0,0 +1,263 @@

# 00 - Developer Guide: Local LLM with Ollama

> How to use the local LLM stack for development, evals, and AI-powered features without cloud API costs or proxy issues.

---

## What Is This?

This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. For code that uses the OpenAI SDK's chat completions API, you can use it as a drop-in replacement for OpenAI/Gemini/Azure by changing the base URL, API key, and model name. Ollama implements only part of the OpenAI API surface, so less common endpoints may not be available.

**Models installed:**

| Model               | Size    | Best For                                  |
| ------------------- | ------- | ----------------------------------------- |
| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
| `llama3.1:8b`       | 4.7 GB  | Fast evals, general tasks                 |

---

## Quick Start

### 1. Check Ollama is running

```bash
curl http://localhost:11434/api/tags
```

If it returns a JSON list of models, the server is up. If the request fails, start Ollama:

```bash
ollama serve                 # start in foreground
# or
brew services start ollama   # start as background service
```
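
The same check can be scripted, for example at the top of a test or eval runner. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint from this guide


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body.

    Assumes the shape {"models": [{"name": "..."}, ...]}.
    """
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(url: str = OLLAMA_TAGS_URL, timeout: float = 3.0) -> Optional[list[str]]:
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers connection refused, timeouts, and URLError
        return None


if __name__ == "__main__":
    names = installed_models()
    if names is None:
        print("Ollama is not reachable; try `ollama serve`")
    else:
        print("Installed models:", ", ".join(names) or "(none)")
```

Returning `None` instead of raising keeps the caller's decision simple: skip local-LLM tests, or fail fast with a hint to start the server.
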

### 2. List available models

```bash
ollama list
```

### 3. Chat with a model (interactive)

```bash
ollama run llama3.1:8b
ollama run qwen2.5-coder:32b
```

---

## API Endpoint Reference

| Property            | Value                                                                   |
| ------------------- | ----------------------------------------------------------------------- |
| **Base URL**        | `http://localhost:11434/v1`                                             |
| **API Key**         | `ollama` (any non-empty string works; the SDK requires a value)         |
| **Protocol**        | OpenAI-compatible REST                                                  |
| **Models endpoint** | `http://localhost:11434/api/tags` (native Ollama API, not under `/v1`)  |
| **Loaded models**   | `http://localhost:11434/api/ps` (native Ollama API, not under `/v1`)    |
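
To illustrate the loaded-models endpoint, here is a small sketch that turns an `/api/ps` response into a list of models and their memory footprint. It rests on an assumption about the response shape (a `models` array whose entries have `name` and a byte count in `size`); the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # endpoint from the table above


def summarize_loaded(ps_payload: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GiB) pairs, largest first.

    Assumes each entry has "name" and a byte count in "size".
    """
    rows = [
        (m["name"], round(m.get("size", 0) / 2**30, 2))
        for m in ps_payload.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_loaded(url: str = OLLAMA_PS_URL, timeout: float = 3.0) -> list[tuple[str, float]]:
    """Query the running server and summarize what is resident in memory."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_loaded(json.load(resp))
```

Calling `fetch_loaded()` before starting a large model gives a quick view of what is already occupying memory.
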

---

## Using in Code

### curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
    "response_format": {"type": "json_object"}
  }'
```

### TypeScript / Node.js (OpenAI SDK)

```typescript
import OpenAI from 'openai';

// Point the OpenAI SDK at the local Ollama server
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const res = await ollama.chat.completions.create({
  model: 'qwen2.5-coder:32b',
  // Mention JSON in the prompt: OpenAI's JSON mode rejects requests that don't
  messages: [{ role: 'user', content: 'Extract action items as JSON from: ...' }],
  response_format: { type: 'json_object' },
});

console.log(res.choices[0].message.content);
```

### Python (OpenAI SDK)

```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    # Mention JSON in the prompt: OpenAI's JSON mode rejects requests that don't
    messages=[{"role": "user", "content": "Extract action items as JSON from: ..."}],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```

### Environment variable pattern (recommended)

Instead of hardcoding the URL, read it from environment variables so the same code works against both the local server and a cloud provider:

```bash
# .env.local (local dev)
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# .env.production
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
LLM_MODEL=gpt-4o
```

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
});
const model = process.env.LLM_MODEL ?? 'llama3.1:8b'; // pass as `model` in requests
```
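
For Python callers, the same pattern might look like the sketch below. `LLMConfig` and `load_llm_config` are hypothetical names introduced for illustration, and defaulting to the local Ollama values when a variable is unset is an assumption, not something this guide prescribes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three variables from this guide.

    Assumption: fall back to the local Ollama values when a variable is unset.
    """
    return LLMConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("OPENAI_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.1:8b"),
    )
```

The resulting values can be passed straight to the SDK, for example `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)` with `cfg.model` as the request's `model`. Taking `env` as a parameter keeps the function easy to test without touching the real environment.
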

---

## Running Extraction Service Evals Locally

The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly, with no cloud API needed.

```bash
cd services/extraction-service

# Run evals with default model (llama3.1:8b)
pnpm eval:ollama

# Run with a different model
OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama

# Run unattended with logging + macOS notification on completion
OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
```

Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
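
To compare several models in one go, the commands above can be wrapped in a loop. This is a sketch only: it assumes it is run from the repository root, that `pnpm eval:ollama` exits non-zero on failure, and the function names are hypothetical.

```python
import os
import subprocess
from typing import Iterable, Mapping

EVAL_CMD = ["pnpm", "eval:ollama"]           # script name from this guide
SERVICE_DIR = "services/extraction-service"  # path from this guide


def eval_env(model: str, base_env: Mapping[str, str]) -> dict[str, str]:
    """Copy the environment and select the model via OLLAMA_MODEL."""
    env = dict(base_env)
    env["OLLAMA_MODEL"] = model
    return env


def run_evals(models: Iterable[str], cwd: str = SERVICE_DIR) -> dict[str, int]:
    """Run the eval suite once per model and return each exit code."""
    results: dict[str, int] = {}
    for model in models:
        proc = subprocess.run(EVAL_CMD, cwd=cwd, env=eval_env(model, os.environ))
        results[model] = proc.returncode
    return results
```

For example, `run_evals(["llama3.1:8b", "qwen2.5-coder:32b"])` runs the suite twice, sequentially, so only one model needs to be resident at a time.
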

---

## Pointing the Extraction Service Python Sidecar at Ollama

By default the sidecar uses Gemini. To route it to the local Ollama server instead, set:

```bash
export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
```

---

## Model Management

```bash
# Pull a new model
NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b

# See what's loaded in RAM right now
ollama ps

# Unload a model from RAM (free up memory)
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'

# Remove a model from disk
ollama rm <model>
```
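
The unload call can also be issued from a script, for instance between eval runs with different models. A minimal sketch with the standard library follows; it sends the same payload as the curl command above, and the function names are hypothetical.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # endpoint from this guide


def unload_request(model: str, url: str = GENERATE_URL) -> urllib.request.Request:
    """Build the POST request that asks Ollama to drop a model from memory."""
    body = json.dumps({"model": model, "prompt": "", "keep_alive": "0"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload(model: str, timeout: float = 10.0) -> int:
    """Send the unload request and return the HTTP status code."""
    with urllib.request.urlopen(unload_request(model), timeout=timeout) as resp:
        return resp.status
```

Building the request separately from sending it makes the payload easy to inspect or log before anything touches the server.
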

---

## Choosing the Right Model

| Task                             | Recommended Model   | Why                           |
| -------------------------------- | ------------------- | ----------------------------- |
| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally     |
| Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM          |
| Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance   |
| Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B |
| Quick one-off questions          | `llama3.1:8b`       | Fastest response              |

`deepseek-r1:32b` is not in the installed-models list at the top of this guide; pull it first with the command under Model Management.

See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.
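
If code needs to pick a model per task, the table can be encoded as a lookup. The sketch below is illustrative: the task keys are hypothetical labels, and letting `LLM_MODEL` override the table is an assumption that mirrors the environment variable pattern earlier in this guide.

```python
import os
from typing import Mapping

# Task keys are hypothetical labels; model names come from the table above.
MODEL_BY_TASK = {
    "code": "qwen2.5-coder:32b",
    "fast_eval": "llama3.1:8b",
    "json_extraction": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "quick_question": "llama3.1:8b",
}
DEFAULT_MODEL = "llama3.1:8b"


def pick_model(task: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the model for a task.

    LLM_MODEL, if set and non-empty, overrides the table; unknown tasks
    fall back to DEFAULT_MODEL.
    """
    return env.get("LLM_MODEL") or MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

The override keeps production deployments, where `LLM_MODEL` names a cloud model, from accidentally requesting a local-only model name.
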

---

## Important: JSON Output

Always request JSON mode explicitly; models follow the format more reliably with it:

```typescript
response_format: { type: 'json_object' },
```

In promptfoo assertions, `output` is a **raw string**, so parse it first:

```javascript
// ✅ Correct
JSON.parse(output).extractions.map(function (e) { return e.extraction_class; });

// ❌ Wrong: output is a string, not an object
output.extractions.map(...)
```
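
The same parse-first rule applies to any consumer of the model output. Below is a hedged Python sketch of a defensive parser; the `extractions` and `extraction_class` keys come from the assertion above, while the function name and error messages are made up for illustration.

```python
import json


def extraction_classes(output: str) -> list[str]:
    """Parse raw model output and return each extraction's class.

    Raises ValueError with a readable message when the output is not the
    expected JSON object, instead of failing deep inside the caller.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("extractions"), list):
        raise ValueError("expected a JSON object with an 'extractions' list")
    return [item["extraction_class"] for item in data["extractions"]]
```

Even in JSON mode a model can return an object with the wrong keys, so checking the shape before indexing gives clearer failures in eval logs.
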

### DeepSeek R1: strip `<think>` blocks

R1 models emit a reasoning trace wrapped in `<think>...</think>` before the JSON. Strip it before parsing:

```typescript
// content is typed `string | null` in the OpenAI SDK, so default to ''
const raw = res.choices[0].message.content ?? '';
const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
const result = JSON.parse(json);
```
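
For the Python sidecar or Python-based tooling, an equivalent of the snippet above could look like this. It is a sketch with a hypothetical function name, and it assumes the reasoning trace is always wrapped in a closed `<think>...</think>` pair.

```python
import json
import re

# Non-greedy match across newlines, mirroring the TypeScript regex above.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_r1_json(raw: str):
    """Remove <think>...</think> blocks, then parse the remainder as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)
```

Output without any `<think>` block passes through unchanged, so the helper is safe to apply to every model, not only R1.
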

---

## Troubleshooting

| Problem                                     | Fix                                                                                    |
| ------------------------------------------- | -------------------------------------------------------------------------------------- |
| `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                     |
| Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                     |
| `MLX dynamic library not available` warning | Harmless; the Metal backend is used automatically                                      |
| Slow responses                              | Check `ollama ps`; the model may be loading cold from disk (first request takes 5–15s) |
| Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"` (see Model Management) |
| JSON parse error with R1 models             | Strip the `<think>...</think>` block before parsing                                    |

See [08-troubleshooting.md](08-troubleshooting.md) for more.

---

## Further Reading

| Doc                                                                  | Contents                                                     |
| -------------------------------------------------------------------- | ------------------------------------------------------------ |
| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
| [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
| [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
| [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
| [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |