diff --git a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
index 0cb8288f..92ccaebd 100644
--- a/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
+++ b/__LOCAL_LLMs/LOCAL_LLMs_setup_mac_m4_48gb.md
@@ -26,17 +26,18 @@ cd __LOCAL_LLMs/dashboard && npm run dev -- -p 3100
 
 All documentation is now organized in [`docs/`](docs/README.md):
 
-| #   | Document                                                          | Description                                                         |
-| --- | ----------------------------------------------------------------- | ------------------------------------------------------------------- |
-| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info              |
-| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API |
-| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg        |
-| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives      |
-| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features               |
-| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config            |
-| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                 |
-| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                         |
-| 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                  |
+| #   | Document                                                          | Description                                                          |
+| --- | ----------------------------------------------------------------- | -------------------------------------------------------------------- |
+| 00  | [Developer Guide](docs/00-developer-guide.md)                     | **Start here** — API endpoint, code examples, model selection, evals |
+| 01  | [Hardware & Prerequisites](docs/01-hardware-and-prerequisites.md) | M4 Pro specs, toolchain, disk/RAM budget, network info               |
+| 02  | [Ollama Setup & Models](docs/02-ollama-setup-and-models.md)       | Installation, server config, model management, memory behavior, API  |
+| 03  | [Whisper.cpp Setup](docs/03-whisper-cpp-setup.md)                 | Speech-to-text: installation, models, CLI, streaming, ffmpeg         |
+| 04  | [Multimodal Local Stack](docs/04-multimodal-local-stack.md)       | Vision models, audio pipeline, video status, Kimi alternatives       |
+| 05  | [Mission Control Dashboard](docs/05-mission-control-dashboard.md) | Next.js dashboard: architecture, API routes, features                |
+| 06  | [Extraction Service Evals](docs/06-extraction-service-evals.md)   | promptfoo suite, Ollama vs Gemini, Python sidecar config             |
+| 07  | [Model Recommendations](docs/07-model-recommendations.md)         | Tiered guide: coding, reasoning, vision, embeddings                  |
+| 08  | [Troubleshooting](docs/08-troubleshooting.md)                     | Corporate proxy, MLX warnings, common fixes                          |
+| 09  | [Environment Variables](docs/09-environment-variables.md)         | All config vars: Ollama, Whisper, dashboard, evals                   |
 
 ---
 
diff --git a/__LOCAL_LLMs/docs/00-developer-guide.md b/__LOCAL_LLMs/docs/00-developer-guide.md
new file mode 100644
index 00000000..c861fe7e
--- /dev/null
+++ b/__LOCAL_LLMs/docs/00-developer-guide.md
@@ -0,0 +1,263 @@
+# 00 — Developer Guide: Local LLM with Ollama
+
+> How to use the local LLM stack for development, evals, and AI-powered features — without cloud API costs or proxy issues.
+
+---
+
+## What Is This?
+
+This machine runs a local LLM server via [Ollama](https://ollama.com), exposing an **OpenAI-compatible API** at `http://localhost:11434/v1`. You can use it as a drop-in replacement for OpenAI/Gemini/Azure in any code that uses the OpenAI SDK.
+
+**Models installed:**
+
+| Model               | Size    | Best For                                  |
+| ------------------- | ------- | ----------------------------------------- |
+| `qwen2.5-coder:32b` | 18.5 GB | Code (TS, Python, Swift), structured JSON |
+| `llama3.1:8b`       | 4.7 GB  | Fast evals, general tasks                 |
+
+---
+
+## Quick Start
+
+### 1. Check Ollama is running
+
+```bash
+curl http://localhost:11434/api/tags
+```
+
+If it returns a JSON list of models — you're good. If it fails:
+
+```bash
+ollama serve                  # start in foreground
+# or
+brew services start ollama    # start as background service
+```
+
+### 2. List available models
+
+```bash
+ollama list
+```
+
+### 3. Chat with a model (interactive)
+
+```bash
+ollama run llama3.1:8b
+ollama run qwen2.5-coder:32b
+```
+
+---
+
+## API Endpoint Reference
+
+| Property            | Value                                 |
+| ------------------- | ------------------------------------- |
+| **Base URL**        | `http://localhost:11434/v1`           |
+| **API Key**         | `ollama` (any non-empty string works) |
+| **Protocol**        | OpenAI-compatible REST                |
+| **Models endpoint** | `http://localhost:11434/api/tags`     |
+| **Loaded models**   | `http://localhost:11434/api/ps`       |
+
+---
+
+## Using in Code
+
+### curl
+
+```bash
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1:8b",
+    "messages": [{"role": "user", "content": "Return JSON: {\"status\": \"ok\"}"}],
+    "response_format": {"type": "json_object"}
+  }'
+```
+
+### TypeScript / Node.js (OpenAI SDK)
+
+```typescript
+import OpenAI from 'openai';
+
+const ollama = new OpenAI({
+  baseURL: 'http://localhost:11434/v1',
+  apiKey: 'ollama',
+});
+
+const res = await ollama.chat.completions.create({
+  model: 'qwen2.5-coder:32b',
+  messages: [{ role: 'user', content: 'Extract action items from: ...' }],
+  response_format: { type: 'json_object' },
+});
+
+console.log(res.choices[0].message.content);
+```
+
+### Python (OpenAI SDK)
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:11434/v1",
+    api_key="ollama",
+)
+
+response = client.chat.completions.create(
+    model="llama3.1:8b",
+    messages=[{"role": "user", "content": "Extract action items from: ..."}],
+    response_format={"type": "json_object"},
+)
+
+print(response.choices[0].message.content)
+```
+
+### Environment variable pattern (recommended)
+
+Instead of hardcoding the URL, use env vars so code works with both local and cloud:
+
+```bash
+# .env.local (local dev)
+OPENAI_BASE_URL=http://localhost:11434/v1
+OPENAI_API_KEY=ollama
+LLM_MODEL=llama3.1:8b
+
+# .env.production
+OPENAI_BASE_URL=https://api.openai.com/v1
+OPENAI_API_KEY=sk-...
+LLM_MODEL=gpt-4o
+```
+
+```typescript
+const client = new OpenAI({
+  baseURL: process.env.OPENAI_BASE_URL,
+  apiKey: process.env.OPENAI_API_KEY,
+});
+```
+
+---
+
+## Running Extraction Service Evals Locally
+
+The extraction-service has a full 19-case promptfoo eval suite that runs against Ollama directly — no cloud API needed.
+
+```bash
+cd services/extraction-service
+
+# Run evals with default model (llama3.1:8b)
+pnpm eval:ollama
+
+# Run with a different model
+OLLAMA_MODEL=qwen2.5-coder:32b pnpm eval:ollama
+
+# Run unattended with logging + macOS notification on completion
+OLLAMA_MODEL=llama3.1:8b ./evals/run-ollama-evals-logged.sh
+```
+
+Logs are written to `evals/logs/`. See [06-extraction-service-evals.md](06-extraction-service-evals.md) for full details.
+
+---
+
+## Pointing the Extraction Service Python Sidecar at Ollama
+
+By default the sidecar uses Gemini. Override with:
+
+```bash
+export LANGEXTRACT_PROVIDER=openai_compat
+export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
+export LANGEXTRACT_API_KEY=ollama
+export LANGEXTRACT_MODEL=llama3.1:8b
+```
+
+---
+
+## Model Management
+
+```bash
+# Pull a new model
+NO_PROXY="ollama.com,registry.ollama.ai" ollama pull deepseek-r1:32b
+
+# See what's loaded in RAM right now
+ollama ps
+
+# Unload a model from RAM (free up memory)
+curl http://localhost:11434/api/generate \
+  -d '{"model": "qwen2.5-coder:32b", "prompt": "", "keep_alive": "0"}'
+
+# Remove a model from disk
+ollama rm <model>
+```
+
+---
+
+## Choosing the Right Model
+
+| Task                             | Recommended Model   | Why                           |
+| -------------------------------- | ------------------- | ----------------------------- |
+| TypeScript / Python / Swift code | `qwen2.5-coder:32b` | Best code quality locally     |
+| Fast evals / iteration           | `llama3.1:8b`       | 40–60 tok/s, low RAM          |
+| Structured JSON extraction       | `qwen2.5-coder:32b` | Excellent format compliance   |
+| Complex reasoning / triage       | `deepseek-r1:32b`   | Chain-of-thought, ~80% of 70B |
+| Quick one-off questions          | `llama3.1:8b`       | Fastest response              |
+
+See [07-model-recommendations.md](07-model-recommendations.md) for the full comparison table.
+
+---
+
+## Important: JSON Output
+
+Always request JSON mode explicitly — models are more reliable with it:
+
+```typescript
+response_format: {
+  type: 'json_object',
+}
+```
+
+When parsing in promptfoo assertions, output is a **raw string** — parse it first:
+
+```javascript
+// ✅ Correct
+JSON.parse(output).extractions.map(function(e){ return e.extraction_class })
+
+// ❌ Wrong — output is not already an object
+output.extractions.map(...)
+```
+
+### DeepSeek R1 — strip `<think>` blocks
+
+R1 models emit reasoning traces before JSON. Strip them:
+
+```typescript
+const raw = res.choices[0].message.content;
+const json = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
+const result = JSON.parse(json);
+```
+
+---
+
+## Troubleshooting
+
+| Problem                                     | Fix                                                                                 |
+| ------------------------------------------- | ----------------------------------------------------------------------------------- |
+| `connection refused` on port 11434          | Run `ollama serve` or `brew services start ollama`                                  |
+| Model pull fails / hangs                    | Use `NO_PROXY="ollama.com,registry.ollama.ai" ollama pull <model>`                  |
+| `MLX dynamic library not available` warning | Harmless — Metal backend is used automatically                                      |
+| Slow responses                              | Check `ollama ps` — model may be loading cold from disk (first request takes 5–15s) |
+| Out of memory                               | Run `ollama ps` and unload unused models with `keep_alive: "0"`                     |
+| JSON parse error with R1 models             | Strip `<think>...</think>` block before parsing                                     |
+
+See [08-troubleshooting.md](08-troubleshooting.md) for more.
+
+---
+
+## Further Reading
+
+| Doc                                                                  | Contents                                                     |
+| -------------------------------------------------------------------- | ------------------------------------------------------------ |
+| [01-hardware-and-prerequisites.md](01-hardware-and-prerequisites.md) | M4 Pro specs, disk/RAM budget                                |
+| [02-ollama-setup-and-models.md](02-ollama-setup-and-models.md)       | Installation, server config, memory management               |
+| [06-extraction-service-evals.md](06-extraction-service-evals.md)     | promptfoo eval suite, assertion patterns, latency comparison |
+| [07-model-recommendations.md](07-model-recommendations.md)           | Full model comparison table, gap analysis vs 70B             |
+| [08-troubleshooting.md](08-troubleshooting.md)                       | Common issues and fixes                                      |
+| [09-environment-variables.md](09-environment-variables.md)           | All config env vars                                          |
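
Reviewer note on the `<think>`-stripping snippet in `00-developer-guide.md`: the sketch below shows one way that step could be wrapped in a small helper. It assumes the reply comes from the OpenAI SDK, where `message.content` is typed `string | null`, so calling `.replace` on it directly may not compile under strict settings. The name `parseModelJson` is hypothetical and not part of this PR; it only combines a null check, the regex from the guide, and `JSON.parse`.

```typescript
// Hypothetical helper (an illustration, not code from this PR).
// Removes <think>...</think> reasoning traces from a model reply,
// then parses the remaining text as JSON.
function parseModelJson(raw: string | null | undefined): unknown {
  // The SDK types message.content as nullable, so guard before using it.
  if (!raw) {
    throw new Error('Model returned empty content');
  }

  // Drop every complete <think>...</think> block.
  // [\s\S]*? is non-greedy and also matches newlines inside the trace.
  const withoutThinking = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();

  // JSON.parse throws a SyntaxError if the remainder is not valid JSON.
  return JSON.parse(withoutThinking);
}
```

With the guide's example this might be called as `parseModelJson(res.choices[0].message.content)`. The return type is `unknown` on purpose, so the caller still has to validate the shape before reading fields from it.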