saravanakumardb1 80f794dee7 docs(local-llm): add Ollama setup, extraction evals, and env vars reference

- docs/02-ollama-setup-and-models.md: installation, server config, memory management,
  idle timeout, manual load/unload, OpenAI-compatible API, native API reference,
  performance tuning flags (flash attention, KV cache)
- docs/06-extraction-service-evals.md: promptfoo eval suite against Ollama, 19 cases
  across 5 tasks, assertion patterns for JSON string output, Python sidecar config
- docs/09-environment-variables.md: comprehensive var reference for Ollama server,
  evals, Python sidecar, dashboard, whisper CLI flags, proxy/network settings

2026-02-19 13:01:05 -08:00


09 — Environment Variables Reference

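Configuration variables for the Ollama server, extraction-service evals, the Python sidecar, the Mission Control dashboard, and proxy settings, plus the Whisper.cpp CLI flags and key paths.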

Ollama Server

Variable	Default	Purpose
`OLLAMA_HOST`	`http://127.0.0.1:11434`	Server bind address
`OLLAMA_MODELS`	`~/.ollama/models`	Model storage path
`OLLAMA_KEEP_ALIVE`	`5m`	How long to keep model loaded after last request
`OLLAMA_FLASH_ATTENTION`	`false`	Enable flash attention (faster, less RAM)
`OLLAMA_KV_CACHE_TYPE`	`f16`	KV cache quantization (`q8_0` uses roughly half the KV cache memory of `f16`; takes effect only when flash attention is enabled)
`OLLAMA_NUM_PARALLEL`	`1`	Number of concurrent requests
`OLLAMA_MAX_LOADED_MODELS`	`1`	Max models loaded in RAM simultaneously
`OLLAMA_GPU_OVERHEAD`	(auto)	Reserved GPU memory (bytes)
`OLLAMA_ORIGINS`	(local origins only)	Additional allowed CORS origins, comma-separated (`*` allows any origin)
`OLLAMA_DEBUG`	`false`	Enable debug logging
`HTTP_PROXY`	(system)	HTTP proxy for model downloads
`HTTPS_PROXY`	(system)	HTTPS proxy for model downloads
`NO_PROXY`	(none)	Hosts to bypass proxy

Performance Tuning Combo

OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_NUM_PARALLEL=2 \
OLLAMA_KEEP_ALIVE=10m \
ollama serve
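Ollama's documentation describes KV cache quantization as taking effect only when flash attention is enabled, so the first two settings in the combo belong together. The guard below is a minimal sketch of a pre-flight check; `check_ollama_tuning` is a hypothetical helper name (not part of Ollama), and the `f16` fallback assumes that is the server's unquantized default.

```bash
# Hypothetical pre-flight check: flag a quantized KV cache that is
# requested without flash attention being switched on.
check_ollama_tuning() {
  _kv="${OLLAMA_KV_CACHE_TYPE:-f16}"
  case "${OLLAMA_FLASH_ATTENTION:-0}" in
    1|t|T|true|TRUE|True) _fa=1 ;;
    *) _fa=0 ;;
  esac
  if [ "$_kv" != "f16" ] && [ "$_fa" -eq 0 ]; then
    echo "warning: OLLAMA_KV_CACHE_TYPE=$_kv set without OLLAMA_FLASH_ATTENTION=1" >&2
    return 1
  fi
  echo "ok: flash_attention=$_fa kv_cache=$_kv"
}
```

One way to use it is `check_ollama_tuning && ollama serve`, after exporting the variables in the current shell.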

Extraction Service Evals (promptfoo)

Variable	Default	Purpose
`OLLAMA_MODEL`	`llama3.1:8b`	Model used by `pnpm eval:ollama`
`OLLAMA_BASE_URL`	`http://localhost:11434/v1`	OpenAI-compat endpoint for promptfoo
`EXTRACTION_EVAL_TOKEN`	(none)	Auth token for extraction-service evals

Usage

# Run evals with a different model
OLLAMA_MODEL=qwen2.5:7b pnpm eval:ollama

# Compare Gemini vs Ollama
EXTRACTION_EVAL_TOKEN=your-token pnpm eval:compare
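To run the same eval suite against several models in one go, the `OLLAMA_MODEL` override can be wrapped in a loop. This is a sketch only: `run_evals` and the `DRY_RUN` switch are hypothetical names introduced here, and the function only prints the commands unless `DRY_RUN=0` is set.

```bash
# Hypothetical helper: run (or, by default, just print) one eval per model.
run_evals() {
  for _model in "$@"; do
    echo "==> OLLAMA_MODEL=$_model pnpm eval:ollama"
    if [ "${DRY_RUN:-1}" = "0" ]; then
      # Stop at the first failing model so the failure is easy to spot.
      OLLAMA_MODEL="$_model" pnpm eval:ollama || return 1
    fi
  done
}
```

For example, `run_evals llama3.1:8b qwen2.5:7b` lists both commands; prefixing `DRY_RUN=0` executes them, assuming both models have already been pulled.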

Python Sidecar (LangExtract)

Variable	Default	Purpose
`LANGEXTRACT_PROVIDER`	`gemini`	Switch to `openai_compat` for Ollama
`LANGEXTRACT_BASE_URL`	(Gemini)	Set to `http://localhost:11434/v1` for Ollama
`LANGEXTRACT_API_KEY`	(Gemini key)	Set to `ollama` for local
`LANGEXTRACT_MODEL`	(Gemini model)	Set to `llama3.1:8b` or preferred model

Switch to Ollama

export LANGEXTRACT_PROVIDER=openai_compat
export LANGEXTRACT_BASE_URL=http://localhost:11434/v1
export LANGEXTRACT_API_KEY=ollama
export LANGEXTRACT_MODEL=llama3.1:8b
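The exports above change the current shell for every later command. If the Gemini defaults should stay in place, a per-command override is one alternative. The wrapper below is a sketch; `with_ollama` is a hypothetical name, and it assumes the sidecar reads these four variables at startup.

```bash
# Hypothetical wrapper: run a single command with the Ollama settings
# applied, leaving the surrounding shell environment untouched.
with_ollama() {
  env \
    LANGEXTRACT_PROVIDER=openai_compat \
    LANGEXTRACT_BASE_URL=http://localhost:11434/v1 \
    LANGEXTRACT_API_KEY=ollama \
    LANGEXTRACT_MODEL=llama3.1:8b \
    "$@"
}
```

Call it as `with_ollama` followed by whatever command starts the sidecar in your setup.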

Mission Control Dashboard

Variable	Default	Purpose
`OLLAMA_URL`	`http://localhost:11434`	Ollama server URL (used by API routes)
`PORT`	`3100`	Dashboard dev server port

Start with Custom Ollama URL

OLLAMA_URL=http://192.168.1.100:11434 npm run dev -- -p 3100
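Before pointing the dashboard at a remote server, it can help to confirm that the URL resolves to a live Ollama API. The helper below is a small sketch that builds native API URLs from `OLLAMA_URL` with the same default as the table; `ollama_api_url` is a hypothetical name, while `/api/tags` and `/api/version` are endpoints of Ollama's native API.

```bash
# Hypothetical helper: build a native Ollama API URL from OLLAMA_URL.
ollama_api_url() {
  _base="${OLLAMA_URL:-http://localhost:11434}"
  # Strip one trailing slash so the joined path does not contain "//".
  _base="${_base%/}"
  echo "$_base/api/${1:-tags}"
}

# Example (needs a running server):
#   curl -fsS "$(ollama_api_url tags)"
```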

Whisper.cpp

Whisper.cpp uses CLI flags rather than environment variables:

Flag	Purpose	Example
`--model`	Path to GGML model file	`--model ~/whisper-models/ggml-large-v3-turbo.bin`
`--language`	Input language	`--language en`
`--file`	Audio file path	`--file recording.wav`
`--output-json`	Output in JSON format	`--output-json`
`--output-srt`	Output as SRT subtitles	`--output-srt`
`--output-vtt`	Output as VTT subtitles	`--output-vtt`
`--translate`	Translate to English	`--translate`
`--threads`	Number of CPU threads	`--threads 8`
`--processors`	Number of processors	`--processors 1`
`--print-colors`	Colorize output by confidence	`--print-colors`
`--no-timestamps`	Omit timestamps	`--no-timestamps`
`--port`	Server port (whisper-server)	`--port 8080`
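The flags in the table combine into a single invocation. The wrapper below is a sketch: `transcribe`, `WHISPER_BIN`, and `WHISPER_MODEL` are hypothetical names read only by this function (Whisper.cpp itself does not read them), and the model path is the example path from the table.

```bash
# Hypothetical wrapper around whisper-cli using flags from the table above.
# $1 = audio file, $2 = language code (defaults to "en").
transcribe() {
  "${WHISPER_BIN:-whisper-cli}" \
    --model "${WHISPER_MODEL:-$HOME/whisper-models/ggml-large-v3-turbo.bin}" \
    --file "$1" \
    --language "${2:-en}" \
    --threads 8 \
    --output-json
}
```

With the binary on `PATH`, `transcribe recording.wav` would request JSON output for an English recording.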

Proxy / Network (Corporate)

Variable	Value on This Machine	Purpose
`HTTP_PROXY`	`http://cso.proxy.att.com:8080/`	Corporate HTTP proxy
`HTTPS_PROXY`	`http://cso.proxy.att.com:8080/`	Corporate HTTPS proxy
`NODE_TLS_REJECT_UNAUTHORIZED`	`0`	Disables TLS certificate verification in Node.js to get past Forcepoint SSL interception
`NO_PROXY`	(not set by default)	Add `ollama.com,registry.ollama.ai` if pulls fail
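Setting `NO_PROXY` by hand tends to overwrite whatever was there before. The helper below is a minimal sketch that appends hosts without duplicating entries; `add_no_proxy` is a hypothetical name, and the hosts are the ones suggested in the table.

```bash
# Hypothetical helper: append hosts to NO_PROXY, skipping ones already listed.
add_no_proxy() {
  for _host in "$@"; do
    case ",${NO_PROXY:-}," in
      *",$_host,"*) ;;  # already present, nothing to do
      *) NO_PROXY="${NO_PROXY:+$NO_PROXY,}$_host" ;;
    esac
  done
  export NO_PROXY
}

add_no_proxy ollama.com registry.ollama.ai
```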

All Paths

Path	Content
`~/.ollama/models/`	Downloaded Ollama models
`~/whisper-models/`	Whisper GGML model files
`/opt/homebrew/bin/ollama`	Ollama binary
`/opt/homebrew/bin/whisper-cli`	Whisper CLI binary
`/opt/homebrew/bin/ffmpeg`	FFmpeg binary
`__LOCAL_LLMs/dashboard/`	Mission Control Next.js app
`__LOCAL_LLMs/docs/`	This documentation
`services/extraction-service/evals/`	Promptfoo eval configs
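A quick way to confirm that a machine matches this table is to test each path for existence. The check below is a sketch; `check_paths` is a hypothetical helper, and only the absolute and home-relative paths are listed because the repository-relative ones depend on the current working directory.

```bash
# Hypothetical helper: report which of the given paths exist.
# Returns 0 when all exist, 1 when at least one is missing.
check_paths() {
  _missing=0
  for _p in "$@"; do
    if [ -e "$_p" ]; then
      echo "ok       $_p"
    else
      echo "missing  $_p"
      _missing=1
    fi
  done
  return "$_missing"
}

check_paths \
  "$HOME/.ollama/models" \
  "$HOME/whisper-models" \
  /opt/homebrew/bin/ollama \
  /opt/homebrew/bin/whisper-cli \
  /opt/homebrew/bin/ffmpeg \
  || echo "one or more paths are missing"
```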
