saravanakumardb1 b1d2e4ec81 fix(local-llms): cross-platform audit — 8 bugs/gaps fixed

- setup-tts.sh: make fully cross-platform (macOS + Linux/WSL2)
  - OS detection, apt fallback, CUDA PyTorch install, nvidia-smi check
  - cross-platform playback hints, HF_MIRROR env override
- api/system/route.ts: fix ffmpeg detection (use -version not --version)
- api/system/memory/route.ts: remove unused total variable in Linux path
- api/system/exec/route.ts: expand allowlist with Linux commands
  (head, tail, grep, which, ps, uname, free, lscpu, nvidia-smi, etc.)
- api/tts/route.ts: cross-platform venv path + CUDA/MPS label
- api/whisper/route.ts: Linux binary/model paths
- api/ollama/logs/route.ts: Linux log paths + WSL2 hint
- test_qwen_tts.py: platform-aware speech text + CUDA device detection
- test_orpheus_tts.py: platform-aware text, move import sys to top
- setup-guide.md: fix false auto-detect claim, add HF_MIRROR hint

2026-02-21 15:27:49 -08:00

8.8 KiB


Windows Setup Guide — Local LLM Stack on Razer Blade 18

Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
OS: Windows 11 Home + WSL2 (Ubuntu)
Goal: Mirror the macOS __LOCAL_LLMs stack: Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
See also: razer-blade-18-spec.md for full hardware specs

Architecture: Windows-Native + WSL2

┌────────────────────────────────────────────────────────┐
│  Windows 11                                            │
│  ├── NVIDIA drivers + CUDA (native)                    │
│  ├── Ollama (native Windows service, port 11434)       │
│  └── Browser → http://localhost:3000                   │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │  WSL2 (Ubuntu 24.04)                             │  │
│  │  ├── Node.js, Python 3.12, ffmpeg, git           │  │
│  │  ├── __LOCAL_LLMs/ (cloned here)                 │  │
│  │  │   ├── dashboard/ → npm run dev (port 3000)    │  │
│  │  │   ├── setup-tts.sh    (works as-is)           │  │
│  │  │   ├── start-dashboard.sh (works as-is)        │  │
│  │  │   └── models/ (SNAC, Qwen3-TTS)              │  │
│  │  ├── whisper-cpp (CUDA build)                    │  │
│  │  └── .venv-qwen-tts/ (PyTorch CUDA)             │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

Why WSL2? All existing bash scripts, Python venvs, and Node.js tooling work identically to macOS — zero porting. The dashboard API routes auto-detect macOS vs Linux at runtime via process.platform.

Phase 1: Windows-Native Setup

1. NVIDIA Drivers

# Install latest NVIDIA Game Ready or Studio drivers
# Download from: https://www.nvidia.com/Download/index.aspx

# Verify
nvidia-smi
# Should show: RTX 5090, 24 GB VRAM, and a CUDA version of 12.8 or newer
# (Blackwell GPUs such as the RTX 5090 need at least CUDA 12.8)

2. Ollama (Windows-Native)

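Ollama runs natively on Windows and listens on port 11434. From WSL2 it is reachable at localhost:11434 when WSL uses mirrored networking (networkingMode=mirrored in .wslconfig). With the default NAT networking, WSL2 has to reach it through the Windows host address instead, and Ollama must listen on all interfaces (set OLLAMA_HOST=0.0.0.0 on the Windows side).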

winget install --id Ollama.Ollama

# Verify
ollama --version

3. Pull Models (from Windows or WSL2)

ollama pull qwen2.5-coder:32b     # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b      # 4.7 GB — fast coding
ollama pull deepseek-r1:32b       # 19 GB — chain-of-thought
ollama pull llama3.1:8b            # 4.9 GB — fast general tasks
ollama pull sematre/orpheus:en    # 4 GB — text-to-speech (8 voices)

ollama list    # verify all 5 models

4. Install WSL2

# From PowerShell (Admin)
wsl --install -d Ubuntu-24.04
# Reboot if prompted, then set up username/password

Phase 2: WSL2 Setup

1. Install Dependencies

# Update
sudo apt update && sudo apt upgrade -y

# Node.js 20 LTS
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

# Python 3.12
sudo apt install -y python3.12 python3.12-venv python3-pip

# Build tools + ffmpeg
sudo apt install -y ffmpeg git curl build-essential cmake

# Verify
node --version        # 20.x+
python3.12 --version
nvidia-smi            # should show RTX 5090 (GPU passthrough from Windows)

Important: Do NOT install NVIDIA drivers inside WSL2. The Windows-side driver handles GPU passthrough automatically.

2. Clone Repo

mkdir -p ~/code/mygh && cd ~/code/mygh
git clone https://github.com/saravanakumardb1/learning_ai_common_plat.git
cd learning_ai_common_plat/__LOCAL_LLMs

Performance note: Always clone inside the WSL2 filesystem (~/code/...), not under /mnt/c/. File access across the Windows filesystem bridge is roughly 10x slower, which hurts most with node_modules.
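A quick guard can catch an accidental clone onto the Windows side before `npm install` runs. This is a minimal sketch, not part of the repo: `is_windows_mount` is a hypothetical helper, and it assumes WSL2's default automount layout, where Windows drives appear as `/mnt/<drive letter>/`.

```shell
# Hypothetical helper (not in the repo): succeed if a path lives on a
# Windows drive mounted at /mnt/<letter>/ (the WSL2 default automount root).
is_windows_mount() {
  case "$1" in
    /mnt/[a-zA-Z]|/mnt/[a-zA-Z]/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn when the current directory is on the slow Windows bridge.
if is_windows_mount "$PWD"; then
  echo "warning: $PWD is on the Windows filesystem; clone under ~/ instead" >&2
fi
```

If the automount root was changed in /etc/wsl.conf, the pattern would need adjusting.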

3. Whisper.cpp (CUDA build)

cd ~
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
sudo cp build/bin/whisper-cli /usr/local/bin/

# Download model (1.5 GB)
mkdir -p ~/whisper-models
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"

# Verify
whisper-cli --help    # prints usage if the binary and its libraries load

No corporate proxy on this machine — download directly from huggingface.co.

4. TTS Setup (One-Shot)

cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs

# Works exactly like macOS — downloads SNAC, Qwen3-TTS, creates venv
bash setup-tts.sh

The script detects macOS vs Linux and installs the correct PyTorch variant (MPS on macOS, CUDA on Linux). On a personal machine, override the default HuggingFace mirror: HF_MIRROR=https://huggingface.co bash setup-tts.sh
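The detection described above can be sketched roughly as follows. This is an illustration only, not the actual contents of setup-tts.sh: `detect_platform` and `torch_device_for` are hypothetical names, and the WSL check assumes the kernel release string contains "microsoft", as it does on stock WSL kernels.

```shell
# Hypothetical helper: classify the host from `uname -s` and `uname -r` output.
# Values are passed as arguments so the logic can be exercised on any OS.
detect_platform() {
  kernel_name=$1      # e.g. output of: uname -s
  kernel_release=$2   # e.g. output of: uname -r
  case "$kernel_name" in
    Darwin) echo "macos" ;;
    Linux)
      # Assumption: WSL kernels carry "microsoft" in the release string.
      case "$kernel_release" in
        *[Mm]icrosoft*) echo "wsl" ;;
        *) echo "linux" ;;
      esac ;;
    *) echo "unsupported" ;;
  esac
}

# Map the platform to the PyTorch device named in this guide.
torch_device_for() {
  case "$1" in
    macos) echo "mps" ;;
    linux|wsl) echo "cuda" ;;
    *) echo "cpu" ;;
  esac
}

platform=$(detect_platform "$(uname -s)" "$(uname -r)")
echo "platform=$platform device=$(torch_device_for "$platform")"
```

Separating `wsl` from plain `linux` is optional for choosing the PyTorch build, but it gives later steps (log paths, Ollama host lookup) something to branch on.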

5. Start Dashboard

bash start-dashboard.sh
# Open http://localhost:3000 in Windows browser

WSL2 automatically forwards ports — the dashboard is accessible from Windows at localhost:3000.

Key Differences: macOS vs WSL2

Area	macOS (any Mac)	WSL2 (any Linux)
GPU	Apple Silicon (MPS)	NVIDIA (CUDA)
Ollama	macOS native (Metal)	Windows native, accessed via localhost
PyTorch device	`mps`	`cuda`
Whisper install	`brew install whisper-cpp`	Build from source with CUDA
Package manager	Homebrew	apt
Shell scripts	Work as-is	Work as-is
Python venv path	`bin/python`	`bin/python` (same)
Dashboard	Identical	Identical
Ollama models path	`~/.ollama/models/`	Windows `%USERPROFILE%\.ollama\models\`
Model download	`hf-mirror.com` (corporate)	`huggingface.co` (direct)

Performance Expectations

Workload	macOS M4 Pro 48 GB	Razer RTX 5090 24 GB
qwen2.5-coder:32b inference	~15–25 tok/s	~40–60 tok/s
Whisper large-v3-turbo	~2–4x realtime	~8–15x realtime
Orpheus TTS	~realtime	~2–3x realtime
Qwen3-TTS	~realtime (MPS)	~2–4x realtime (CUDA)
70B quantized models	Fits in 48 GB (slow)	Partially offloads to 64 GB RAM

VRAM Budget (RTX 5090 — 24 GB)

Model	VRAM Usage	Fits in GPU?
llama3.1:8b	~5 GB	✅ Fully
qwen2.5-coder:7b	~5 GB	✅ Fully
sematre/orpheus:en	~4 GB	✅ Fully
qwen2.5-coder:32b	~19 GB	✅ Fully
deepseek-r1:32b	~19 GB	✅ Fully
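The figures above are per model. To check whether several models could be resident at the same time, the sizes can simply be summed against the 24 GB budget. The sketch below uses a hypothetical `fits_in_vram` helper with the table's approximate whole-gigabyte values; it ignores KV cache and context overhead, so treat a narrow fit as a miss.

```shell
# Hypothetical helper: check whether a set of models fits a VRAM budget together.
# Usage: fits_in_vram BUDGET_GB SIZE_GB [SIZE_GB ...]   (approximate whole GB)
fits_in_vram() {
  budget=$1; shift
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  echo "total=${total}GB budget=${budget}GB"
  [ "$total" -le "$budget" ]   # exit status: 0 if it fits, 1 otherwise
}

# qwen2.5-coder:32b (~19 GB) plus Orpheus (~4 GB)
fits_in_vram 24 19 4  && echo "32b coder + Orpheus: fits"
# qwen2.5-coder:32b plus deepseek-r1:32b (~19 GB each)
fits_in_vram 24 19 19 || echo "two 32b models: do not fit together"
```

By this arithmetic, either 32b model can share the GPU with Orpheus, but the two 32b models together would need about 38 GB.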

Quick Reference — Full Setup Checklist

Windows Side

[ ] Install NVIDIA drivers (Game Ready or Studio)
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull all 5 models
[ ] Install WSL2 (wsl --install -d Ubuntu-24.04)

WSL2 Side

[ ] Install Node.js 20+, Python 3.12, ffmpeg, git, cmake
[ ] Verify nvidia-smi shows RTX 5090
[ ] Clone repo into ~/code/mygh/
[ ] Build whisper-cpp with CUDA
[ ] Download Whisper model to ~/whisper-models/
[ ] Run: bash setup-tts.sh
[ ] Run: bash start-dashboard.sh
[ ] Verify: http://localhost:3000 shows all green
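
Most of the WSL2-side items can be spot-checked from a shell before starting the dashboard. The following is a minimal sketch; `check_cmd` is a hypothetical helper, and the tool list is taken from the steps in this guide.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok       $1"
  else
    echo "MISSING  $1"
    return 1
  fi
}

# Count how many of the tools this guide installs are still missing.
missing=0
for cmd in node python3.12 ffmpeg git cmake nvidia-smi whisper-cli; do
  check_cmd "$cmd" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

This only confirms the binaries are on PATH. It does not verify versions, the Whisper model download, or the TTS venv.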

Troubleshooting

Ollama not accessible from WSL2

curl http://localhost:11434/api/tags
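# If this fails, check the Windows firewall, confirm Ollama is listening on
# all interfaces (OLLAMA_HOST=0.0.0.0 on the Windows side), or try the Windows host name: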
curl http://$(hostname).local:11434/api/tags
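
The two probes above can be combined into one lookup that reports the first address that answers. This is a sketch under assumptions: `ollama_candidates` and `find_ollama` are hypothetical helpers, and the 3-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: print candidate Ollama base URLs, most likely first.
ollama_candidates() {
  port=${1:-11434}
  echo "http://localhost:$port"
  echo "http://$(hostname).local:$port"
}

# Try each candidate and print the first one that answers /api/tags.
find_ollama() {
  for base in $(ollama_candidates "$@"); do
    if curl -fsS --max-time 3 "$base/api/tags" >/dev/null 2>&1; then
      echo "$base"
      return 0
    fi
  done
  return 1
}

if base=$(find_ollama); then
  echo "Ollama reachable at $base"
else
  echo "Ollama not reachable from WSL2" >&2
fi
```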

CUDA not visible in WSL2

nvidia-smi
# If "command not found":
# 1. Update Windows NVIDIA drivers to latest
# 2. Run: wsl --update
# 3. Do NOT install nvidia-driver-* inside WSL2

Slow filesystem performance

# Clone repos inside WSL2 filesystem: ~/code/...
# NOT in /mnt/c/ (Windows→WSL bridge is ~10x slower for node_modules)
