Windows Setup Guide — Local LLM Stack on Razer Blade 18
Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
OS: Windows 11 Home + WSL2 (Ubuntu)
Goal: Mirror the macOS __LOCAL_LLMs stack: Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
See also: razer-blade-18-spec.md for full hardware specs
Architecture: Windows-Native + WSL2
┌────────────────────────────────────────────────────────┐
│  Windows 11                                            │
│  ├── NVIDIA drivers + CUDA (native)                    │
│  ├── Ollama (native Windows service, port 11434)       │
│  └── Browser → http://localhost:3000                   │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │  WSL2 (Ubuntu 24.04)                             │  │
│  │  ├── Node.js, Python 3.12, ffmpeg, git           │  │
│  │  ├── __LOCAL_LLMs/ (cloned here)                 │  │
│  │  │   ├── dashboard/ → npm run dev (port 3000)    │  │
│  │  │   ├── setup-tts.sh (works as-is)              │  │
│  │  │   ├── start-dashboard.sh (works as-is)        │  │
│  │  │   └── models/ (SNAC, Qwen3-TTS)               │  │
│  │  ├── whisper-cpp (CUDA build)                    │  │
│  │  └── .venv-qwen-tts/ (PyTorch CUDA)              │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
Why WSL2? All existing bash scripts, Python venvs, and Node.js tooling work identically to macOS — zero porting. The dashboard API routes auto-detect macOS vs Linux at runtime via process.platform.
Phase 1: Windows-Native Setup
1. NVIDIA Drivers
# Install latest NVIDIA Game Ready or Studio drivers
# Download from: https://www.nvidia.com/Download/index.aspx
# Verify
nvidia-smi
# Should show: RTX 5090, 24 GB VRAM, CUDA 13.x+
2. Ollama (Windows-Native)
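Ollama runs natively on Windows. From WSL2 it is reachable at localhost:11434 when WSL2 uses mirrored networking; under the default NAT networking, localhost inside WSL2 refers to the Linux VM itself, so use the Windows host address instead (see "Ollama not accessible from WSL2" under Troubleshooting).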
winget install --id Ollama.Ollama
# Verify
ollama --version
3. Pull Models (from Windows or WSL2)
ollama pull qwen2.5-coder:32b # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b # 4.7 GB — fast coding
ollama pull deepseek-r1:32b # 19 GB — chain-of-thought
ollama pull llama3.1:8b # 4.9 GB — fast general tasks
ollama pull sematre/orpheus:en # 4 GB — text-to-speech (8 voices)
ollama list # verify all 5 models
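The "verify all 5 models" step can be automated. The following is a minimal sketch, assuming that the first whitespace-separated column of each `ollama list` row is the model name and that the first line is a header; the function name `missing_models` and the `REQUIRED_MODELS` variable are hypothetical, not part of the repo.

```shell
# Models this guide expects to be present (taken from the pull commands above).
REQUIRED_MODELS="qwen2.5-coder:32b qwen2.5-coder:7b deepseek-r1:32b llama3.1:8b sematre/orpheus:en"

# missing_models (hypothetical helper): reads `ollama list` output on stdin
# and prints one line per required model that is not installed.
missing_models() {
  # Skip the header row and keep only the NAME column.
  installed=$(awk 'NR > 1 { print $1 }')
  for model in $REQUIRED_MODELS; do
    # -x: whole-line match, -F: fixed string (no regex), -q: no output
    printf '%s\n' "$installed" | grep -qxF "$model" || printf '%s\n' "$model"
  done
}
```

Run it as: ollama list | missing_models. Empty output means all five models are installed.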
4. Install WSL2
# From PowerShell (Admin)
wsl --install -d Ubuntu-24.04
# Reboot if prompted, then set up username/password
Phase 2: WSL2 Setup
1. Install Dependencies
# Update
sudo apt update && sudo apt upgrade -y
# Node.js 20 LTS
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
# Python 3.12
sudo apt install -y python3.12 python3.12-venv python3-pip
# Build tools + ffmpeg
sudo apt install -y ffmpeg git curl build-essential cmake
# Verify
node --version # 20.x+
python3.12 --version
nvidia-smi # should show RTX 5090 (GPU passthrough from Windows)
Important: Do NOT install NVIDIA drivers inside WSL2. The Windows-side driver handles GPU passthrough automatically.
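For a more compact check than the full nvidia-smi table, the query flags of nvidia-smi can print just the GPU name and total memory. This is a sketch only: the function name `gpu_summary` and the `NVIDIA_SMI` override variable are hypothetical conveniences, and the exact name string the driver reports for this GPU is not something the guide specifies.

```shell
# gpu_summary (hypothetical helper): print "name, total memory" for each GPU
# visible inside WSL2, or explain what to do if nvidia-smi is missing.
gpu_summary() {
  # NVIDIA_SMI is an assumed override, useful for pointing at a different binary.
  smi="${NVIDIA_SMI:-nvidia-smi}"
  if ! command -v "$smi" >/dev/null 2>&1; then
    echo "nvidia-smi not found: update the Windows NVIDIA driver and run 'wsl --update'" >&2
    return 1
  fi
  "$smi" --query-gpu=name,memory.total --format=csv,noheader
}
```

Calling gpu_summary inside WSL2 should print one line per GPU; a non-zero exit status means passthrough is not working yet.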
2. Clone Repo
mkdir -p ~/code/mygh && cd ~/code/mygh
git clone https://github.com/saravanakumardb1/learning_ai_common_plat.git
cd learning_ai_common_plat/__LOCAL_LLMs
Performance note: Always clone inside the WSL2 filesystem (~/code/...), NOT in /mnt/c/. The Windows filesystem bridge is very slow for node_modules.
3. Whisper.cpp (CUDA build)
cd ~
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
sudo cp build/bin/whisper-cli /usr/local/bin/
# Download model (1.5 GB)
mkdir -p ~/whisper-models
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
# Verify the binary runs (prints usage)
whisper-cli --help
No corporate proxy on this machine, so download directly from huggingface.co.
4. TTS Setup (One-Shot)
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
# Works exactly like macOS — downloads SNAC, Qwen3-TTS, creates venv
bash setup-tts.sh
The script detects macOS vs Linux and installs the correct PyTorch variant (MPS on macOS, CUDA on Linux). On a personal machine, override the default HuggingFace mirror:
HF_MIRROR=https://huggingface.co bash setup-tts.sh
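To illustrate the kind of OS detection described above, here is a minimal sketch. It is not the actual contents of setup-tts.sh; the function names `detect_platform` and `torch_backend` are hypothetical, and the WSL check assumes the kernel string in /proc/version contains "microsoft", which is the case on WSL kernels.

```shell
# detect_platform (hypothetical helper): print macos, wsl, linux, or unknown.
detect_platform() {
  case "$(uname -s)" in
    Darwin) echo "macos" ;;
    Linux)
      # WSL kernels identify themselves with "microsoft" in /proc/version.
      if grep -qi microsoft /proc/version 2>/dev/null; then
        echo "wsl"
      else
        echo "linux"
      fi
      ;;
    *) echo "unknown" ;;
  esac
}

# torch_backend (hypothetical helper): map a platform label to the
# PyTorch device named in this guide, falling back to cpu.
torch_backend() {
  case "$1" in
    macos) echo "mps" ;;
    wsl | linux) echo "cuda" ;;
    *) echo "cpu" ;;
  esac
}
```

For example, torch_backend "$(detect_platform)" prints cuda inside WSL2 and mps on a Mac.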
5. Start Dashboard
bash start-dashboard.sh
# Open http://localhost:3000 in Windows browser
WSL2 automatically forwards ports — the dashboard is accessible from Windows at localhost:3000.
Key Differences: macOS vs WSL2
| Area | macOS (any Mac) | WSL2 (any Linux) |
|---|---|---|
| GPU | Apple Silicon (MPS) | NVIDIA (CUDA) |
| Ollama | macOS native (Metal) | Windows native, accessed via localhost |
| PyTorch device | mps | cuda |
| Whisper install | brew install whisper-cpp | Build from source with CUDA |
| Package manager | Homebrew | apt |
| Shell scripts | Work as-is | Work as-is |
| Python venv path | bin/python | bin/python (same) |
| Dashboard | Identical | Identical |
| Ollama models path | ~/.ollama/models/ | Windows %USERPROFILE%\.ollama\ |
| Model download | hf-mirror.com (corporate) | huggingface.co (direct) |
Performance Expectations
| Workload | macOS M4 Pro 48 GB | Razer RTX 5090 24 GB |
|---|---|---|
| qwen2.5-coder:32b inference | ~15–25 tok/s | ~40–60 tok/s |
| Whisper large-v3-turbo | ~2–4x realtime | ~8–15x realtime |
| Orpheus TTS | ~realtime | ~2–3x realtime |
| Qwen3-TTS | ~realtime (MPS) | ~2–4x realtime (CUDA) |
| 70B quantized models | Fits in 48 GB (slow) | Partially offloads to 64 GB RAM |
VRAM Budget (RTX 5090 — 24 GB)
| Model | VRAM Usage | Fits in GPU? |
|---|---|---|
| llama3.1:8b | ~5 GB | ✅ Fully |
| qwen2.5-coder:7b | ~5 GB | ✅ Fully |
| sematre/orpheus:en | ~4 GB | ✅ Fully |
| qwen2.5-coder:32b | ~19 GB | ✅ Fully |
| deepseek-r1:32b | ~19 GB | ✅ Fully |
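The table lists each model on its own. Whether a combination fits is simple addition against the 24 GB budget, assuming the models are resident at the same time. The sketch below uses the approximate figures from the table and ignores context (KV cache) overhead, so treat the result as a rough upper bound on what fits; `fits_in_vram` and `VRAM_BUDGET_GB` are hypothetical names.

```shell
# Total VRAM on the RTX 5090 in this guide, in whole GB.
VRAM_BUDGET_GB=24

# fits_in_vram (hypothetical helper): sum per-model VRAM estimates (whole GB)
# and report whether the total stays within the budget.
fits_in_vram() {
  total=0
  for gb in "$@"; do
    total=$((total + gb))
  done
  if [ "$total" -le "$VRAM_BUDGET_GB" ]; then
    echo "fits: ${total} GB of ${VRAM_BUDGET_GB} GB"
  else
    echo "exceeds: ${total} GB of ${VRAM_BUDGET_GB} GB"
    return 1
  fi
}
```

For example, fits_in_vram 19 4 (qwen2.5-coder:32b plus sematre/orpheus:en) reports 23 GB and fits, while fits_in_vram 19 19 (both 32B models together) reports 38 GB and does not.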
Quick Reference — Full Setup Checklist
Windows Side
[ ] Install NVIDIA drivers (Game Ready or Studio)
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull all 5 models
[ ] Install WSL2 (wsl --install -d Ubuntu-24.04)
WSL2 Side
[ ] Install Node.js 20+, Python 3.12, ffmpeg, git, cmake
[ ] Verify nvidia-smi shows RTX 5090
[ ] Clone repo into ~/code/mygh/
[ ] Build whisper-cpp with CUDA
[ ] Download Whisper model to ~/whisper-models/
[ ] Run: bash setup-tts.sh
[ ] Run: bash start-dashboard.sh
[ ] Verify: http://localhost:3000 shows all green
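The tool-related items on the WSL2 checklist can be verified in one pass. This is a small sketch using only `command -v`; the function name `missing_tools` is hypothetical, and it checks presence on PATH only, not versions.

```shell
# missing_tools (hypothetical helper): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Typical call for this guide's WSL2 prerequisites:
#   missing_tools node python3.12 ffmpeg git cmake nvidia-smi whisper-cli
```

Empty output means every listed tool was found.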
Troubleshooting
Ollama not accessible from WSL2
curl http://localhost:11434/api/tags
# If this fails, check the Windows firewall or try:
curl http://$(hostname).local:11434/api/tags
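The two curl attempts above can be generalized into a probe that tries several candidate addresses in turn. This is a sketch under stated assumptions: the function names are hypothetical, and the third candidate assumes that under WSL2's default NAT networking the default gateway is the Windows host. Ollama on Windows may also need to listen on a non-loopback address (its OLLAMA_HOST setting) before the gateway address answers.

```shell
# ollama_candidates (hypothetical helper): print candidate base URLs,
# most likely first.
ollama_candidates() {
  echo "http://localhost:11434"
  echo "http://$(hostname).local:11434"
  # Assumption: under NAT networking the default gateway is the Windows host.
  gw=$(ip route show default 2>/dev/null | awk '{ print $3; exit }')
  [ -n "$gw" ] && echo "http://${gw}:11434"
  return 0
}

# first_reachable (hypothetical helper): print the first base URL whose
# /api/tags endpoint answers within 3 seconds; fail if none does.
first_reachable() {
  for base in "$@"; do
    if curl -fsS --max-time 3 "${base}/api/tags" >/dev/null 2>&1; then
      echo "$base"
      return 0
    fi
  done
  return 1
}
```

A typical call is first_reachable $(ollama_candidates), which prints the working base URL or exits non-zero if Ollama cannot be reached at any candidate.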
CUDA not visible in WSL2
nvidia-smi
# If "command not found":
# 1. Update Windows NVIDIA drivers to latest
# 2. Run: wsl --update
# 3. Do NOT install nvidia-driver-* inside WSL2
Slow filesystem performance
# Clone repos inside WSL2 filesystem: ~/code/...
# NOT in /mnt/c/ (Windows→WSL bridge is ~10x slower for node_modules)
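A quick guard can catch a checkout that landed on the Windows side by mistake. The sketch below assumes Windows drives are mounted at the WSL default of /mnt/<drive letter>; `on_windows_mount` is a hypothetical helper name.

```shell
# on_windows_mount (hypothetical helper): succeed if the given path is on a
# Windows drive mount such as /mnt/c, fail otherwise.
on_windows_mount() {
  case "$1" in
    /mnt/[a-z] | /mnt/[a-z]/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Typical call from inside the repo:
#   on_windows_mount "$PWD" && echo "Warning: repo is on the Windows filesystem" >&2
```

The pattern matches single-letter drive mounts only, so other directories under /mnt are not flagged.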