- setup-windows.ps1: PowerShell script for Windows side
  - NVIDIA driver verification, Ollama install via winget
  - Pull all 5 models with skip-if-exists logic
  - WSL2 Ubuntu 24.04 install
- setup-wsl.sh: Bash script for WSL2 side
  - Idempotent apt deps (Node.js 20, Python 3.12, ffmpeg, cmake)
  - CUDA GPU passthrough verification
  - Repo clone + git pull, whisper.cpp CUDA build
  - Whisper model download, TTS setup, dashboard start
- README.md: 2-step quick start (no IDE required)
- setup-guide.md: add automated setup section at top
Windows Setup Guide — Local LLM Stack on Razer Blade 18
Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
OS: Windows 11 Home + WSL2 (Ubuntu)
Goal: Mirror the macOS __LOCAL_LLMs stack: Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
See also: razer-blade-18-spec.md for full hardware specs
Automated Setup (Recommended)
Two scripts, zero IDE required. See README.md for the quick start, or run directly:
# Step 1 — PowerShell (Admin) on Windows
Set-ExecutionPolicy -Scope Process Bypass
.\setup-windows.ps1
# Reboot if WSL2 was just installed
# Step 2 — Ubuntu (WSL2) terminal
curl -fsSL https://raw.githubusercontent.com/saravanakumardb1/learning_ai_common_plat/main/__LOCAL_LLMs/windows_specific/setup-wsl.sh | bash
The rest of this guide covers each step in detail for reference and troubleshooting.
Architecture: Windows-Native + WSL2
┌────────────────────────────────────────────────────────┐
│ Windows 11 │
│ ├── NVIDIA drivers + CUDA (native) │
│ ├── Ollama (native Windows service, port 11434) │
│ └── Browser → http://localhost:3000 │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ WSL2 (Ubuntu 24.04) │ │
│ │ ├── Node.js, Python 3.12, ffmpeg, git │ │
│ │ ├── __LOCAL_LLMs/ (cloned here) │ │
│ │ │ ├── dashboard/ → npm run dev (port 3000) │ │
│ │ │ ├── setup-tts.sh (works as-is) │ │
│ │ │ ├── start-dashboard.sh (works as-is) │ │
│ │ │ └── models/ (SNAC, Qwen3-TTS) │ │
│ │ ├── whisper-cpp (CUDA build) │ │
│ │ └── .venv-qwen-tts/ (PyTorch CUDA) │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
Why WSL2? All existing bash scripts, Python venvs, and Node.js tooling work identically to macOS — zero porting. The dashboard API routes auto-detect macOS vs Linux at runtime via process.platform.
Phase 1: Windows-Native Setup
1. NVIDIA Drivers
# Install latest NVIDIA Game Ready or Studio drivers
# Download from: https://www.nvidia.com/Download/index.aspx
# Verify
nvidia-smi
# Should show: RTX 5090, 24 GB VRAM, and CUDA Version 12.8 or newer (the exact value depends on the installed driver)
2. Ollama (Windows-Native)
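Ollama runs natively on Windows and listens on port 11434. With WSL mirrored networking it is reachable from WSL2 at localhost:11434; with the default NAT networking, localhost inside WSL2 may not reach Windows services, so use the fallback under Troubleshooting if the connection fails.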
winget install --id Ollama.Ollama
# Verify
ollama --version
3. Pull Models (from Windows or WSL2)
ollama pull qwen2.5-coder:32b # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b # 4.7 GB — fast coding
ollama pull deepseek-r1:32b # 19 GB — chain-of-thought
ollama pull llama3.1:8b # 4.9 GB — fast general tasks
ollama pull sematre/orpheus:en # 4 GB — text-to-speech (8 voices)
ollama list # verify all 5 models
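Re-running the pulls is cheaper when models that are already present are skipped. The following is a minimal sketch of that skip-if-exists idea, not the logic from setup-windows.ps1. `model_present`, `pull_if_missing`, and `pull_all_models` are hypothetical helper names, and the check assumes `ollama list` prints a header row followed by one model per line with the name in the first column.

```shell
# Sketch only: model_present, pull_if_missing and pull_all_models are
# hypothetical helpers, not part of the repo scripts.

# Succeeds if the exact model name appears in the first column of `ollama list`.
# Assumes the first output line is a header row.
model_present() {
  ollama list | awk 'NR > 1 { print $1 }' | grep -Fxq -- "$1"
}

# Pulls a model only when it is not already listed.
pull_if_missing() {
  if model_present "$1"; then
    echo "skip: $1 already present"
  else
    echo "pull: $1"
    ollama pull "$1"
  fi
}

# Pulls the five models from this guide, skipping any that already exist.
pull_all_models() {
  for model in qwen2.5-coder:32b qwen2.5-coder:7b deepseek-r1:32b \
               llama3.1:8b sematre/orpheus:en; do
    pull_if_missing "$model" || return 1
  done
}
```

After sourcing these functions, `pull_all_models` can be run repeatedly; only missing models trigger a download.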
4. Install WSL2
# From PowerShell (Admin)
wsl --install -d Ubuntu-24.04
# Reboot if prompted, then set up username/password
Phase 2: WSL2 Setup
1. Install Dependencies
# Update
sudo apt update && sudo apt upgrade -y
# Node.js 20 LTS
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
# Python 3.12
sudo apt install -y python3.12 python3.12-venv python3-pip
# Build tools + ffmpeg
sudo apt install -y ffmpeg git curl build-essential cmake
# Verify
node --version # 20.x+
python3.12 --version
nvidia-smi # should show RTX 5090 (GPU passthrough from Windows)
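The install commands above are safe to re-run, but a script can avoid calling apt at all when nothing is missing. Here is a rough sketch under that assumption; `missing_packages` and `install_missing` are hypothetical helpers, not taken from setup-wsl.sh. Note that `dpkg -s` also succeeds for packages that were removed but left config files behind, so the check is approximate.

```shell
# Sketch only: missing_packages and install_missing are hypothetical helpers.

# Prints each named package that dpkg does not know as installed.
missing_packages() {
  for pkg in "$@"; do
    dpkg -s "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
  done
}

# Installs only what is missing; does nothing when everything is present.
install_missing() {
  to_install=$(missing_packages "$@")
  if [ -z "$to_install" ]; then
    echo "all packages already installed"
    return 0
  fi
  # Word splitting of $to_install is intentional: one argument per package.
  sudo apt install -y $to_install
}
```

A call such as `install_missing ffmpeg git curl build-essential cmake` would then cover the build-tools line of this step.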
Important: Do NOT install NVIDIA drivers inside WSL2. The Windows-side driver handles GPU passthrough automatically.
2. Clone Repo
mkdir -p ~/code/mygh && cd ~/code/mygh
git clone https://github.com/saravanakumardb1/learning_ai_common_plat.git
cd learning_ai_common_plat/__LOCAL_LLMs
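Running `git clone` a second time fails because the directory already exists. A small sketch of a clone-or-update step that can be repeated safely is shown here; `clone_or_pull` is a hypothetical helper name, and using `--ff-only` is an assumption chosen so that local changes are never merged silently.

```shell
# Sketch only: clone_or_pull is a hypothetical helper.
# Usage: clone_or_pull REPO_URL DEST_DIR
clone_or_pull() {
  repo_url="$1"
  dest="$2"
  if [ -d "$dest/.git" ]; then
    # Existing checkout: fast-forward only, never create a merge commit.
    git -C "$dest" pull --ff-only
  else
    git clone "$repo_url" "$dest"
  fi
}
```

For this guide the call would be `clone_or_pull https://github.com/saravanakumardb1/learning_ai_common_plat.git ~/code/mygh/learning_ai_common_plat`.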
Performance note: Always clone inside the WSL2 filesystem (~/code/...), NOT in /mnt/c/. The Windows filesystem bridge is very slow for node_modules.
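A script can warn early when it is started from a Windows-mounted path. The sketch below assumes Windows drives are mounted at /mnt/<drive letter>, which is the WSL default; `on_windows_mount` is a hypothetical helper name.

```shell
# Sketch only: on_windows_mount is a hypothetical helper.

# Succeeds if the path is on a Windows drive mounted under /mnt/<letter>.
on_windows_mount() {
  case "$1" in
    /mnt/[a-z] | /mnt/[a-z]/*) return 0 ;;
    *) return 1 ;;
  esac
}

if on_windows_mount "$PWD"; then
  echo "warning: $PWD is on the Windows filesystem; clone under ~/ instead" >&2
fi
```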
3. Whisper.cpp (CUDA build)
cd ~
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
sudo cp build/bin/whisper-cli /usr/local/bin/
# Download model (1.5 GB)
mkdir -p ~/whisper-models
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
# Verify (prints the usage text if the binary runs)
whisper-cli --help
No corporate proxy on this machine, so download directly from huggingface.co.
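The model file is about 1.5 GB, so it is worth skipping the download when the file is already there. A minimal sketch follows; `download_if_missing` is a hypothetical helper, and writing to a `.part` file first is an assumption made so that an interrupted download is not mistaken for a complete model.

```shell
# Sketch only: download_if_missing is a hypothetical helper.
# Usage: download_if_missing URL DEST_FILE
download_if_missing() {
  url="$1"
  dest="$2"
  # -s is true only for an existing, non-empty file.
  if [ -s "$dest" ]; then
    echo "skip: $dest already exists"
    return 0
  fi
  mkdir -p "$(dirname "$dest")"
  # Download to a temporary name and rename only on success.
  curl -L --fail -o "$dest.part" "$url" && mv "$dest.part" "$dest"
}
```

With the URL and path from the step above, the call would be `download_if_missing "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" ~/whisper-models/ggml-large-v3-turbo.bin`.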
4. TTS Setup (One-Shot)
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
# Works exactly like macOS — downloads SNAC, Qwen3-TTS, creates venv
bash setup-tts.sh
The script detects macOS vs Linux and installs the matching PyTorch variant (MPS on macOS, CUDA on Linux). On a personal machine, override the default HuggingFace mirror:
HF_MIRROR=https://huggingface.co bash setup-tts.sh
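To make the platform switch and the mirror override concrete, here is an illustrative sketch. It is not copied from setup-tts.sh: `torch_backend_for` and `hf_base_url` are hypothetical helpers, the `cpu` fallback is an assumption, and the default of hf-mirror.com is inferred from the comparison table in this guide.

```shell
# Sketch only: these helpers are hypothetical, not copied from setup-tts.sh.

# Maps a `uname -s` value to the PyTorch backend named in this guide.
# The cpu fallback for other systems is an assumption.
torch_backend_for() {
  case "$1" in
    Darwin) echo "mps" ;;
    Linux)  echo "cuda" ;;
    *)      echo "cpu" ;;
  esac
}

# Returns the download base URL: HF_MIRROR if set, else an assumed default.
hf_base_url() {
  echo "${HF_MIRROR:-https://hf-mirror.com}"
}

echo "backend: $(torch_backend_for "$(uname -s)")"
echo "download base: $(hf_base_url)"
```

Inside WSL2, `uname -s` reports `Linux`, which is why the same script picks the CUDA path there.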
5. Start Dashboard
bash start-dashboard.sh
# Open http://localhost:3000 in Windows browser
WSL2 automatically forwards ports — the dashboard is accessible from Windows at localhost:3000.
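The dashboard needs a few seconds before it answers, so a script that opens the browser or runs checks may want to wait for the port first. A small polling sketch is shown here; `wait_for_http` is a hypothetical helper and the default of 30 one-second attempts is an arbitrary choice.

```shell
# Sketch only: wait_for_http is a hypothetical helper.
# Usage: wait_for_http URL [MAX_TRIES]
wait_for_http() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timeout: $url"
  return 1
}
```

For example, `wait_for_http http://localhost:3000` can be run in a second terminal after starting the dashboard.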
Key Differences: macOS vs WSL2
| Area | macOS (any Mac) | WSL2 (any Linux) |
|---|---|---|
| GPU | Apple Silicon (MPS) | NVIDIA (CUDA) |
| Ollama | macOS native (Metal) | Windows native, accessed via localhost |
| PyTorch device | `mps` | `cuda` |
| Whisper install | `brew install whisper-cpp` | Build from source with CUDA |
| Package manager | Homebrew | apt |
| Shell scripts | Work as-is | Work as-is |
| Python venv path | `bin/python` | `bin/python` (same) |
| Dashboard | Identical | Identical |
| Ollama models path | `~/.ollama/models/` | Windows `%USERPROFILE%\.ollama\` |
| Model download | hf-mirror.com (corporate) | huggingface.co (direct) |
Performance Expectations
| Workload | macOS M4 Pro 48 GB | Razer RTX 5090 24 GB |
|---|---|---|
| qwen2.5-coder:32b inference | ~15–25 tok/s | ~40–60 tok/s |
| Whisper large-v3-turbo | ~2–4x realtime | ~8–15x realtime |
| Orpheus TTS | ~realtime | ~2–3x realtime |
| Qwen3-TTS | ~realtime (MPS) | ~2–4x realtime (CUDA) |
| 70B quantized models | Fits in 48 GB (slow) | Partially offloads to 64 GB RAM |
VRAM Budget (RTX 5090 — 24 GB)
| Model | VRAM Usage | Fits in GPU? |
|---|---|---|
| llama3.1:8b | ~5 GB | ✅ Fully |
| qwen2.5-coder:7b | ~5 GB | ✅ Fully |
| sematre/orpheus:en | ~4 GB | ✅ Fully |
| qwen2.5-coder:32b | ~19 GB | ✅ Fully |
| deepseek-r1:32b | ~19 GB | ✅ Fully |
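The "Fits in GPU?" column looks at each model on its own. Whether several models can stay loaded at the same time comes down to adding the rows and comparing against 24 GB. The sketch below does that arithmetic with the approximate figures from the table; `fits_in_vram` is a hypothetical helper, and context (KV cache) memory is not counted, so real headroom is smaller.

```shell
# Sketch only: fits_in_vram is a hypothetical helper.
# Usage: fits_in_vram BUDGET_GB SIZE_GB [SIZE_GB ...]
# Sizes are whole gigabytes from the table above; KV cache is not included.
fits_in_vram() {
  budget="$1"
  shift
  total=0
  for gb in "$@"; do
    total=$((total + gb))
  done
  echo "total=${total}GB budget=${budget}GB"
  [ "$total" -le "$budget" ]
}

# llama3.1:8b + qwen2.5-coder:7b + sematre/orpheus:en
fits_in_vram 24 5 5 4 && echo "the three small models fit together"
# qwen2.5-coder:32b + deepseek-r1:32b
fits_in_vram 24 19 19 || echo "two 32B models do not fit together"
```

By these numbers, only one 32B model can be resident at a time.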
Quick Reference — Full Setup Checklist
Windows Side
[ ] Install NVIDIA drivers (Game Ready or Studio)
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull all 5 models
[ ] Install WSL2 (wsl --install -d Ubuntu-24.04)
WSL2 Side
[ ] Install Node.js 20+, Python 3.12, ffmpeg, git, cmake
[ ] Verify nvidia-smi shows RTX 5090
[ ] Clone repo into ~/code/mygh/
[ ] Build whisper-cpp with CUDA
[ ] Download Whisper model to ~/whisper-models/
[ ] Run: bash setup-tts.sh
[ ] Run: bash start-dashboard.sh
[ ] Verify: http://localhost:3000 shows all green
Troubleshooting
Ollama not accessible from WSL2
curl http://localhost:11434/api/tags
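# If this fails, check the Windows firewall, make sure Ollama listens on all interfaces
# (OLLAMA_HOST=0.0.0.0 set on the Windows side, then restart Ollama), or try the Windows host name: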
curl http://$(hostname).local:11434/api/tags
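The two checks above can be combined into one probe that reports the first address that answers. This is a sketch, not part of the repo scripts: `find_ollama_url` is a hypothetical helper, and the candidate list is limited to the two addresses used in this section.

```shell
# Sketch only: find_ollama_url is a hypothetical helper.

# Prints the first base URL where the Ollama API answers, or fails.
find_ollama_url() {
  for base in "http://localhost:11434" "http://$(hostname).local:11434"; do
    if curl -fsS --max-time 3 "$base/api/tags" >/dev/null 2>&1; then
      echo "$base"
      return 0
    fi
  done
  return 1
}
```

The result could then be exported, for example `export OLLAMA_HOST="$(find_ollama_url)"`, so that the `ollama` CLI inside WSL2 talks to the Windows service; whether the dashboard reads that variable is not stated in this guide.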
CUDA not visible in WSL2
nvidia-smi
# If "command not found":
# 1. Update Windows NVIDIA drivers to latest
# 2. Run: wsl --update
# 3. Do NOT install nvidia-driver-* inside WSL2
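A setup script can run the same check and fail with a clear hint. The following is a minimal sketch; `check_gpu_passthrough` is a hypothetical helper, and it relies only on the `--query-gpu` and `--format` options of `nvidia-smi`.

```shell
# Sketch only: check_gpu_passthrough is a hypothetical helper.

# Prints one "name, total memory" line per GPU, or a hint if nvidia-smi is absent.
check_gpu_passthrough() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: update the Windows NVIDIA driver, then run 'wsl --update'"
    return 1
  fi
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
}
```

Calling `check_gpu_passthrough || exit 1` near the top of a script stops the run before any CUDA build is attempted.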
Slow filesystem performance
# Clone repos inside WSL2 filesystem: ~/code/...
# NOT in /mnt/c/ (Windows→WSL bridge is ~10x slower for node_modules)