Windows Setup Guide — Local LLM Stack on Razer Blade 18
Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
OS: Windows 11 Home
Goal: Mirror the macOS __LOCAL_LLMs stack: Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
See also: razer-blade-18-spec.md for full hardware specs
Prerequisites
1. Windows Package Manager
winget ships with Windows 11, so it only needs verifying. Scoop is optional but handy for CLI tools:
# Verify winget
winget --version
# Install Scoop (optional, useful for dev tools)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression
2. NVIDIA CUDA Toolkit
The RTX 5090 needs the latest CUDA drivers for GPU-accelerated inference.
# Install NVIDIA drivers (latest Game Ready or Studio)
winget install --id Nvidia.GeForceExperience
# Install CUDA Toolkit (required for PyTorch CUDA)
winget install --id Nvidia.CUDA
# Or download from: https://developer.nvidia.com/cuda-downloads
# Verify
nvidia-smi
Expected output should show:
- RTX 5090 with 24 GB VRAM
- CUDA version 12.8 or newer (the minimum for Blackwell GPUs such as the RTX 5090)
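The same check can be scripted. The sketch below assumes `nvidia-smi` is on PATH and uses its CSV query mode; the helper names `parse_gpu_line` and `query_gpus` are illustrative and not part of this repo.

```python
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV line into (name, memory in MiB)."""
    name, mem = [part.strip() for part in line.split(",", 1)]
    # With --format=csv,noheader,nounits the memory column is a bare MiB integer.
    return name, int(mem)


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

Calling `query_gpus()` on this machine should return one entry whose name contains "RTX 5090" and whose memory is close to 24 GB. Note that nvidia-smi reports MiB, so the figure will not be a round number.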
3. Node.js (for Mission Control Dashboard)
winget install --id OpenJS.NodeJS.LTS
# Verify
node --version # should be 20.x+
npm --version
4. Python 3.12
winget install --id Python.Python.3.12
# Verify
python --version
pip --version
5. Git
winget install --id Git.Git
6. ffmpeg
winget install --id Gyan.FFmpeg
# Or: scoop install ffmpeg
1. Ollama — LLM Server
Install
winget install --id Ollama.Ollama
Ollama for Windows runs as a background service and automatically uses CUDA (RTX 5090).
Verify
ollama --version
# curl.exe bypasses the curl alias (Invoke-WebRequest) in Windows PowerShell 5.1
curl.exe http://localhost:11434/api/tags
Download Models
# Coding
ollama pull qwen2.5-coder:32b # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b # 4.7 GB — fast coding
# Reasoning
ollama pull deepseek-r1:32b # 19 GB — chain-of-thought
# General
ollama pull llama3.1:8b # 4.9 GB — fast general tasks
# TTS
ollama pull sematre/orpheus:en # 4 GB — text-to-speech (8 voices)
# Verify
ollama list
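To confirm the pulls programmatically, the `/api/tags` endpoint from the Verify step returns a JSON object with a `models` array. The sketch below assumes Ollama is listening on the default port; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint from the Verify step


def installed_models(tags: dict) -> set[str]:
    """Extract model names from an /api/tags response body."""
    return {m["name"] for m in tags.get("models", [])}


def missing_models(tags: dict, wanted: list[str]) -> list[str]:
    """Return the wanted models that Ollama does not report as installed."""
    have = installed_models(tags)
    return [name for name in wanted if name not in have]


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Fetch and decode the tag list from a running Ollama server."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

For example, `missing_models(fetch_tags(), ["qwen2.5-coder:32b", "deepseek-r1:32b"])` should return an empty list once both pulls have finished.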
Note: With 24 GB of VRAM, Ollama can run the 32B models almost entirely on the GPU. On macOS (48 GB unified memory), the same models share memory between CPU and GPU. Expect noticeably faster inference on this machine for any model that fits in 24 GB.
VRAM Budget (RTX 5090 — 24 GB)
| Model | VRAM Usage | Fits in GPU? |
|---|---|---|
| llama3.1:8b | ~5 GB | ✅ Fully |
| qwen2.5-coder:7b | ~5 GB | ✅ Fully |
| sematre/orpheus:en | ~4 GB | ✅ Fully |
| qwen2.5-coder:32b | ~19 GB | ✅ Fully |
| deepseek-r1:32b | ~19 GB | ✅ Fully |
| Two 7B models simultaneously | ~10 GB | ✅ Both fit |
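The budget above is simple addition, sketched below for arbitrary model combinations. The per-model figures are the approximate values from the table; the `headroom_gb` default is an assumption to leave room for context (KV cache) and the desktop, not a measured number.

```python
VRAM_GB = 24  # RTX 5090 laptop GPU, per this guide

# Approximate figures copied from the VRAM budget table.
MODEL_VRAM_GB = {
    "llama3.1:8b": 5,
    "qwen2.5-coder:7b": 5,
    "sematre/orpheus:en": 4,
    "qwen2.5-coder:32b": 19,
    "deepseek-r1:32b": 19,
}


def fits(models: list[str], headroom_gb: float = 2.0) -> bool:
    """True if the models' combined estimate plus headroom fits in VRAM."""
    needed = sum(MODEL_VRAM_GB[m] for m in models)
    return needed + headroom_gb <= VRAM_GB
```

Under these assumptions the two small models fit together, while the two 32B models do not, so Ollama would have to swap between them or spill layers to system RAM.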
2. Whisper.cpp — Speech-to-Text
Option A: Pre-built Binary (Recommended)
Download the latest release from GitHub:
# Create whisper directory
mkdir "$env:USERPROFILE\whisper-cpp"
cd "$env:USERPROFILE\whisper-cpp"
# Download latest release (CUDA build)
# Check: https://github.com/ggerganov/whisper.cpp/releases
# Look for: whisper-cublas-bin-x64.zip or whisper-cuda-bin-x64.zip
Option B: Build from Source (CUDA)
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Download Whisper Model
mkdir "$env:USERPROFILE\whisper-models"
# Download ggml-large-v3-turbo (1.5 GB); curl.exe avoids the Invoke-WebRequest alias in Windows PowerShell 5.1
curl.exe -L -o "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" `
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
No corporate proxy on this machine, so download directly from huggingface.co. The hf-mirror.com workaround is only needed on the corporate MacBook.
Verify
# Test transcription
whisper-cli -m "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" -f test.wav
3. TTS — Orpheus + Qwen3-TTS
3a. Orpheus TTS (via Ollama)
Already handled in Step 1 (ollama pull sematre/orpheus:en).
3b. SNAC Decoder
# Create models directory (match macOS layout)
$MODELS = ".\models" # run from the repo root; $PSScriptRoot is empty in an interactive session
mkdir "$MODELS\snac_24khz" -Force
# Download SNAC decoder (curl.exe avoids the Invoke-WebRequest alias in Windows PowerShell 5.1)
curl.exe -L -o "$MODELS\snac_24khz\config.json" `
"https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json"
curl.exe -L -o "$MODELS\snac_24khz\pytorch_model.bin" `
"https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"
3c. Python Venv + Dependencies
cd __LOCAL_LLMs
# Create venv
python -m venv .venv-qwen-tts
# Activate (Windows uses Scripts, not bin)
.\.venv-qwen-tts\Scripts\Activate.ps1
# Install PyTorch with CUDA (NOT MPS, which is Apple-only).
# The RTX 5090 (Blackwell) needs a CUDA 12.8+ build; cu124 wheels do not include kernels for it.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Install other deps
pip install snac numpy soundfile
# Verify CUDA
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
# Expected: CUDA: True, Device: NVIDIA GeForce RTX 5090 Laptop GPU
3d. Qwen3-TTS 0.6B
$MODELS = ".\models"
# Tokenizer (~650 MB)
mkdir "$MODELS\Qwen3-TTS-Tokenizer-12Hz" -Force
foreach ($f in @("config.json", "configuration.json", "preprocessor_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\$f" `
        "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\model.safetensors" `
    "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/model.safetensors"
# Model weights (~1.8 GB)
mkdir "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice" -Force
foreach ($f in @("config.json", "generation_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\$f" `
        "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\model.safetensors" `
"https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/model.safetensors"
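All of the downloads in this section follow one URL pattern, `https://huggingface.co/<repo>/resolve/main/<file>`. As an alternative to the shell loops, the sketch below builds those URLs and fetches files with the Python standard library. The function names are illustrative, and the file lists are whatever the loops above name.

```python
from pathlib import Path
from urllib.request import urlretrieve

HF_BASE = "https://huggingface.co"


def resolve_url(repo: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a Hugging Face repo."""
    return f"{HF_BASE}/{repo}/resolve/{revision}/{filename}"


def download_all(repo: str, filenames: list[str], dest: Path) -> None:
    """Download each file into dest, skipping files that already exist."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        target = dest / name
        if target.exists():
            continue  # note: a partial download from an interrupted run is also skipped
        urlretrieve(resolve_url(repo, name), target)
```

For example, `download_all("Qwen/Qwen3-TTS-Tokenizer-12Hz", ["config.json"], Path("models/Qwen3-TTS-Tokenizer-12Hz"))` fetches one config file. Delete any truncated file before re-running, since the existence check does not verify size or hash.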
3e. Test TTS
# Activate venv
.\.venv-qwen-tts\Scripts\Activate.ps1
# Orpheus TTS test
python test_orpheus_tts.py
# Qwen3-TTS test
python test_qwen_tts.py
Key difference from macOS: Qwen3-TTS will use CUDA instead of MPS. In test_qwen_tts.py, the device selection torch.device("mps") will fall through to CUDA automatically since torch.backends.mps.is_available() returns False on Windows. You may want to update the device logic to prefer CUDA:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
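If the same test script should run unchanged on both machines, one option is a three-way preference order: CUDA, then MPS, then CPU. This is a minimal sketch, not the logic currently in test_qwen_tts.py; `pick_device_name` and `detect_device` are hypothetical names.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA (Windows/NVIDIA), then MPS (Apple Silicon), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Return a torch.device chosen with the preference order above."""
    import torch  # imported lazily so pick_device_name stays usable without PyTorch

    return torch.device(
        pick_device_name(
            torch.cuda.is_available(),
            torch.backends.mps.is_available(),
        )
    )
```

Keeping the decision in a pure function makes the preference order easy to check without a GPU, while `detect_device()` is the only part that touches PyTorch.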
4. Mission Control Dashboard
cd __LOCAL_LLMs\dashboard
# Install dependencies
npm install
# Start dev server
npm run dev
# Open http://localhost:3000
The dashboard is pure Next.js and works identically on Windows. The API routes auto-detect:
- Ollama at localhost:11434
- Whisper models in %USERPROFILE%\whisper-models\
- TTS engines (Orpheus, Qwen3-TTS) and Python venv
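To see what the Whisper check should find before starting the dashboard, the directory can be inspected by hand. The sketch below is not the dashboard's code (its API routes are Next.js). It only assumes that models live in `%USERPROFILE%\whisper-models\` and follow the `ggml-*.bin` naming used earlier in this guide.

```python
import os
from pathlib import Path


def whisper_model_dir() -> Path:
    """Locate the whisper-models folder under the user's profile directory."""
    home = os.environ.get("USERPROFILE") or str(Path.home())
    return Path(home) / "whisper-models"


def find_whisper_models(model_dir: Path) -> list[str]:
    """List ggml model files in model_dir, or an empty list if it is missing."""
    if not model_dir.is_dir():
        return []
    return sorted(p.name for p in model_dir.glob("ggml-*.bin"))
```

After the download step, `find_whisper_models(whisper_model_dir())` should include `ggml-large-v3-turbo.bin`.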
Start Script (PowerShell)
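There is no PowerShell start script yet, so run the equivalent of start-dashboard.sh by hand:
# Quick start (manual)
ollama serve # only if Ollama is not already running; use a separate terminal, since this command blocks
cd __LOCAL_LLMs\dashboard
npm run dev
TODO: Create start-dashboard.ps1 as a PowerShell equivalent of start-dashboard.sh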
5. Key Differences: macOS vs Windows
| Area | macOS (M4 Pro 48 GB) | Windows (Razer Blade 18) |
|---|---|---|
| GPU | Apple Silicon (unified memory, MPS) | RTX 5090 (24 GB VRAM, CUDA) |
| Ollama GPU | Automatic (Metal) | Automatic (CUDA) |
| VRAM | Shared from 48 GB RAM | Dedicated 24 GB GDDR7 |
| PyTorch device | mps | cuda |
| Whisper install | brew install whisper-cpp | Build from source or download release |
| Python venv | bin/activate | Scripts\Activate.ps1 |
| Package manager | Homebrew | winget / scoop |
| Shell | zsh / bash | PowerShell / cmd |
| Scripts | .sh (bash) | .ps1 (PowerShell) |
| Model download | hf-mirror.com (corporate proxy) | huggingface.co (no proxy) |
| Dashboard | Identical | Identical |
| Ollama models | Identical | Identical |
Performance Expectations
| Workload | macOS M4 Pro 48 GB | Razer RTX 5090 24 GB |
|---|---|---|
| qwen2.5-coder:32b inference | ~15–25 tok/s (MPS/CPU blend) | ~40–60 tok/s (full CUDA) |
| Whisper large-v3-turbo | ~2–4x realtime (CPU) | ~8–15x realtime (CUDA) |
| Orpheus TTS | ~realtime (CPU decode) | ~2–3x realtime (CUDA) |
| Qwen3-TTS | ~realtime (MPS) | ~2–4x realtime (CUDA) |
| 70B quantized models | Fits in 48 GB (slow) | Partially offloads to RAM |
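The "x realtime" figures convert to wall-clock time by dividing the audio length by the factor. The sketch below does that arithmetic for a range such as the 8 to 15x estimate for Whisper on CUDA. The factors are this guide's estimates, not benchmarks.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time needed to process audio at a given multiple of realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def estimate_range(audio_seconds: float, slow_factor: float,
                   fast_factor: float) -> tuple[float, float]:
    """Return (best case, worst case) processing time in seconds."""
    return (
        processing_seconds(audio_seconds, fast_factor),
        processing_seconds(audio_seconds, slow_factor),
    )
```

For a one-hour recording, `estimate_range(3600, 8, 15)` gives a range of 4 to 7.5 minutes on the Razer, against 15 to 30 minutes at the 2 to 4x CPU estimate for the Mac.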
6. File Layout (Same as macOS)
__LOCAL_LLMs/
├── dashboard/ ← Mission Control (port 3000) — works as-is
├── models/ ← TTS model weights (gitignored)
│ ├── snac_24khz/
│ ├── Qwen3-TTS-Tokenizer-12Hz/
│ └── Qwen3-TTS-12Hz-0.6B-CustomVoice/
├── .venv-qwen-tts/ ← Python venv (Scripts\ on Windows)
├── test_orpheus_tts.py ← works as-is (device fallback)
├── test_qwen_tts.py ← update device to prefer CUDA
├── windows_specific/
│ ├── razer-blade-18-spec.md ← hardware spec
│ └── setup-guide.md ← this file
└── docs/ ← macOS-focused docs (still useful as reference)
7. Quick Reference — Full Setup Checklist
[ ] Install NVIDIA drivers + CUDA Toolkit
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull models: qwen2.5-coder:32b, deepseek-r1:32b, llama3.1:8b, orpheus
[ ] Install Node.js 20+ (winget)
[ ] Install Python 3.12 (winget)
[ ] Install Git (winget)
[ ] Install ffmpeg (winget)
[ ] Clone repo
[ ] Download Whisper model to %USERPROFILE%\whisper-models\
[ ] Build or download whisper-cpp with CUDA
[ ] Create Python venv + install PyTorch CUDA + snac
[ ] Download SNAC decoder
[ ] Download Qwen3-TTS tokenizer + model
[ ] npm install in dashboard/
[ ] Run dashboard: npm run dev
[ ] Verify: http://localhost:3000 shows all green