# Windows Setup Guide — Local LLM Stack on Razer Blade 18 > **Hardware:** Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe > **OS:** Windows 11 Home > **Goal:** Mirror the macOS `__LOCAL_LLMs` stack — Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard > **See also:** [razer-blade-18-spec.md](razer-blade-18-spec.md) for full hardware specs --- ## Prerequisites ### 1. Windows Package Manager Install **winget** (ships with Windows 11) and optionally **Scoop** for CLI tools: ```powershell # Verify winget winget --version # Install Scoop (optional, useful for dev tools) Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression ``` ### 2. NVIDIA CUDA Toolkit The RTX 5090 needs the latest CUDA drivers for GPU-accelerated inference. ```powershell # Install NVIDIA drivers (latest Game Ready or Studio) winget install --id Nvidia.GeForceExperience # Install CUDA Toolkit (required for PyTorch CUDA) winget install --id Nvidia.CUDA # Or download from: https://developer.nvidia.com/cuda-downloads # Verify nvidia-smi ``` Expected output should show: - **RTX 5090** with **24 GB** VRAM - CUDA version 13.x+ ### 3. Node.js (for Mission Control Dashboard) ```powershell winget install --id OpenJS.NodeJS.LTS # Verify node --version # should be 20.x+ npm --version ``` ### 4. Python 3.12 ```powershell winget install --id Python.Python.3.12 # Verify python --version pip --version ``` ### 5. Git ```powershell winget install --id Git.Git ``` ### 6. ffmpeg ```powershell winget install --id Gyan.FFmpeg # Or: scoop install ffmpeg ``` --- ## 1. Ollama — LLM Server ### Install ```powershell winget install --id Ollama.Ollama ``` Ollama for Windows runs as a background service and automatically uses CUDA (RTX 5090). ### Verify ```powershell ollama --version curl http://localhost:11434/api/tags ``` ### Download Models ```powershell # Coding ollama pull qwen2.5-coder:32b # 19 GB — primary coding model ollama pull qwen2.5-coder:7b # 4.7 GB — fast coding # Reasoning ollama pull deepseek-r1:32b # 19 GB — chain-of-thought # General ollama pull llama3.1:8b # 4.9 GB — fast general tasks # TTS ollama pull sematre/orpheus:en # 4 GB — text-to-speech (8 voices) # Verify ollama list ``` > **Note:** With 24 GB VRAM, Ollama will offload 32B models almost entirely to GPU. > On macOS (48 GB unified), the 32B models run in shared CPU/GPU memory. > On this machine, **GPU inference will be significantly faster** for models that fit in 24 GB VRAM. ### VRAM Budget (RTX 5090 — 24 GB) | Model | VRAM Usage | Fits in GPU? | | ---------------------------- | ---------- | ------------ | | llama3.1:8b | ~5 GB | ✅ Fully | | qwen2.5-coder:7b | ~5 GB | ✅ Fully | | sematre/orpheus:en | ~4 GB | ✅ Fully | | qwen2.5-coder:32b | ~19 GB | ✅ Fully | | deepseek-r1:32b | ~19 GB | ✅ Fully | | Two 7B models simultaneously | ~10 GB | ✅ Both fit | --- ## 2. Whisper.cpp — Speech-to-Text ### Option A: Pre-built Binary (Recommended) Download the latest release from GitHub: ```powershell # Create whisper directory mkdir "$env:USERPROFILE\whisper-cpp" cd "$env:USERPROFILE\whisper-cpp" # Download latest release (CUDA build) # Check: https://github.com/ggerganov/whisper.cpp/releases # Look for: whisper-cublas-bin-x64.zip or whisper-cuda-bin-x64.zip ``` ### Option B: Build from Source (CUDA) ```powershell git clone https://github.com/ggerganov/whisper.cpp.git cd whisper.cpp cmake -B build -DGGML_CUDA=ON cmake --build build --config Release ``` ### Download Whisper Model ```powershell mkdir "$env:USERPROFILE\whisper-models" # Download ggml-large-v3-turbo (1.5 GB) curl -L -o "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" ` "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" ``` > **No corporate proxy on this machine** — download directly from `huggingface.co`. > The `hf-mirror.com` workaround is only needed on the corporate MacBook. ### Verify ```powershell # Test transcription whisper-cli -m "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" -f test.wav ``` --- ## 3. TTS — Orpheus + Qwen3-TTS ### 3a. Orpheus TTS (via Ollama) Already handled in Step 1 (`ollama pull sematre/orpheus:en`). ### 3b. SNAC Decoder ```powershell # Create models directory (match macOS layout) $MODELS = "$PSScriptRoot\models" # or wherever you clone the repo mkdir "$MODELS\snac_24khz" -Force # Download SNAC decoder curl -L -o "$MODELS\snac_24khz\config.json" ` "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json" curl -L -o "$MODELS\snac_24khz\pytorch_model.bin" ` "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin" ``` ### 3c. Python Venv + Dependencies ```powershell cd __LOCAL_LLMs # Create venv python -m venv .venv-qwen-tts # Activate (Windows uses Scripts, not bin) .\.venv-qwen-tts\Scripts\Activate.ps1 # Install PyTorch with CUDA (NOT MPS — that's Apple only) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 # Install other deps pip install snac numpy soundfile # Verify CUDA python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')" # Expected: CUDA: True, Device: NVIDIA GeForce RTX 5090 Laptop GPU ``` ### 3d. Qwen3-TTS 0.6B ```powershell $MODELS = ".\models" # Tokenizer (~650 MB) mkdir "$MODELS\Qwen3-TTS-Tokenizer-12Hz" -Force foreach ($f in @("config.json", "configuration.json", "preprocessor_config.json")) { curl -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\$f" ` "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/$f" } curl -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\model.safetensors" ` "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/model.safetensors" # Model weights (~1.8 GB) mkdir "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice" -Force foreach ($f in @("config.json", "generation_config.json")) { curl -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\$f" ` "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/$f" } curl -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\model.safetensors" ` "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/model.safetensors" ``` ### 3e. Test TTS ```powershell # Activate venv .\.venv-qwen-tts\Scripts\Activate.ps1 # Orpheus TTS test python test_orpheus_tts.py # Qwen3-TTS test python test_qwen_tts.py ``` > **Key difference from macOS:** Qwen3-TTS will use **CUDA** instead of MPS. > In `test_qwen_tts.py`, the device selection `torch.device("mps")` will fall through to CUDA automatically > since `torch.backends.mps.is_available()` returns False on Windows. > You may want to update the device logic to prefer CUDA: > > ```python > device = torch.device("cuda" if torch.cuda.is_available() else "cpu") > ``` --- ## 4. Mission Control Dashboard ```powershell cd __LOCAL_LLMs\dashboard # Install dependencies npm install # Start dev server npm run dev # Open http://localhost:3000 ``` The dashboard is pure Next.js — works identically on Windows. The API routes auto-detect: - **Ollama** at `localhost:11434` - **Whisper** models in `%USERPROFILE%\whisper-models\` - **TTS** engines (Orpheus, Qwen3-TTS) and Python venv ### Start Script (PowerShell) Use the bash script equivalent: ```powershell # Quick start (manual) ollama serve # if not already running as service cd __LOCAL_LLMs\dashboard npm run dev ``` > TODO: Create `start-dashboard.ps1` as a PowerShell equivalent of `start-dashboard.sh` --- ## 5. Key Differences: macOS vs Windows | Area | macOS (M4 Pro 48 GB) | Windows (Razer Blade 18) | | ------------------- | ----------------------------------- | ------------------------------------- | | **GPU** | Apple Silicon (unified memory, MPS) | RTX 5090 (24 GB VRAM, CUDA) | | **Ollama GPU** | Automatic (Metal) | Automatic (CUDA) | | **VRAM** | Shared from 48 GB RAM | Dedicated 24 GB GDDR7 | | **PyTorch device** | `mps` | `cuda` | | **Whisper install** | `brew install whisper-cpp` | Build from source or download release | | **Python venv** | `bin/activate` | `Scripts\Activate.ps1` | | **Package manager** | Homebrew | winget / scoop | | **Shell** | zsh / bash | PowerShell / cmd | | **Scripts** | `.sh` (bash) | `.ps1` (PowerShell) | | **Model download** | `hf-mirror.com` (corporate proxy) | `huggingface.co` (no proxy) | | **Dashboard** | Identical | Identical | | **Ollama models** | Identical | Identical | ### Performance Expectations | Workload | macOS M4 Pro 48 GB | Razer RTX 5090 24 GB | | --------------------------- | ---------------------------- | ------------------------- | | qwen2.5-coder:32b inference | ~15–25 tok/s (MPS/CPU blend) | ~40–60 tok/s (full CUDA) | | Whisper large-v3-turbo | ~2–4x realtime (CPU) | ~8–15x realtime (CUDA) | | Orpheus TTS | ~realtime (CPU decode) | ~2–3x realtime (CUDA) | | Qwen3-TTS | ~realtime (MPS) | ~2–4x realtime (CUDA) | | 70B quantized models | Fits in 48 GB (slow) | Partially offloads to RAM | --- ## 6. File Layout (Same as macOS) ``` __LOCAL_LLMs/ ├── dashboard/ ← Mission Control (port 3000) — works as-is ├── models/ ← TTS model weights (gitignored) │ ├── snac_24khz/ │ ├── Qwen3-TTS-Tokenizer-12Hz/ │ └── Qwen3-TTS-12Hz-0.6B-CustomVoice/ ├── .venv-qwen-tts/ ← Python venv (Scripts\ on Windows) ├── test_orpheus_tts.py ← works as-is (device fallback) ├── test_qwen_tts.py ← update device to prefer CUDA ├── windows_specific/ │ ├── razer-blade-18-spec.md ← hardware spec │ └── setup-guide.md ← this file └── docs/ ← macOS-focused docs (still useful as reference) ``` --- ## 7. Quick Reference — Full Setup Checklist ``` [ ] Install NVIDIA drivers + CUDA Toolkit [ ] Install Ollama (winget install Ollama.Ollama) [ ] Pull models: qwen2.5-coder:32b, deepseek-r1:32b, llama3.1:8b, orpheus [ ] Install Node.js 20+ (winget) [ ] Install Python 3.12 (winget) [ ] Install Git (winget) [ ] Install ffmpeg (winget) [ ] Clone repo [ ] Download Whisper model to %USERPROFILE%\whisper-models\ [ ] Build or download whisper-cpp with CUDA [ ] Create Python venv + install PyTorch CUDA + snac [ ] Download SNAC decoder [ ] Download Qwen3-TTS tokenizer + model [ ] npm install in dashboard/ [ ] Run dashboard: npm run dev [ ] Verify: http://localhost:3000 shows all green ```