# Windows Setup Guide — Local LLM Stack on Razer Blade 18

> **Hardware:** Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
> **OS:** Windows 11 Home
> **Goal:** Mirror the macOS `__LOCAL_LLMs` stack — Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
> **See also:** [razer-blade-18-spec.md](razer-blade-18-spec.md) for full hardware specs

---

## Prerequisites

### 1. Windows Package Manager

**winget** ships with Windows 11, so it only needs verifying. **Scoop** is optional and handy for CLI tools:

```powershell
# Verify winget
winget --version

# Install Scoop (optional, useful for dev tools)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression
```

### 2. NVIDIA CUDA Toolkit

The RTX 5090 is a Blackwell GPU; GPU-accelerated inference needs a recent NVIDIA driver with CUDA 12.8 or later.

```powershell
# Install GeForce Experience, then use it to install the latest Game Ready or Studio driver
winget install --id Nvidia.GeForceExperience

# Install CUDA Toolkit (needed to build whisper.cpp with CUDA; PyTorch wheels bundle their own CUDA runtime)
winget install --id Nvidia.CUDA
# Or download from: https://developer.nvidia.com/cuda-downloads

# Verify
nvidia-smi
```

Expected output should show:

- **RTX 5090** with **24 GB** VRAM
- CUDA version 12.8 or later (13.x with current drivers)
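
The same check can be scripted. The sketch below assumes `nvidia-smi` is on `PATH` and uses its `--query-gpu` CSV output; the function names `parse_gpu_line` and `query_gpus` are made up for this example.

```python
import subprocess

# nvidia-smi query: one CSV line per GPU, memory reported in MiB without units
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse '<name>, <memory MiB>, <driver version>' into a dict."""
    # rsplit keeps the name intact even if it ever contains a comma
    name, mem_mib, driver = [part.strip() for part in line.rsplit(",", 2)]
    return {"name": name, "vram_gib": int(mem_mib) / 1024, "driver": driver}


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per detected GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpus()` on this machine should return a single entry whose name contains "RTX 5090" and whose `vram_gib` is close to 24.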

### 3. Node.js (for Mission Control Dashboard)

```powershell
winget install --id OpenJS.NodeJS.LTS
# Verify
node --version   # should be 20.x+
npm --version
```

### 4. Python 3.12

```powershell
winget install --id Python.Python.3.12
# Verify
python --version
pip --version
```

### 5. Git

```powershell
winget install --id Git.Git
```

### 6. ffmpeg

```powershell
winget install --id Gyan.FFmpeg
# Or: scoop install ffmpeg
```

---

## 1. Ollama — LLM Server

### Install

```powershell
winget install --id Ollama.Ollama
```

Ollama for Windows runs in the background as a tray app and uses CUDA on the RTX 5090 automatically.

### Verify

```powershell
ollama --version
curl.exe http://localhost:11434/api/tags   # curl.exe, not the Windows PowerShell 5.1 "curl" alias
```
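
For a scripted check, the `/api/tags` endpoint returns JSON with a `models` list whose entries carry a `name` field. This is a minimal sketch using only the standard library; `model_names` and `fetch_tags` are illustrative helper names, not part of the repo.

```python
import json
import urllib.request


def model_names(payload: dict) -> list[str]:
    """Extract the model names from an /api/tags response body."""
    return sorted(m["name"] for m in payload.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

`model_names(fetch_tags())` returns an empty list on a fresh install and the pulled model tags once the downloads in the next step have finished.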

### Download Models

```powershell
# Coding
ollama pull qwen2.5-coder:32b     # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b      # 4.7 GB — fast coding

# Reasoning
ollama pull deepseek-r1:32b       # 19 GB — chain-of-thought

# General
ollama pull llama3.1:8b            # 4.9 GB — fast general tasks

# TTS
ollama pull sematre/orpheus:en    # 4 GB — text-to-speech (8 voices)

# Verify
ollama list
```

> **Note:** The 32B models need about 19 GB for weights, so they fit in the 24 GB of VRAM at default context lengths. A longer context grows the KV cache and can push some layers onto the CPU.
> On macOS (48 GB unified), the 32B models run in memory shared between CPU and GPU.
> On this machine, **GPU inference should be significantly faster** for any model that fits in VRAM.

### VRAM Budget (RTX 5090 — 24 GB)

| Model                        | VRAM Usage | Fits in GPU? |
| ---------------------------- | ---------- | ------------ |
| llama3.1:8b                  | ~5 GB      | ✅ Fully     |
| qwen2.5-coder:7b             | ~5 GB      | ✅ Fully     |
| sematre/orpheus:en           | ~4 GB      | ✅ Fully     |
| qwen2.5-coder:32b            | ~19 GB     | ✅ Fully     |
| deepseek-r1:32b              | ~19 GB     | ✅ Fully     |
| Two 7B models simultaneously | ~10 GB     | ✅ Both fit  |
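
The arithmetic behind the table can be sketched as follows, using the approximate footprints listed above. The 2 GB headroom for KV cache and display output is an assumption chosen for illustration, not a measured value.

```python
VRAM_GB = 24  # RTX 5090 Laptop GPU

# Approximate footprints from the VRAM budget table (GB)
MODEL_GB = {
    "llama3.1:8b": 5,
    "qwen2.5-coder:7b": 5,
    "sematre/orpheus:en": 4,
    "qwen2.5-coder:32b": 19,
    "deepseek-r1:32b": 19,
}


def total_gb(models: list[str]) -> float:
    """Sum the approximate footprint of the given models."""
    return sum(MODEL_GB[m] for m in models)


def fits(models: list[str], headroom_gb: float = 2.0) -> bool:
    """True if the models plus an assumed headroom fit in VRAM together."""
    return total_gb(models) + headroom_gb <= VRAM_GB
```

Under these numbers either 32B model fits on its own, but the two together need about 38 GB, so they cannot both stay resident on the GPU.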

---

## 2. Whisper.cpp — Speech-to-Text

### Option A: Pre-built Binary (Recommended)

Download the latest release from GitHub:

```powershell
# Create whisper directory
mkdir "$env:USERPROFILE\whisper-cpp"
cd "$env:USERPROFILE\whisper-cpp"

# Download latest release (CUDA build)
# Check: https://github.com/ggerganov/whisper.cpp/releases
# Look for: whisper-cublas-bin-x64.zip or whisper-cuda-bin-x64.zip
```

### Option B: Build from Source (CUDA)

```powershell
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

### Download Whisper Model

```powershell
mkdir "$env:USERPROFILE\whisper-models"

# Download ggml-large-v3-turbo (1.5 GB)
# Use curl.exe: in Windows PowerShell 5.1, plain "curl" is an alias for Invoke-WebRequest
curl.exe -L -o "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" `
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
```

> **No corporate proxy on this machine** — download directly from `huggingface.co`.
> The `hf-mirror.com` workaround is only needed on the corporate MacBook.
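
A large download that was interrupted or replaced by an error page is easy to miss. One way to check it, sketched here with the standard library, is to print the file size and SHA-256 and compare the hash with the one shown on the file's Hugging Face page. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models are not loaded into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path) -> str:
    """Return '<name>: <size> GB, sha256=<hex>' for a downloaded file."""
    p = Path(path)
    size_gb = p.stat().st_size / 1e9
    return f"{p.name}: {size_gb:.2f} GB, sha256={sha256_of(p)}"
```

For example, `describe(Path.home() / "whisper-models" / "ggml-large-v3-turbo.bin")` should report a size of roughly 1.5 GB; a file of a few kilobytes usually means the download failed.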

### Verify

```powershell
# Test transcription (whisper-cli must be on PATH; a source build typically places it under .\build\bin\Release\)
whisper-cli -m "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" -f test.wav
```
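
The verify command assumes a `test.wav` already exists. The whisper.cpp README converts input audio to 16 kHz mono 16-bit PCM with ffmpeg, and the sketch below wraps that same command. The function names and the input file name are placeholders.

```python
import subprocess


def ffmpeg_wav16k_args(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command for 16 kHz, mono, 16-bit PCM WAV output."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it exists
        "-i", src,
        "-ar", "16000",          # sample rate
        "-ac", "1",              # mono
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM
        dst,
    ]


def convert(src: str, dst: str = "test.wav") -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_wav16k_args(src, dst), check=True)
```

`convert("recording.m4a")` would write `test.wav` in the current directory, ready for the `whisper-cli` call above.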

---

## 3. TTS — Orpheus + Qwen3-TTS

### 3a. Orpheus TTS (via Ollama)

Already handled in Section 1 (`ollama pull sematre/orpheus:en`).

### 3b. SNAC Decoder

```powershell
# Create models directory (match macOS layout); run this from the repo root
# ($PSScriptRoot is empty in an interactive session, so use a relative path)
$MODELS = ".\models"
mkdir "$MODELS\snac_24khz" -Force

# Download SNAC decoder (curl.exe, not the Windows PowerShell 5.1 "curl" alias)
curl.exe -L -o "$MODELS\snac_24khz\config.json" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json"
curl.exe -L -o "$MODELS\snac_24khz\pytorch_model.bin" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"
```

### 3c. Python Venv + Dependencies

```powershell
cd __LOCAL_LLMs

# Create venv
python -m venv .venv-qwen-tts

# Activate (Windows uses Scripts, not bin)
.\.venv-qwen-tts\Scripts\Activate.ps1

# Install PyTorch with CUDA 12.8 wheels (cu124 builds lack RTX 5090 kernels; MPS is Apple-only)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install other deps
pip install snac numpy soundfile

# Verify CUDA
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
# Expected: CUDA: True, Device: NVIDIA GeForce RTX 5090 Laptop GPU
```

### 3d. Qwen3-TTS 0.6B

```powershell
$MODELS = ".\models"

# Use curl.exe throughout: in Windows PowerShell 5.1, plain "curl" is an alias for Invoke-WebRequest

# Tokenizer (~650 MB)
mkdir "$MODELS\Qwen3-TTS-Tokenizer-12Hz" -Force
foreach ($f in @("config.json", "configuration.json", "preprocessor_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/model.safetensors"

# Model weights (~1.8 GB)
mkdir "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice" -Force
foreach ($f in @("config.json", "generation_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/model.safetensors"
```
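
Before running the TTS tests it can help to confirm that every file landed. The sketch below checks the file names used in the download commands above; `EXPECTED` and `missing_files` are names invented for this example.

```python
from pathlib import Path

# Folder -> files, taken from the download commands in this section
EXPECTED = {
    "Qwen3-TTS-Tokenizer-12Hz": [
        "config.json",
        "configuration.json",
        "preprocessor_config.json",
        "model.safetensors",
    ],
    "Qwen3-TTS-12Hz-0.6B-CustomVoice": [
        "config.json",
        "generation_config.json",
        "model.safetensors",
    ],
}


def missing_files(models_dir, expected: dict = EXPECTED) -> list[str]:
    """List expected files that are absent or empty under models_dir."""
    root = Path(models_dir)
    missing = []
    for folder, names in expected.items():
        for name in names:
            path = root / folder / name
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(f"{folder}/{name}")
    return missing
```

`missing_files("models")` run from the repo root returns an empty list when all seven files are present and non-empty.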

### 3e. Test TTS

```powershell
# Activate venv
.\.venv-qwen-tts\Scripts\Activate.ps1

# Orpheus TTS test
python test_orpheus_tts.py

# Qwen3-TTS test
python test_qwen_tts.py
```

> **Key difference from macOS:** Qwen3-TTS runs on **CUDA** instead of MPS.
> `torch.backends.mps.is_available()` always returns False on Windows, so the `torch.device("mps")` branch in `test_qwen_tts.py` is never taken.
> Whether the script then lands on CUDA or on CPU depends on its fallback branch, so make the preference for CUDA explicit:
>
> ```python
> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
> ```
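
If the same script has to run on both machines, the preference order can be written once. This is a sketch of one possible ordering (CUDA, then MPS, then CPU), not the logic currently in `test_qwen_tts.py`; `pick_device` and `detect_device` are hypothetical names.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device name: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; falls back to CPU without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on very old builds
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```

In a script, `device = torch.device(detect_device())` would then resolve to `cuda` on the Razer and `mps` on the MacBook.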

---

## 4. Mission Control Dashboard

```powershell
cd __LOCAL_LLMs\dashboard

# Install dependencies
npm install

# Start dev server
npm run dev
# Open http://localhost:3000
```

The dashboard is pure Next.js — works identically on Windows. The API routes auto-detect:

- **Ollama** at `localhost:11434`
- **Whisper** models in `%USERPROFILE%\whisper-models\`
- **TTS** engines (Orpheus, Qwen3-TTS) and Python venv

### Start Script (PowerShell)

Until a PowerShell script exists, run the steps from `start-dashboard.sh` by hand:

```powershell
# Quick start (manual)
# "ollama serve" blocks the terminal; run it in a separate window, and only if Ollama is not already running
ollama serve
cd __LOCAL_LLMs\dashboard
npm run dev
```

> TODO: Create `start-dashboard.ps1` as a PowerShell equivalent of `start-dashboard.sh`
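
Until that script is written, the start sequence can be sketched in Python: check whether Ollama is listening on port 11434, start it if not, then launch the dev server. This is an illustration of the flow rather than a port of `start-dashboard.sh`; the function names, the 30-second wait, and the default path are assumptions.

```python
import socket
import subprocess
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main(dashboard_dir: str = r"__LOCAL_LLMs\dashboard") -> None:
    """Start Ollama if needed, then run the dashboard dev server."""
    if not port_open("localhost", 11434):
        subprocess.Popen(["ollama", "serve"])  # runs alongside this script
        for _ in range(30):                    # wait up to ~30 s for startup
            if port_open("localhost", 11434):
                break
            time.sleep(1)
    # shell=True lets Windows resolve npm.cmd; the command string is fixed
    subprocess.run("npm run dev", cwd=dashboard_dir, shell=True, check=True)
```

Calling `main()` from the directory that contains `__LOCAL_LLMs` blocks until the dev server is stopped with Ctrl+C.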

---

## 5. Key Differences: macOS vs Windows

| Area                | macOS (M4 Pro 48 GB)                | Windows (Razer Blade 18)              |
| ------------------- | ----------------------------------- | ------------------------------------- |
| **GPU**             | Apple Silicon (unified memory, MPS) | RTX 5090 (24 GB VRAM, CUDA)           |
| **Ollama GPU**      | Automatic (Metal)                   | Automatic (CUDA)                      |
| **VRAM**            | Shared from 48 GB RAM               | Dedicated 24 GB GDDR7                 |
| **PyTorch device**  | `mps`                               | `cuda`                                |
| **Whisper install** | `brew install whisper-cpp`          | Build from source or download release |
| **Python venv**     | `bin/activate`                      | `Scripts\Activate.ps1`                |
| **Package manager** | Homebrew                            | winget / scoop                        |
| **Shell**           | zsh / bash                          | PowerShell / cmd                      |
| **Scripts**         | `.sh` (bash)                        | `.ps1` (PowerShell)                   |
| **Model download**  | `hf-mirror.com` (corporate proxy)   | `huggingface.co` (no proxy)           |
| **Dashboard**       | Identical                           | Identical                             |
| **Ollama models**   | Identical                           | Identical                             |

### Performance Expectations

| Workload                    | macOS M4 Pro 48 GB           | Razer RTX 5090 24 GB      |
| --------------------------- | ---------------------------- | ------------------------- |
| qwen2.5-coder:32b inference | ~15–25 tok/s (Metal)         | ~40–60 tok/s (full CUDA)  |
| Whisper large-v3-turbo      | ~2–4x realtime (CPU)         | ~8–15x realtime (CUDA)    |
| Orpheus TTS                 | ~realtime (CPU decode)       | ~2–3x realtime (CUDA)     |
| Qwen3-TTS                   | ~realtime (MPS)              | ~2–4x realtime (CUDA)     |
| 70B quantized models        | Fits in 48 GB (slow)         | Partially offloads to RAM |
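
To replace the estimates for LLM inference with measured numbers, Ollama's `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch follows; the helper names and the 600-second timeout are choices made for this example.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> dict:
    """POST a non-streaming request to /api/generate and return the JSON body."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

For example, `tokens_per_second(generate("qwen2.5-coder:32b", "Write a quicksort in Python."))` gives a figure that can be compared directly with the first row of the table.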

---

## 6. File Layout (Same as macOS)

```
__LOCAL_LLMs/
├── dashboard/                       ← Mission Control (port 3000) — works as-is
├── models/                          ← TTS model weights (gitignored)
│   ├── snac_24khz/
│   ├── Qwen3-TTS-Tokenizer-12Hz/
│   └── Qwen3-TTS-12Hz-0.6B-CustomVoice/
├── .venv-qwen-tts/                  ← Python venv (Scripts\ on Windows)
├── test_orpheus_tts.py              ← works as-is (device fallback)
├── test_qwen_tts.py                 ← update device to prefer CUDA
├── windows_specific/
│   ├── razer-blade-18-spec.md       ← hardware spec
│   └── setup-guide.md              ← this file
└── docs/                            ← macOS-focused docs (still useful as reference)
```

---

## 7. Quick Reference — Full Setup Checklist

```
[ ] Install NVIDIA drivers + CUDA Toolkit
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull models: qwen2.5-coder:32b, qwen2.5-coder:7b, deepseek-r1:32b, llama3.1:8b, sematre/orpheus:en
[ ] Install Node.js 20+ (winget)
[ ] Install Python 3.12 (winget)
[ ] Install Git (winget)
[ ] Install ffmpeg (winget)
[ ] Clone repo
[ ] Download Whisper model to %USERPROFILE%\whisper-models\
[ ] Build or download whisper-cpp with CUDA
[ ] Create Python venv + install PyTorch CUDA + snac
[ ] Download SNAC decoder
[ ] Download Qwen3-TTS tokenizer + model
[ ] npm install in dashboard/
[ ] Run dashboard: npm run dev
[ ] Verify: http://localhost:3000 shows all green
```