learning_ai_common_plat/__LOCAL_LLMs/windows_specific/setup-guide.md
2026-02-21 14:13:07 -08:00

11 KiB
Raw Blame History

Windows Setup Guide — Local LLM Stack on Razer Blade 18

Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe OS: Windows 11 Home Goal: Mirror the macOS __LOCAL_LLMs stack — Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard See also: razer-blade-18-spec.md for full hardware specs


Prerequisites

1. Windows Package Manager

Install winget (ships with Windows 11) and optionally Scoop for CLI tools:

# Verify winget
winget --version

# Install Scoop (optional, useful for dev tools)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression

2. NVIDIA CUDA Toolkit

The RTX 5090 needs the latest CUDA drivers for GPU-accelerated inference.

# Install NVIDIA drivers (latest Game Ready or Studio)
winget install --id Nvidia.GeForceExperience

# Install CUDA Toolkit (required for PyTorch CUDA)
winget install --id Nvidia.CUDA
# Or download from: https://developer.nvidia.com/cuda-downloads

# Verify
nvidia-smi

Expected output should show:

  • RTX 5090 with 24 GB VRAM
  • CUDA version 13.x+

3. Node.js (for Mission Control Dashboard)

winget install --id OpenJS.NodeJS.LTS
# Verify
node --version   # should be 20.x+
npm --version

4. Python 3.12

winget install --id Python.Python.3.12
# Verify
python --version
pip --version

5. Git

winget install --id Git.Git

6. ffmpeg

winget install --id Gyan.FFmpeg
# Or: scoop install ffmpeg

1. Ollama — LLM Server

Install

winget install --id Ollama.Ollama

Ollama for Windows runs as a background service and automatically uses CUDA (RTX 5090).

Verify

ollama --version
curl http://localhost:11434/api/tags

Download Models

# Coding
ollama pull qwen2.5-coder:32b     # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b      # 4.7 GB — fast coding

# Reasoning
ollama pull deepseek-r1:32b       # 19 GB — chain-of-thought

# General
ollama pull llama3.1:8b            # 4.9 GB — fast general tasks

# TTS
ollama pull sematre/orpheus:en    # 4 GB — text-to-speech (8 voices)

# Verify
ollama list

Note: With 24 GB VRAM, Ollama will offload 32B models almost entirely to GPU. On macOS (48 GB unified), the 32B models run in shared CPU/GPU memory. On this machine, GPU inference will be significantly faster for models that fit in 24 GB VRAM.

VRAM Budget (RTX 5090 — 24 GB)

Model VRAM Usage Fits in GPU?
llama3.1:8b ~5 GB Fully
qwen2.5-coder:7b ~5 GB Fully
sematre/orpheus:en ~4 GB Fully
qwen2.5-coder:32b ~19 GB Fully
deepseek-r1:32b ~19 GB Fully
Two 7B models simultaneously ~10 GB Both fit

2. Whisper.cpp — Speech-to-Text

Download the latest release from GitHub:

# Create whisper directory
mkdir "$env:USERPROFILE\whisper-cpp"
cd "$env:USERPROFILE\whisper-cpp"

# Download latest release (CUDA build)
# Check: https://github.com/ggerganov/whisper.cpp/releases
# Look for: whisper-cublas-bin-x64.zip or whisper-cuda-bin-x64.zip

Option B: Build from Source (CUDA)

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Download Whisper Model

mkdir "$env:USERPROFILE\whisper-models"

# Download ggml-large-v3-turbo (1.5 GB)
curl -L -o "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" `
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"

No corporate proxy on this machine — download directly from huggingface.co. The hf-mirror.com workaround is only needed on the corporate MacBook.

Verify

# Test transcription
whisper-cli -m "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" -f test.wav

3. TTS — Orpheus + Qwen3-TTS

3a. Orpheus TTS (via Ollama)

Already handled in Step 1 (ollama pull sematre/orpheus:en).

3b. SNAC Decoder

# Create models directory (match macOS layout)
$MODELS = "$PSScriptRoot\models"   # or wherever you clone the repo
mkdir "$MODELS\snac_24khz" -Force

# Download SNAC decoder
curl -L -o "$MODELS\snac_24khz\config.json" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json"
curl -L -o "$MODELS\snac_24khz\pytorch_model.bin" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"

3c. Python Venv + Dependencies

cd __LOCAL_LLMs

# Create venv
python -m venv .venv-qwen-tts

# Activate (Windows uses Scripts, not bin)
.\.venv-qwen-tts\Scripts\Activate.ps1

# Install PyTorch with CUDA (NOT MPS — that's Apple only)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install other deps
pip install snac numpy soundfile

# Verify CUDA
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
# Expected: CUDA: True, Device: NVIDIA GeForce RTX 5090 Laptop GPU

3d. Qwen3-TTS 0.6B

$MODELS = ".\models"

# Tokenizer (~650 MB)
mkdir "$MODELS\Qwen3-TTS-Tokenizer-12Hz" -Force
foreach ($f in @("config.json", "configuration.json", "preprocessor_config.json")) {
    curl -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/$f"
}
curl -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/model.safetensors"

# Model weights (~1.8 GB)
mkdir "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice" -Force
foreach ($f in @("config.json", "generation_config.json")) {
    curl -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/$f"
}
curl -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/model.safetensors"

3e. Test TTS

# Activate venv
.\.venv-qwen-tts\Scripts\Activate.ps1

# Orpheus TTS test
python test_orpheus_tts.py

# Qwen3-TTS test
python test_qwen_tts.py

Key difference from macOS: Qwen3-TTS will use CUDA instead of MPS. In test_qwen_tts.py, the device selection torch.device("mps") will fall through to CUDA automatically since torch.backends.mps.is_available() returns False on Windows. You may want to update the device logic to prefer CUDA:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

4. Mission Control Dashboard

cd __LOCAL_LLMs\dashboard

# Install dependencies
npm install

# Start dev server
npm run dev
# Open http://localhost:3000

The dashboard is pure Next.js — works identically on Windows. The API routes auto-detect:

  • Ollama at localhost:11434
  • Whisper models in %USERPROFILE%\whisper-models\
  • TTS engines (Orpheus, Qwen3-TTS) and Python venv

Start Script (PowerShell)

Use the bash script equivalent:

# Quick start (manual)
ollama serve    # if not already running as service
cd __LOCAL_LLMs\dashboard
npm run dev

TODO: Create start-dashboard.ps1 as a PowerShell equivalent of start-dashboard.sh


5. Key Differences: macOS vs Windows

Area macOS (M4 Pro 48 GB) Windows (Razer Blade 18)
GPU Apple Silicon (unified memory, MPS) RTX 5090 (24 GB VRAM, CUDA)
Ollama GPU Automatic (Metal) Automatic (CUDA)
VRAM Shared from 48 GB RAM Dedicated 24 GB GDDR7
PyTorch device mps cuda
Whisper install brew install whisper-cpp Build from source or download release
Python venv bin/activate Scripts\Activate.ps1
Package manager Homebrew winget / scoop
Shell zsh / bash PowerShell / cmd
Scripts .sh (bash) .ps1 (PowerShell)
Model download hf-mirror.com (corporate proxy) huggingface.co (no proxy)
Dashboard Identical Identical
Ollama models Identical Identical

Performance Expectations

Workload macOS M4 Pro 48 GB Razer RTX 5090 24 GB
qwen2.5-coder:32b inference ~1525 tok/s (MPS/CPU blend) ~4060 tok/s (full CUDA)
Whisper large-v3-turbo ~24x realtime (CPU) ~815x realtime (CUDA)
Orpheus TTS ~realtime (CPU decode) ~23x realtime (CUDA)
Qwen3-TTS ~realtime (MPS) ~24x realtime (CUDA)
70B quantized models Fits in 48 GB (slow) Partially offloads to RAM

6. File Layout (Same as macOS)

__LOCAL_LLMs/
├── dashboard/                       ← Mission Control (port 3000) — works as-is
├── models/                          ← TTS model weights (gitignored)
│   ├── snac_24khz/
│   ├── Qwen3-TTS-Tokenizer-12Hz/
│   └── Qwen3-TTS-12Hz-0.6B-CustomVoice/
├── .venv-qwen-tts/                  ← Python venv (Scripts\ on Windows)
├── test_orpheus_tts.py              ← works as-is (device fallback)
├── test_qwen_tts.py                 ← update device to prefer CUDA
├── windows_specific/
│   ├── razer-blade-18-spec.md       ← hardware spec
│   └── setup-guide.md              ← this file
└── docs/                            ← macOS-focused docs (still useful as reference)

7. Quick Reference — Full Setup Checklist

[ ] Install NVIDIA drivers + CUDA Toolkit
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull models: qwen2.5-coder:32b, deepseek-r1:32b, llama3.1:8b, orpheus
[ ] Install Node.js 20+ (winget)
[ ] Install Python 3.12 (winget)
[ ] Install Git (winget)
[ ] Install ffmpeg (winget)
[ ] Clone repo
[ ] Download Whisper model to %USERPROFILE%\whisper-models\
[ ] Build or download whisper-cpp with CUDA
[ ] Create Python venv + install PyTorch CUDA + snac
[ ] Download SNAC decoder
[ ] Download Qwen3-TTS tokenizer + model
[ ] npm install in dashboard/
[ ] Run dashboard: npm run dev
[ ] Verify: http://localhost:3000 shows all green