saravanakumardb1 f85b455eb5 ci: update CI/CD configuration

2026-02-21 14:13:07 -08:00

11 KiB


Windows Setup Guide — Local LLM Stack on Razer Blade 18

Hardware: Razer Blade 18 · Intel Core Ultra 9 275HX · RTX 5090 24 GB GDDR7 · 64 GB DDR5 · 4 TB NVMe
OS: Windows 11 Home
Goal: Mirror the macOS __LOCAL_LLMs stack: Ollama, Whisper, TTS (Orpheus + Qwen3), Mission Control dashboard
See also: razer-blade-18-spec.md for full hardware specs

Prerequisites

1. Windows Package Manager

winget ships with Windows 11, so it only needs to be verified. Scoop is optional but handy for CLI tools:

# Verify winget
winget --version

# Install Scoop (optional, useful for dev tools)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression

2. NVIDIA CUDA Toolkit

The RTX 5090 needs the latest CUDA drivers for GPU-accelerated inference.

# Install GeForce Experience, which is then used to install the latest Game Ready or Studio driver
winget install --id Nvidia.GeForceExperience

# Install CUDA Toolkit (required for PyTorch CUDA)
winget install --id Nvidia.CUDA
# Or download from: https://developer.nvidia.com/cuda-downloads

# Verify
nvidia-smi

Expected output should show:

RTX 5090 with 24 GB VRAM
CUDA version 13.x+
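
The GPU name and VRAM can also be checked from a script. The following is a minimal sketch that assumes `nvidia-smi` is on PATH and uses its `--query-gpu` and `--format=csv,noheader` options; the function names are illustrative. The CUDA version is shown in the banner of the plain `nvidia-smi` output and is not parsed here.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: name, total memory, driver version.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"]


def parse_gpu_line(line: str) -> dict:
    """Split one CSV row such as 'NAME, 24463 MiB, 576.02' into fields."""
    name, memory, driver = [part.strip() for part in line.split(",")]
    return {
        "name": name,
        "memory_mib": int(memory.split()[0]),  # drop the ' MiB' unit
        "driver": driver,
    }


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per detected GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpus()` on this machine should return a single entry whose `memory_mib` is close to 24 GB.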

3. Node.js (for Mission Control Dashboard)

winget install --id OpenJS.NodeJS.LTS
# Verify
node --version   # should be 20.x+
npm --version

4. Python 3.12

winget install --id Python.Python.3.12
# Verify
python --version
pip --version

5. Git

winget install --id Git.Git

6. ffmpeg

winget install --id Gyan.FFmpeg
# Or: scoop install ffmpeg
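
Once the prerequisites are installed, a short script can confirm that each command resolves on PATH. This is a sketch only: the tool list is inferred from the steps above and `missing_tools` is a hypothetical helper name. Open a new terminal first, since PATH changes made by installers are usually not visible in an already-open session.

```python
import shutil

# Commands this guide expects to find on PATH after the prerequisite steps.
REQUIRED = ["winget", "nvidia-smi", "node", "npm", "python", "pip", "git", "ffmpeg"]


def missing_tools(tools=REQUIRED, which=shutil.which):
    """Return the commands that cannot be resolved on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Running `print(missing_tools() or "all prerequisites found")` gives a quick pass/fail summary.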

1. Ollama — LLM Server

Install

winget install --id Ollama.Ollama

Ollama for Windows runs in the background (as a system tray app) and uses CUDA automatically when it detects the RTX 5090.

Verify

ollama --version
curl.exe http://localhost:11434/api/tags   # curl.exe, because "curl" is an alias for Invoke-WebRequest in Windows PowerShell 5.1
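
The same check can be made from Python, which is convenient for scripted health checks. The sketch below assumes Ollama is listening on its default address and that `/api/tags` returns a JSON object with a `models` list whose entries carry a `name` field; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def model_names(payload: dict) -> list[str]:
    """Extract the model names from an /api/tags response body."""
    return sorted(model["name"] for model in payload.get("models", []))


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

An empty list from `installed_models()` means the server is up but no models have been pulled yet; a connection error means Ollama is not running.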

Download Models

# Coding
ollama pull qwen2.5-coder:32b     # 19 GB — primary coding model
ollama pull qwen2.5-coder:7b      # 4.7 GB — fast coding

# Reasoning
ollama pull deepseek-r1:32b       # 19 GB — chain-of-thought

# General
ollama pull llama3.1:8b            # 4.9 GB — fast general tasks

# TTS
ollama pull sematre/orpheus:en    # 4 GB — text-to-speech (8 voices)

# Verify
ollama list

Note: At roughly 19 GB, the 32B models fit in the 24 GB of VRAM, so Ollama can run them on the GPU. Long contexts add KV-cache memory and may push some layers back to the CPU. On macOS (48 GB unified memory), the same models share memory with the CPU. Expect significantly faster inference on this machine for any model that fits in VRAM.

VRAM Budget (RTX 5090 — 24 GB)

| Model | VRAM Usage | Fits in GPU? |
|---|---|---|
| llama3.1:8b | ~5 GB | ✅ Fully |
| qwen2.5-coder:7b | ~5 GB | ✅ Fully |
| sematre/orpheus:en | ~4 GB | ✅ Fully |
| qwen2.5-coder:32b | ~19 GB | ✅ Fully |
| deepseek-r1:32b | ~19 GB | ✅ Fully |
| Two 7B models simultaneously | ~10 GB | ✅ Both fit |
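
The budget is simple addition, so it is easy to test other combinations. The sketch below uses the approximate figures from the table above; the 2 GB default headroom for context (KV cache) and the desktop is an assumption, not a measured value.

```python
# Approximate VRAM figures in GB, copied from the table above.
VRAM_GB = {
    "llama3.1:8b": 5,
    "qwen2.5-coder:7b": 5,
    "sematre/orpheus:en": 4,
    "qwen2.5-coder:32b": 19,
    "deepseek-r1:32b": 19,
}
GPU_VRAM_GB = 24


def fits_together(models, headroom_gb=2):
    """True if the models' combined estimate still leaves headroom_gb free."""
    total = sum(VRAM_GB[name] for name in models)
    return total + headroom_gb <= GPU_VRAM_GB
```

For example, a 32B model plus a 7B model adds up to about 24 GB, which leaves no headroom, so Ollama would likely have to unload one of them or spill layers to system RAM.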

2. Whisper.cpp — Speech-to-Text

Option A: Pre-built Binary (Recommended)

Download the latest release from GitHub:

# Create whisper directory
mkdir "$env:USERPROFILE\whisper-cpp"
cd "$env:USERPROFILE\whisper-cpp"

# Download latest release (CUDA build)
# Check: https://github.com/ggerganov/whisper.cpp/releases
# Look for: whisper-cublas-bin-x64.zip or whisper-cuda-bin-x64.zip

Option B: Build from Source (CUDA)

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Download Whisper Model

mkdir "$env:USERPROFILE\whisper-models"

# Download ggml-large-v3-turbo (1.5 GB)
# Use curl.exe: in Windows PowerShell 5.1, "curl" is an alias for Invoke-WebRequest
curl.exe -L -o "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" `
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"

This machine has no corporate proxy, so download directly from huggingface.co. The hf-mirror.com workaround is only needed on the corporate MacBook.

Verify

# Test transcription (whisper-cli.exe must be on PATH; a source build typically places it under build\bin\Release)
whisper-cli -m "$env:USERPROFILE\whisper-models\ggml-large-v3-turbo.bin" -f test.wav
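
If the transcription fails or comes back empty, the input format is worth checking first. whisper.cpp has traditionally expected 16 kHz, mono, 16-bit WAV input (newer builds may accept more formats). The sketch below assumes that layout and inspects a file with Python's standard `wave` module; the helper names are illustrative.

```python
import wave


def wav_summary(path: str) -> dict:
    """Read the header of a WAV file and report its basic format."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits": wav.getsampwidth() * 8,
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def is_whisper_friendly(path: str) -> bool:
    """Check for the 16 kHz, mono, 16-bit layout assumed in the text above."""
    info = wav_summary(path)
    return (info["channels"], info["sample_rate"], info["bits"]) == (1, 16000, 16)
```

A file in another format can be converted with the ffmpeg installed earlier, for example `ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav`.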

3. TTS — Orpheus + Qwen3-TTS

3a. Orpheus TTS (via Ollama)

Already handled in Section 1 (ollama pull sematre/orpheus:en).

3b. SNAC Decoder

# Create models directory (match macOS layout)
# Run from the repo root; $PSScriptRoot is empty in an interactive session
$MODELS = ".\models"
mkdir "$MODELS\snac_24khz" -Force

# Download SNAC decoder (curl.exe, not the PowerShell "curl" alias)
curl.exe -L -o "$MODELS\snac_24khz\config.json" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/config.json"
curl.exe -L -o "$MODELS\snac_24khz\pytorch_model.bin" `
  "https://huggingface.co/hubertsiuzdak/snac_24khz/resolve/main/pytorch_model.bin"

3c. Python Venv + Dependencies

cd __LOCAL_LLMs

# Create venv
python -m venv .venv-qwen-tts

# Activate (Windows uses Scripts, not bin)
.\.venv-qwen-tts\Scripts\Activate.ps1

# Install PyTorch with CUDA (NOT MPS, which is Apple-only)
# The RTX 50-series (Blackwell) needs the CUDA 12.8+ wheels; cu124 builds lack kernels for it
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install other deps
pip install snac numpy soundfile

# Verify CUDA
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
# Expected: CUDA: True, Device: NVIDIA GeForce RTX 5090 Laptop GPU
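
`torch.cuda.is_available()` can return True even when the installed wheel has no kernels for the GPU. RTX 50-series (Blackwell) cards are generally reported to need a PyTorch build compiled against CUDA 12.8 or newer. The sketch below treats that minimum as an assumption and compares it with the version string PyTorch reports in `torch.version.cuda`; `cuda_build_ok` is a hypothetical helper name.

```python
MIN_CUDA_FOR_BLACKWELL = (12, 8)  # assumption stated in the lead-in


def cuda_build_ok(cuda_version, minimum=MIN_CUDA_FOR_BLACKWELL):
    """Compare a 'major.minor' CUDA version string with the minimum."""
    if cuda_version is None:  # CPU-only PyTorch builds report None
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= minimum
```

Inside the activated venv, `import torch; print(cuda_build_ok(torch.version.cuda))` shows whether the installed wheel meets that minimum.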

3d. Qwen3-TTS 0.6B

$MODELS = ".\models"

# Tokenizer (~650 MB)
mkdir "$MODELS\Qwen3-TTS-Tokenizer-12Hz" -Force
# Use curl.exe: in Windows PowerShell 5.1, "curl" is an alias for Invoke-WebRequest
foreach ($f in @("config.json", "configuration.json", "preprocessor_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-Tokenizer-12Hz\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz/resolve/main/model.safetensors"

# Model weights (~1.8 GB)
mkdir "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice" -Force
foreach ($f in @("config.json", "generation_config.json")) {
    curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\$f" `
      "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/$f"
}
curl.exe -L -o "$MODELS\Qwen3-TTS-12Hz-0.6B-CustomVoice\model.safetensors" `
  "https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/resolve/main/model.safetensors"

3e. Test TTS

# Activate venv
.\.venv-qwen-tts\Scripts\Activate.ps1

# Orpheus TTS test
python test_orpheus_tts.py

# Qwen3-TTS test
python test_qwen_tts.py
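
If the Qwen3-TTS test fails with a missing-file error, an interrupted download in step 3d is a likely cause. The sketch below lists the files that 3d is expected to produce and reports any that are absent or empty. The manifest mirrors the download commands in 3d; the helper names and the assumption that each local folder is named after its repository are illustrative.

```python
from pathlib import Path

HF_BASE = "https://huggingface.co"

# Repository -> files, as listed in the download commands in step 3d.
MANIFEST = {
    "Qwen/Qwen3-TTS-Tokenizer-12Hz": [
        "config.json", "configuration.json",
        "preprocessor_config.json", "model.safetensors",
    ],
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice": [
        "config.json", "generation_config.json", "model.safetensors",
    ],
}


def hf_url(repo: str, filename: str) -> str:
    """Build the direct-download URL for a file on a repo's main branch."""
    return f"{HF_BASE}/{repo}/resolve/main/{filename}"


def missing_files(models_dir, manifest=MANIFEST):
    """List expected files that are absent or empty under models_dir."""
    missing = []
    for repo, files in manifest.items():
        folder = Path(models_dir) / repo.split("/")[-1]  # local dir = repo name
        for name in files:
            target = folder / name
            if not target.is_file() or target.stat().st_size == 0:
                missing.append(str(target))
    return missing
```

Running `missing_files("models")` from the repo root should return an empty list; for any path it does report, `hf_url` gives the address to download again.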

Key difference from macOS: Qwen3-TTS should run on CUDA instead of MPS. torch.backends.mps.is_available() returns False on Windows, so the torch.device("mps") branch in test_qwen_tts.py is never taken; whether the script then picks CUDA or drops to the CPU depends on its fallback. To be safe, update the device logic to prefer CUDA explicitly:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
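
If the same script has to keep working on the Mac, a three-way fallback is one option. The sketch below is not the code in test_qwen_tts.py; the helper names are hypothetical. It separates the decision from the PyTorch calls so the order of preference is easy to see.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def pick_device():
    """Resolve the preferred torch.device for the current machine."""
    import torch  # imported lazily so the pure helper needs no PyTorch

    return torch.device(
        pick_device_name(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
    )
```

With this in place, `device = pick_device()` selects `cuda` on the Razer and `mps` on the MacBook without per-platform edits.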

4. Mission Control Dashboard

cd __LOCAL_LLMs\dashboard

# Install dependencies
npm install

# Start dev server
npm run dev
# Open http://localhost:3000

The dashboard is pure Next.js and works identically on Windows. The API routes auto-detect:

Ollama at localhost:11434
Whisper models in %USERPROFILE%\whisper-models\
TTS engines (Orpheus, Qwen3-TTS) and Python venv

Start Script (PowerShell)

Until a PowerShell equivalent of start-dashboard.sh exists, start the stack manually:

# Quick start (manual)
ollama serve    # only if Ollama is not already running; it blocks, so use a separate terminal
cd __LOCAL_LLMs\dashboard
npm run dev

TODO: Create start-dashboard.ps1 as a PowerShell equivalent of start-dashboard.sh
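
The launcher logic is small enough to sketch before writing the PowerShell version. The following is written in Python rather than PowerShell and is not a replacement for the planned start-dashboard.ps1. It assumes Ollama listens on 127.0.0.1:11434 and that the dashboard lives in `__LOCAL_LLMs\dashboard`; the function names are hypothetical.

```python
import shutil
import socket
import subprocess


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_stack(dashboard_dir: str = r"__LOCAL_LLMs\dashboard"):
    """Start Ollama if it is not listening, then run the dashboard dev server."""
    if not port_open("127.0.0.1", 11434):
        # Popen returns immediately, so the server keeps running in the background.
        subprocess.Popen([shutil.which("ollama") or "ollama", "serve"])
    # shutil.which resolves "npm" to npm.cmd on Windows.
    npm = shutil.which("npm") or "npm"
    subprocess.run([npm, "run", "dev"], cwd=dashboard_dir, check=True)
```

Calling `start_stack()` from the directory that contains `__LOCAL_LLMs` mirrors the manual steps above; the port check avoids starting a second Ollama instance when the tray app is already running.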

5. Key Differences: macOS vs Windows

| Area | macOS (M4 Pro 48 GB) | Windows (Razer Blade 18) |
|---|---|---|
| GPU | Apple Silicon (unified memory, MPS) | RTX 5090 (24 GB VRAM, CUDA) |
| Ollama GPU | Automatic (Metal) | Automatic (CUDA) |
| VRAM | Shared from 48 GB RAM | Dedicated 24 GB GDDR7 |
| PyTorch device | `mps` | `cuda` |
| Whisper install | `brew install whisper-cpp` | Build from source or download release |
| Python venv | `bin/activate` | `Scripts\Activate.ps1` |
| Package manager | Homebrew | winget / scoop |
| Shell | zsh / bash | PowerShell / cmd |
| Scripts | `.sh` (bash) | `.ps1` (PowerShell) |
| Model download | `hf-mirror.com` (corporate proxy) | `huggingface.co` (no proxy) |
| Dashboard | Identical | Identical |
| Ollama models | Identical | Identical |

Performance Expectations

| Workload | macOS M4 Pro 48 GB | Razer RTX 5090 24 GB |
|---|---|---|
| qwen2.5-coder:32b inference | ~15–25 tok/s (Metal/CPU blend) | ~40–60 tok/s (full CUDA) |
| Whisper large-v3-turbo | ~2–4x realtime (CPU) | ~8–15x realtime (CUDA) |
| Orpheus TTS | ~realtime (CPU decode) | ~2–3x realtime (CUDA) |
| Qwen3-TTS | ~realtime (MPS) | ~2–4x realtime (CUDA) |
| 70B quantized models | Fits in 48 GB (slow) | Partially offloads to RAM |

6. File Layout (Same as macOS)

__LOCAL_LLMs/
├── dashboard/                       ← Mission Control (port 3000) — works as-is
├── models/                          ← TTS model weights (gitignored)
│   ├── snac_24khz/
│   ├── Qwen3-TTS-Tokenizer-12Hz/
│   └── Qwen3-TTS-12Hz-0.6B-CustomVoice/
├── .venv-qwen-tts/                  ← Python venv (Scripts\ on Windows)
├── test_orpheus_tts.py              ← works as-is (device fallback)
├── test_qwen_tts.py                 ← update device to prefer CUDA
├── windows_specific/
│   ├── razer-blade-18-spec.md       ← hardware spec
│   └── setup-guide.md              ← this file
└── docs/                            ← macOS-focused docs (still useful as reference)

7. Quick Reference — Full Setup Checklist

[ ] Install NVIDIA drivers + CUDA Toolkit
[ ] Install Ollama (winget install Ollama.Ollama)
[ ] Pull models: qwen2.5-coder:32b, qwen2.5-coder:7b, deepseek-r1:32b, llama3.1:8b, sematre/orpheus:en
[ ] Install Node.js 20+ (winget)
[ ] Install Python 3.12 (winget)
[ ] Install Git (winget)
[ ] Install ffmpeg (winget)
[ ] Clone repo
[ ] Download Whisper model to %USERPROFILE%\whisper-models\
[ ] Build or download whisper-cpp with CUDA
[ ] Create Python venv + install PyTorch CUDA + snac
[ ] Download SNAC decoder
[ ] Download Qwen3-TTS tokenizer + model
[ ] npm install in dashboard/
[ ] Run dashboard: npm run dev
[ ] Verify: http://localhost:3000 shows all green
