docs(local-llms): add 7 RTX 5090 capability deep-dive guides

New capabilities/ subfolder with detailed guides:

- 01: GPU inference speed (benchmarks, Ollama tuning, API usage)
- 02: Whisper batch transcription (scripts, Python integration, use cases)
- 03: TTS generation at scale (Orpheus + Qwen3, batch scripts, voice cloning)
- 04: Fine-tuning / training (LoRA, QLoRA, data prep, Ollama export)
- 05: CUDA / TensorRT / ML research (toolchain setup, Triton kernels, profiling)
- 06: Stable Diffusion / image gen (ComfyUI, SDXL, FLUX, batch generation)
- 07: Multi-GPU workloads (scaling path, eGPU, cloud, cost planning)
- README: index with learning order and prerequisites

Each guide covers: what it is, how to use it, benefits, skills to learn
2026-02-21 20:36:21 -08:00
commit 6d18344fe0
parent 1650e0da6c
8 changed files with 2237 additions and 0 deletions
--- a/__LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md
@@ -0,0 +1,174 @@
+# 1. Raw GPU Inference Speed
+
+> **RTX 5090:** an estimated 2–4× faster inference on all models ≤32B compared to Mac M4 Pro
+> **Why it matters:** Faster coding suggestions, faster conversations, faster iteration
+
+---
+
+## What Is GPU Inference?
+
+When you run a model like `qwen2.5-coder:32b` through Ollama, the GPU does the heavy lifting — multiplying billions of numbers (matrix operations) to generate each token of the response. The speed of this process depends on:
+
+1. **VRAM bandwidth** — how fast model weights can be read from GPU memory
+2. **Compute cores** — how many operations run in parallel
+3. **VRAM capacity** — whether the full model fits without spilling to CPU RAM
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Token Generation Pipeline                                            │
+│                                                                      │
+│  Prompt → [Tokenize] → [GPU: Matrix Multiply] → [Sample] → Token   │
+│                              ▲                                       │
+│                              │                                       │
+│                    This is the bottleneck.                           │
+│                    RTX 5090 does this 2–4× faster.                  │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Performance: Mac vs Razer
+
+| Model                     | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
+| ------------------------- | ---------------- | --------------------- | ------- |
+| llama3.1:8b (4.9 GB)      | ~50–70 tok/s     | ~100–150 tok/s        | ~2×     |
+| qwen2.5-coder:7b (4.7 GB) | ~40–60 tok/s     | ~80–120 tok/s         | ~2×     |
+| qwen2.5-coder:32b (19 GB) | ~15–25 tok/s     | ~40–60 tok/s          | ~2.5×   |
+| deepseek-r1:32b (19 GB)   | ~15–25 tok/s     | ~40–60 tok/s          | ~2.5×   |
+| sematre/orpheus:en (4 GB) | ~realtime        | ~2–3× realtime        | ~2.5×   |
+
+### Why the RTX 5090 Is Faster
+
+```
+┌─────────────────────────┬──────────────────────┬──────────────────────────┐
+│ Metric                  │ Mac M4 Pro           │ RTX 5090                 │
+├─────────────────────────┼──────────────────────┼──────────────────────────┤
+│ GPU memory bandwidth    │ ~273 GB/s (shared)   │ ~896 GB/s (GDDR7)        │
+│ Compute cores           │ 20 GPU cores         │ ~10,500 CUDA cores       │
+│ Tensor cores            │ None (Neural Engine) │ 5th-gen Tensor cores     │
+│ FP16 throughput         │ ~25 TFLOPS           │ ~200+ TFLOPS             │
+│ Model in memory?        │ Yes (unified 48 GB)  │ Yes (24 GB VRAM)         │
+└─────────────────────────┴──────────────────────┴──────────────────────────┘
+```
+
+At ~896 GB/s, the laptop RTX 5090's GDDR7 bandwidth is ~3× the M4 Pro's ~273 GB/s, and it has ~8× the raw FP16 compute throughput. The actual speedup is 2–4× (not 8×) because token generation is mostly **memory-bandwidth bound**, not compute-bound: for each token the GPU spends most of its time reading model weights, not computing.
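The bandwidth-bound argument can be sanity-checked with back-of-envelope arithmetic: generating one token requires reading roughly all model weights once, so memory bandwidth divided by model size gives a ceiling on decode speed. The sketch below is illustrative only; the function name is made up for this example and the bandwidth constants are approximate assumptions, not measurements.

```python
def max_decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on tokens/s when every token needs one full pass over the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed, approximate memory bandwidth in GB/s; substitute your own figures.
MAC_M4_PRO_GB_S = 273.0
RTX_5090_LAPTOP_GB_S = 900.0

MODEL_SIZE_GB = 19.0  # qwen2.5-coder:32b, size as listed in the table above

for name, bandwidth in [("Mac M4 Pro", MAC_M4_PRO_GB_S), ("RTX 5090", RTX_5090_LAPTOP_GB_S)]:
    ceiling = max_decode_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s for a {MODEL_SIZE_GB:.0f} GB model")
```

Measured throughput normally lands somewhat below this ceiling, which makes it a useful cross-check for benchmark results.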
+
+---
+
+## How to Use It
+
+### Basic: Ollama (Already Set Up)
+
+Ollama runs natively on Windows and uses CUDA automatically. No extra config needed.
+
+```bash
+# From WSL2 or Windows terminal
+ollama run qwen2.5-coder:32b "Write a Fastify route that validates input with Zod"
+```
+
+### Interactive Coding Assistant
+
+```bash
+# Start a conversation with the 32B coding model
+ollama run qwen2.5-coder:32b
+
+# Or use the 7B model for quick questions (faster response start)
+ollama run qwen2.5-coder:7b
+```
+
+### From the Dashboard
+
+```bash
+cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
+bash start-dashboard.sh
+# Open http://localhost:3000 — model status + inference visible
+```
+
+### Benchmark Your Actual Speed
+
+```bash
+# Quick benchmark — measure tokens per second
+time ollama run qwen2.5-coder:7b "Write a Python function that implements binary search" --verbose 2>&1 | tail -5
+
+# Compare models
+for model in llama3.1:8b qwen2.5-coder:7b qwen2.5-coder:32b; do
+    echo "=== $model ==="
+    ollama run "$model" "Hello world" --verbose 2>&1 | grep "eval rate"
+done
+```
+
+### API Access (for Scripts/Apps)
+
+```bash
+# Ollama exposes a REST API at localhost:11434
+curl -s http://localhost:11434/api/generate -d '{
+  "model": "qwen2.5-coder:32b",
+  "prompt": "Explain CUDA memory hierarchy in 3 sentences",
+  "stream": false
+}' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
+```
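The same endpoint can be used to measure speed from a script. A non-streaming `/api/generate` response carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so tokens per second can be computed directly. This is a minimal sketch assuming a local Ollama server on the default port; the helper names `tokens_per_second` and `benchmark` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Compute decode speed from eval_count and eval_duration (duration is in ns)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        raise ValueError("response has no usable eval_duration")
    return response["eval_count"] / (duration_ns / 1e9)


def benchmark(model: str, prompt: str, timeout: float = 300.0) -> float:
    """Send one non-streaming request to the local server and return tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return tokens_per_second(json.load(resp))
```

Calling `benchmark("qwen2.5-coder:7b", "Write a binary search in Python")` on each machine with the same prompt gives numbers that can be compared directly. Run it twice and keep the second result so model load time does not skew the comparison.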
+
+---
+
+## Benefits
+
+### For Your LysnrAI / MindLyst Projects
+
+- **Faster code generation** — 32B model responses in ~0.5–1.5s vs ~2–4s on Mac
+- **More context in less time** — can process longer prompts without waiting
+- **Better for agentic workflows** — LangGraph agents that call LLMs multiple times per step run 2–4× faster end-to-end
+- **Batch processing** — generate embeddings, summaries, or classifications for hundreds of items quickly
+
+### For Daily Coding
+
+- **Near-instant small model responses** — 7B at 80–120 tok/s is well above reading speed, so output appears almost as fast as it is requested
+- **Viable 32B coding assistant** — 40–60 tok/s is fast enough for real-time pair programming
+- **DeepSeek-R1 reasoning** — chain-of-thought at 40–60 tok/s makes complex reasoning practical
+
+---
+
+## Skills You'll Learn
+
+| Skill                      | What You'll Learn                                                    | Why It Matters                         |
+| -------------------------- | -------------------------------------------------------------------- | -------------------------------------- |
+| **GPU memory management**  | How VRAM allocation works, model offloading, quantization trade-offs | Core ML engineering skill              |
+| **CUDA profiling**         | Using `nvidia-smi`, `nvtop`, watching GPU utilization                | Essential for optimizing AI workloads  |
+| **Quantization**           | Q4 vs Q8 vs FP16 — speed/quality trade-offs                          | Industry-standard model deployment     |
+| **Inference optimization** | Batch size, context length, KV cache tuning                          | Key for production AI systems          |
+| **Model selection**        | When to use 7B vs 32B vs 70B for different tasks                     | Practical AI engineering judgment      |
+| **REST API integration**   | Building apps that call local LLM APIs                               | Directly applicable to LysnrAI backend |
+
+---
+
+## Advanced: Tuning Ollama for Performance
+
+```bash
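+# Set number of GPU layers with the num_gpu model option (default: as many as fit)
+# num_gpu is passed per request (API "options") or in a Modelfile, not as a server env var
+# Useful if you want to run 2 models with partial GPU offload
+curl -s http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "hi", "stream": false, "options": {"num_gpu": 999}}'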
+
+# Monitor GPU during inference
+watch -n 0.5 nvidia-smi
+
+# Or install nvtop for a richer GPU monitor
+sudo apt install nvtop
+nvtop
+```
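For logging GPU load from a script instead of watching a terminal, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output; the function names and the choice of fields are assumptions made for this example.

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    util, used, total = (float(part.strip()) for part in csv_line.split(","))
    return {
        "utilization_pct": util,
        "memory_used_mib": used,
        "memory_total_mib": total,
        "memory_used_pct": 100.0 * used / total,
    }


def read_gpu_stats() -> dict:
    """Query the first GPU via nvidia-smi (the NVIDIA driver tools must be on PATH)."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(output.splitlines()[0])
```

Calling `read_gpu_stats()` in a loop during a benchmark shows whether a model fits fully in VRAM: memory use close to the total combined with low utilization usually points to layers spilling to CPU RAM.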
+
+### Ollama Environment Variables
+
+| Variable                   | Default               | Description                                             |
+| -------------------------- | --------------------- | ------------------------------------------------------- |
+| `OLLAMA_NUM_PARALLEL`      | version-dependent     | Concurrent request slots per loaded model               |
+| `OLLAMA_MAX_LOADED_MODELS` | 3 per GPU             | Maximum models kept loaded simultaneously               |
+| `OLLAMA_FLASH_ATTENTION`   | off in most releases  | Set to `1` to use flash attention (faster, less VRAM)   |
+| `OLLAMA_GPU_OVERHEAD`      | 0                     | Reserve VRAM (bytes) for other apps                     |
+
+---
+
+## Next Steps
+
+- [ ] Benchmark all 5 models on the Razer and record actual tok/s
+- [ ] Try `OLLAMA_NUM_PARALLEL=4` for concurrent requests
+- [ ] Experiment with `OLLAMA_MAX_LOADED_MODELS=2` to keep 7B + 32B hot
+- [ ] Build a simple script that compares Mac vs Razer inference times
--- a/__LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md
@@ -0,0 +1,306 @@
+# 2. Whisper Batch Transcription
+
+> **RTX 5090:** 8–15× realtime transcription vs 2–4× on Mac
+> **Why it matters:** Hours of audio transcribed in minutes — unlocks bulk audio workflows
+
+---
+
+## What Is Whisper?
+
+[Whisper](https://github.com/openai/whisper) is OpenAI's open-source speech-to-text model. We use [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — a high-performance C/C++ implementation that supports CUDA GPU acceleration.
+
+The `large-v3-turbo` model (~1.5 GB) supports 99 languages and delivers near-human accuracy on well-resourced ones such as English; accuracy drops for lower-resource languages.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Whisper Transcription Pipeline                                       │
+│                                                                      │
+│  Audio File (.wav/.mp3)                                              │
+│       │                                                              │
+│       ▼                                                              │
+│  [FFmpeg: decode + resample to 16kHz mono]                          │
+│       │                                                              │
+│       ▼                                                              │
+│  [Whisper: Mel spectrogram → Encoder → Decoder → Text]              │
+│       │              ▲                                               │
+│       │              │ GPU accelerates this (CUDA or Metal)         │
+│       ▼                                                              │
+│  Transcript (.txt / .srt / .vtt / .json)                            │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Performance: Mac vs Razer
+
+| Audio Length | Mac M4 Pro (Metal) | Razer RTX 5090 (CUDA) | Speedup |
+| ------------ | ------------------ | --------------------- | ------- |
+| 1 minute     | ~15–30s            | ~4–8s                 | ~3×     |
+| 10 minutes   | ~2.5–5 min         | ~40–80s               | ~3×     |
+| 1 hour       | ~15–30 min         | ~4–8 min              | ~3×     |
+| 10 hours     | ~2.5–5 hours       | ~40–80 min            | ~3–4×   |
+| 100 hours    | ~1–2 days          | ~7–13 hours           | ~3–4×   |
+
+### Realtime Multiplier
+
+| Machine        | Speed           | Meaning                  |
+| -------------- | --------------- | ------------------------ |
+| Mac M4 Pro     | ~2–4× realtime  | 1 hour audio → 15–30 min |
+| Razer RTX 5090 | ~8–15× realtime | 1 hour audio → 4–8 min   |
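The conversion between a realtime multiplier and wall-clock time is plain division, which is handy when planning a batch job. A small sketch, with function names chosen only for illustration:

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Processing time for audio of the given length at N-times-realtime speed."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes / realtime_factor


def time_range(audio_minutes: float, slow_factor: float, fast_factor: float) -> tuple:
    """Return (best_case, worst_case) minutes for a range of realtime factors."""
    return (
        transcription_minutes(audio_minutes, fast_factor),
        transcription_minutes(audio_minutes, slow_factor),
    )


# One hour of audio at the estimated 8-15x realtime range from the table above.
best, worst = time_range(60, slow_factor=8, fast_factor=15)
print(f"1 hour of audio: {best:.1f}-{worst:.1f} min")  # prints "1 hour of audio: 4.0-7.5 min"
```

Once a real benchmark is available, plugging the measured factor into `transcription_minutes` replaces the estimated ranges with numbers for this machine.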
+
+---
+
+## How to Use It
+
+### Single File Transcription
+
+```bash
+# Basic transcription
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
+
+# With timestamps (SRT format for subtitles)
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -osrt
+
+# JSON output with word-level timestamps
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -ojf
+
+# Specify language (skip auto-detect for speed)
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l en
+```
+
+### Batch Transcription Script
+
+Create this script to transcribe an entire folder of audio files:
+
+```bash
+#!/bin/bash
+# batch-transcribe.sh — Transcribe all audio files in a directory
+# Usage: bash batch-transcribe.sh /path/to/audio/files
+# Note: formats whisper-cli cannot read directly should be converted to WAV first
+
+INPUT_DIR="${1:-.}"
+MODEL="$HOME/whisper-models/ggml-large-v3-turbo.bin"
+OUTPUT_DIR="${INPUT_DIR}/transcripts"
+
+mkdir -p "$OUTPUT_DIR"
+
+echo "=== Batch Whisper Transcription ==="
+echo "Input:  $INPUT_DIR"
+echo "Output: $OUTPUT_DIR"
+echo "Model:  ggml-large-v3-turbo"
+echo ""
+
+# Collect files; nullglob drops patterns that match nothing
+# (a redirect such as 2>/dev/null is not valid inside a for-loop word list)
+shopt -s nullglob
+FILES=("$INPUT_DIR"/*.{wav,mp3,m4a,flac,ogg,webm})
+shopt -u nullglob
+
+TOTAL=${#FILES[@]}
+DONE=0
+FAILED=0
+START_TIME=$(date +%s)
+
+echo "Found $TOTAL audio files"
+echo ""
+
+for f in "${FILES[@]}"; do
+    [ -f "$f" ] || continue
+    DONE=$((DONE + 1))
+
+    NAME=$(basename "$f")
+    BASENAME="${NAME%.*}"
+    echo "[$DONE/$TOTAL] Transcribing: $NAME"
+
+    # One pass writes both the plain-text transcript and the SRT subtitles
+    if whisper-cli \
+        -m "$MODEL" \
+        -f "$f" \
+        -l en \
+        -otxt \
+        -osrt \
+        -of "$OUTPUT_DIR/$BASENAME" \
+        2>/dev/null; then
+        echo "  -> $OUTPUT_DIR/$BASENAME.txt"
+        echo "  -> $OUTPUT_DIR/$BASENAME.srt"
+    else
+        echo "  !! whisper-cli failed for $NAME" >&2
+        FAILED=$((FAILED + 1))
+    fi
+    echo ""
+done
+
+END_TIME=$(date +%s)
+ELAPSED=$((END_TIME - START_TIME))
+echo "=== Done! $DONE files ($FAILED failed) in ${ELAPSED}s ==="
+```
+
+### Convert Non-WAV Audio First
+
+```bash
+# Whisper works best with 16kHz mono WAV
+# Convert any audio format with ffmpeg
+
+# Single file
+ffmpeg -i podcast.mp3 -ar 16000 -ac 1 podcast.wav
+
+# Batch convert all MP3s in a folder
+for f in *.mp3; do
+    ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mp3}.wav"
+done
+```
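The same conversion can be driven from Python when it needs to sit inside a larger pipeline. This sketch builds the ffmpeg argument list separately from running it, so the command can be inspected or logged first; `ffmpeg_command` and `convert_folder` are hypothetical helper names, and the explicit `pcm_s16le` codec is an assumption about the desired output (16-bit PCM).

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst: Path) -> list:
    """Build an ffmpeg command that writes 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # overwrite an existing output file
        "-i", str(src),
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        "-c:a", "pcm_s16le",     # 16-bit PCM samples
        str(dst),
    ]


def convert_folder(folder: Path, pattern: str = "*.mp3") -> list:
    """Convert every matching file in `folder`; returns the WAV paths written."""
    written = []
    for src in sorted(folder.glob(pattern)):
        dst = src.with_suffix(".wav")
        subprocess.run(ffmpeg_command(src, dst), check=True, capture_output=True)
        written.append(dst)
    return written
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or quotes.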
+
+### Python Integration
+
+```python
+"""Transcribe audio using whisper.cpp via subprocess."""
+import subprocess
+import json
+from pathlib import Path
+
+def transcribe(audio_path: str, language: str = "en") -> dict:
+    """Transcribe an audio file and return the parsed whisper.cpp JSON."""
+    model = Path.home() / "whisper-models" / "ggml-large-v3-turbo.bin"
+    # whisper-cli appends ".json" to this base name (written to the current directory)
+    output_base = Path(audio_path).stem
+
+    result = subprocess.run(
+        [
+            "whisper-cli",
+            "-m", str(model),
+            "-f", audio_path,
+            "-l", language,
+            "-ojf",  # JSON with full metadata
+            "-of", output_base,
+        ],
+        capture_output=True, text=True, timeout=600,
+    )
+
+    # Fail loudly instead of returning a dict with a different shape
+    json_path = Path(f"{output_base}.json")
+    if result.returncode != 0 or not json_path.exists():
+        raise RuntimeError(f"whisper-cli failed: {result.stderr.strip()}")
+
+    with open(json_path, encoding="utf-8") as f:
+        return json.load(f)
+
+# Usage: join the text of every segment, not just the first one
+result = transcribe("meeting-recording.wav")
+print(" ".join(seg["text"].strip() for seg in result["transcription"]))
+```
+
+---
+
+## Real-World Use Cases
+
+### 1. LysnrAI Voice Dictation Pipeline
+
+Your LysnrAI desktop app captures voice → sends to Whisper for transcription. On the Razer, this becomes near-instant:
+
+```
+Voice input (5 seconds) → Whisper (CUDA) → Text in <1 second
+```
+
+### 2. Meeting Transcription
+
+```bash
+# Record a 1-hour Zoom meeting → get full transcript in ~5 minutes
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
+    -f zoom-meeting.wav -l en -otxt -osrt
+```
+
+### 3. Podcast / YouTube Processing
+
+```bash
+# Download YouTube audio
+yt-dlp -x --audio-format wav "https://youtube.com/watch?v=..." -o "video.wav"
+
+# Transcribe
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f video.wav -otxt -osrt
+```
+
+### 4. Subtitle Generation
+
+```bash
+# Generate SRT subtitles for video
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
+    -f movie.wav -l en -osrt
+
+# Output: movie.srt — ready to import into video editors
+```
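Generated SRT files are plain text, so they are easy to post-process, for example to search a transcript by time or to feed segments into another tool. A minimal parser sketch, assuming the usual SRT layout (index line, `start --> end` line, one or more text lines, blank line); the function names are illustrative:

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text: str) -> list:
    """Parse SRT text into a list of {start, end, text} dicts."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed or empty blocks
        start, end = lines[1].split("-->")
        cues.append({
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(line.strip() for line in lines[2:]),
        })
    return cues
```

For example, `parse_srt(open("movie.srt", encoding="utf-8").read())` returns one dict per subtitle cue with start and end times in seconds.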
+
+### 5. Multi-Language Transcription
+
+```bash
+# Auto-detect language
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
+
+# Force specific language
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ja  # Japanese
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ta  # Tamil
+```
+
+---
+
+## Benefits
+
+| Benefit         | Description                                                  |
+| --------------- | ------------------------------------------------------------ |
+| **Speed**       | Process a full day of meetings in under an hour              |
+| **Privacy**     | All transcription runs locally — no data leaves your machine |
+| **Cost**        | Zero API costs (vs $0.006/min for cloud Whisper API)         |
+| **Accuracy**    | large-v3-turbo is near-human accuracy for English            |
+| **Offline**     | Works without internet — useful on flights, trains           |
+| **Batch scale** | Process hundreds of files overnight                          |
+
+### Cost Comparison (100 hours of audio)
+
+| Method                     | Cost   | Time             |
+| -------------------------- | ------ | ---------------- |
+| OpenAI Whisper API         | ~$36   | ~minutes (cloud) |
+| Mac M4 Pro (local)         | $0     | ~25–50 hours     |
+| **Razer RTX 5090 (local)** | **$0** | **~7–13 hours**  |
+
+---
+
+## Skills You'll Learn
+
+| Skill                        | What You'll Learn                                      | Career Value                           |
+| ---------------------------- | ------------------------------------------------------ | -------------------------------------- |
+| **Audio processing**         | Sample rates, codecs, mono/stereo, WAV vs compressed   | Foundational for any audio/speech work |
+| **Speech-to-text pipelines** | Mel spectrograms, encoder-decoder models, beam search  | Core ML/NLP skill                      |
+| **CUDA acceleration**        | How GPU parallelism speeds up neural network inference | Top ML engineering skill               |
+| **Batch processing**         | Shell scripting for processing thousands of files      | DevOps / data engineering              |
+| **Subtitle formats**         | SRT, VTT, JSON — standards for timed text              | Media tech / accessibility             |
+| **Model quantization**       | Understanding why ggml models are smaller and faster   | ML deployment knowledge                |
+| **ffmpeg mastery**           | Audio/video conversion, resampling, format detection   | Essential multimedia tool              |
+| **Python subprocess**        | Integrating CLI tools into Python applications         | Backend engineering pattern            |
+
+---
+
+## Monitoring GPU During Transcription
+
+```bash
+# Watch GPU utilization in real-time
+watch -n 0.5 nvidia-smi
+
+# Or use nvtop for a richer view
+sudo apt install nvtop
+nvtop
+
+# Expected during Whisper:
+# GPU Utilization: 80–95%
+# VRAM Usage: ~2–3 GB (model + buffers)
+# Power: ~150–200W
+```
+
+---
+
+## Next Steps
+
+- [ ] Transcribe a test audio file and verify output quality
+- [ ] Create `batch-transcribe.sh` in `__LOCAL_LLMs/scripts/`
+- [ ] Benchmark: time a 10-minute file on Razer vs Mac
+- [ ] Try multi-language transcription (English + Tamil)
+- [ ] Integrate Whisper output into LysnrAI transcription pipeline
+- [ ] Experiment with `whisper-cli --translate` for translation mode
--- a/__LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md
@@ -0,0 +1,303 @@
+# 3. TTS Generation at Scale
+
+> **RTX 5090:** Qwen3-TTS at 2–4× realtime, Orpheus at 2–3× realtime
+> **Why it matters:** Pre-generate audio libraries, build voice features, create content — all faster than real-time playback
+
+---
+
+## What Is Local TTS?
+
+Text-to-Speech (TTS) converts written text into natural-sounding speech. Our stack has two engines:
+
+| Engine        | Architecture                    | Size        | Voices                  | Quality                               |
+| ------------- | ------------------------------- | ----------- | ----------------------- | ------------------------------------- |
+| **Orpheus**   | Ollama-served, SNAC decoder     | 4 GB        | 8 English voices        | Extremely natural, emotional          |
+| **Qwen3-TTS** | PyTorch model, direct inference | 0.6B params | 10 languages, cloneable | Multilingual, zero-shot voice cloning |
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ TTS Pipeline                                                         │
+│                                                                      │
+│  Orpheus:                                                            │
+│  Text → [Ollama: generate audio tokens] → [SNAC: decode to WAV]    │
+│              ▲ GPU (CUDA)                     ▲ GPU (CUDA)          │
+│                                                                      │
+│  Qwen3-TTS:                                                         │
+│  Text → [PyTorch model: text→mel→audio] → WAV file                 │
+│              ▲ GPU (CUDA / MPS)                                      │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Performance: Mac vs Razer
+
+| Engine                       | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
+| ---------------------------- | ---------------- | --------------------- | ------- |
+| Orpheus (per sentence)       | ~realtime        | ~2–3× realtime        | ~2.5×   |
+| Qwen3-TTS (per sentence)     | ~realtime        | ~2–4× realtime        | ~3×     |
+| Orpheus (10 min narration)   | ~10 min          | ~3–5 min              | ~2.5×   |
+| Qwen3-TTS (10 min narration) | ~10 min          | ~2.5–5 min            | ~3×     |
+| Batch: 100 sentences         | ~5–8 min         | ~2–3 min              | ~3×     |
+| Batch: 1000 sentences        | ~50–80 min       | ~15–25 min            | ~3×     |
+
+**"2–4× realtime" means:** A 10-second sentence generates in 2.5–5 seconds. The audio is produced faster than you could listen to it.
+
+---
+
+## How to Use It
+
+### Orpheus TTS (8 Natural Voices)
+
+Orpheus runs through Ollama + SNAC decoder. Already set up by `setup-tts.sh`.
+
+```bash
+cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
+
+# Generate speech with default voice (tara)
+.venv-qwen-tts/bin/python test_orpheus_tts.py
+
+# Output: test_orpheus_tara.wav, test_orpheus_leah.wav, etc.
+# Play: aplay test_orpheus_tara.wav
+```
+
+#### Available Orpheus Voices
+
+| Voice  | Character          | Best For              |
+| ------ | ------------------ | --------------------- |
+| `tara` | Young female, warm | Narration, assistants |
+| `leah` | Female, clear      | Tutorials, guides     |
+| `jess` | Female, energetic  | Announcements         |
+| `leo`  | Male, calm         | Narration, podcasts   |
+| `dan`  | Male, professional | Business content      |
+| `mia`  | Female, friendly   | Conversational        |
+| `zac`  | Male, young        | Casual content        |
+| `zoe`  | Female, neutral    | General purpose       |
+
+#### Custom Text with Orpheus (Python)
+
+```python
+"""Generate speech from custom text using Orpheus TTS."""
+import json
+import urllib.request
+
+OLLAMA_URL = "http://localhost:11434/api/generate"
+
+def generate_speech(text: str, voice: str = "tara", output_path: str = "output.wav") -> str:
+    """Request audio tokens for `text` from Orpheus via Ollama and return them.
+
+    Decoding the tokens with SNAC and writing `output_path` is not shown here;
+    see test_orpheus_tts.py for the full implementation.
+    """
+    prompt = f"<|audio|>{voice}: {text}"
+
+    payload = json.dumps({
+        "model": "sematre/orpheus:en",
+        "prompt": prompt,
+        "stream": False,
+        "options": {"temperature": 0.6, "top_p": 0.9}
+    }).encode()
+
+    req = urllib.request.Request(
+        OLLAMA_URL,
+        data=payload,
+        headers={"Content-Type": "application/json"}
+    )
+
+    with urllib.request.urlopen(req, timeout=120) as resp:
+        result = json.loads(resp.read())
+
+    # result["response"] holds the generated audio-token text.
+    # Decode audio tokens → SNAC → WAV before treating output_path as written
+    # (full implementation in test_orpheus_tts.py)
+    audio_tokens = result["response"]
+    print(f"Received {len(audio_tokens)} characters of audio tokens for {output_path}")
+    return audio_tokens
+
+# Example
+generate_speech(
+    "Welcome to LysnrAI. Your voice-first productivity assistant.",
+    voice="tara",
+    output_path="welcome.wav"
+)
+```
+
+### Qwen3-TTS (Multilingual + Voice Cloning)
+
+```bash
+cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
+
+# Generate speech with Qwen3-TTS
+.venv-qwen-tts/bin/python test_qwen_tts.py
+
+# Output: test_qwen3_tts_output.wav
+# Play: aplay test_qwen3_tts_output.wav
+```
+
+#### Qwen3-TTS Features
+
+| Feature             | Description                                                                               |
+| ------------------- | ----------------------------------------------------------------------------------------- |
+| **10 languages**    | English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian |
+| **Voice cloning**   | Provide a reference audio clip → model clones the voice                                   |
+| **Emotion control** | Adjust speaking style via prompt engineering                                              |
+| **0.6B parameters** | Small enough to run fast, large enough for quality                                        |
+
+### Batch TTS Generation Script
+
+```bash
+#!/bin/bash
+# batch-tts.sh — Generate audio for a list of sentences
+# Usage: bash batch-tts.sh sentences.txt output_dir/
+
+INPUT_FILE="${1:-sentences.txt}"
+OUTPUT_DIR="${2:-tts_output}"
+VOICE="${3:-tara}"
+
+mkdir -p "$OUTPUT_DIR"
+
+echo "=== Batch Orpheus TTS ==="
+echo "Input:  $INPUT_FILE"
+echo "Output: $OUTPUT_DIR"
+echo "Voice:  $VOICE"
+echo ""
+
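+LINE_NUM=0
+# "|| [ -n ... ]" also handles a last line without a trailing newline
+while IFS= read -r line || [ -n "$line" ]; do
+    [ -z "$line" ] && continue
+    LINE_NUM=$((LINE_NUM + 1))
+    OUT_FILE="$OUTPUT_DIR/sentence_${LINE_NUM}_${VOICE}.wav"
+
+    echo "[$LINE_NUM] Generating: ${line:0:60}..."
+
+    # Pass the text as arguments (not interpolated into the code) so quotes
+    # and apostrophes in a sentence cannot break or inject Python
+    .venv-qwen-tts/bin/python -c '
+import sys
+import test_orpheus_tts as tts
+tts.generate_and_save(sys.argv[1], sys.argv[2], sys.argv[3])
+' "$line" "$VOICE" "$OUT_FILE" </dev/null 2>/dev/null
+
+    echo "  -> $OUT_FILE"
+done < "$INPUT_FILE"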
+
+echo ""
+echo "=== Done! $LINE_NUM files generated ==="
+```
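Starting a new Python process per sentence adds startup overhead, so for large batches it can help to plan the whole job list inside one Python process. The sketch below only builds the list of (text, output path) pairs; `build_jobs` is a hypothetical helper, and skipping lines that start with `#` is an assumption the shell script above does not make.

```python
from pathlib import Path


def build_jobs(lines, output_dir: Path, voice: str = "tara") -> list:
    """Return (text, output_path) pairs, one per non-empty, non-comment line."""
    jobs = []
    for text in (line.strip() for line in lines):
        if not text or text.startswith("#"):
            continue  # skip blank lines and comment lines
        number = len(jobs) + 1
        # Zero-padded numbers keep the files in order when sorted by name
        jobs.append((text, output_dir / f"sentence_{number:04d}_{voice}.wav"))
    return jobs
```

A caller could read `sentences.txt` with `Path("sentences.txt").read_text(encoding="utf-8").splitlines()`, pass the lines to `build_jobs`, and hand each pair to the `generate_and_save` function that the shell script relies on.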
+
+---
+
+## Real-World Use Cases
+
+### 1. LysnrAI Voice Responses
+
+Generate spoken responses from LLM output — the Razer can produce audio faster than the user can listen:
+
+```
+User asks question → LLM generates text → TTS converts to speech → User hears answer
+                                              ▲
+                                    2–4× realtime on RTX 5090
+                                    Feels instant for short responses
+```
+
+### 2. Pre-Generated Audio Libraries
+
+Build a library of common phrases, greetings, and responses. Put one sentence per line in `sentences.txt`:
+
+```text
+Welcome to LysnrAI.
+Your daily brief is ready.
+You have three new memories to review.
+Recording started.
+Recording saved successfully.
+Transcription complete.
+```
+
+```bash
+# Generate all phrases in multiple voices
+for voice in tara leo dan; do
+    bash batch-tts.sh sentences.txt audio_library/ "$voice"
+done
+```
+
+### 3. Audiobook / Podcast Generation
+
+```python
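+# Split a document into paragraphs and generate audio for each
+from pathlib import Path
+
+text = Path("article.txt").read_text(encoding="utf-8")
+paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
+
+for i, para in enumerate(paragraphs):
+    generate_speech(para, voice="leo", output_path=f"chapter_{i:03d}.wav")
+
+# Concatenate with ffmpeg (filelist.txt holds one "file 'chapter_000.wav'" entry per line)
+# ffmpeg -f concat -safe 0 -i filelist.txt -c copy full_audiobook.wav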
+```
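If ffmpeg is not available, the per-paragraph files can also be joined with Python's standard `wave` module, provided they all share the same channel count, sample width, and sample rate. A minimal sketch; `concatenate_wavs` is an illustrative name, not part of the existing scripts:

```python
import wave


def concatenate_wavs(input_paths: list, output_path: str) -> int:
    """Join WAV files with identical format into one file; returns frames written."""
    if not input_paths:
        raise ValueError("no input files given")
    total_frames = 0
    with wave.open(str(input_paths[0]), "rb") as first:
        params = first.getparams()
    with wave.open(str(output_path), "wb") as out:
        out.setparams(params)  # the frame count in the header is fixed up on close
        for path in input_paths:
            with wave.open(str(path), "rb") as src:
                same_format = (
                    src.getnchannels(), src.getsampwidth(), src.getframerate()
                ) == (params.nchannels, params.sampwidth, params.framerate)
                if not same_format:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
                total_frames += src.getnframes()
    return total_frames
```

This reads each file fully into memory, which is fine for paragraph-sized clips; for very long recordings the ffmpeg concat approach is the better fit.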
+
+### 4. Multilingual Content (Qwen3-TTS)
+
+```python
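+# Generate the same message in multiple languages
+# NOTE: generate_qwen_tts() is a placeholder for your own wrapper around the model call in test_qwen_tts.py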
+messages = {
+    "en": "Welcome to MindLyst, your AI-powered life organizer.",
+    "ja": "MindLystへようこそ。AIライフオーガナイザーです。",
+    "es": "Bienvenido a MindLyst, tu organizador de vida con IA.",
+}
+
+for lang, text in messages.items():
+    generate_qwen_tts(text, output_path=f"welcome_{lang}.wav")
+```
+
+### 5. Voice Cloning (Qwen3-TTS)
+
+```python
+# Clone a voice from a reference audio sample
+# Provide a 5–15 second reference clip of the target voice
+# NOTE: generate_qwen_tts() and its reference= argument are placeholders, not a library API
+reference_audio = "my_voice_sample.wav"
+text = "This is my cloned voice saying something new."
+
+# Qwen3-TTS can reproduce the voice characteristics
+generate_qwen_tts(text, reference=reference_audio, output_path="cloned.wav")
+```
+
+---
+
+## Benefits
+
+| Benefit           | Description                                              |
+| ----------------- | -------------------------------------------------------- |
+| **Speed**         | Generate audio faster than playback speed                |
+| **Privacy**       | All voice generation runs locally — no cloud APIs        |
+| **Cost**          | $0 vs $0.015/1K chars for cloud TTS (Google, ElevenLabs) |
+| **Voice variety** | 8 Orpheus voices + unlimited Qwen3-TTS voice cloning     |
+| **Multilingual**  | Qwen3-TTS supports 10 languages natively                 |
+| **Offline**       | Works without internet                                   |
+| **Customizable**  | Control emotion, speed, voice characteristics            |
+
+### Cost Comparison (Generate 1 hour of audio)
+
+| Method                     | Cost    | Time             |
+| -------------------------- | ------- | ---------------- |
+| ElevenLabs API             | ~$15–30 | ~minutes (cloud) |
+| Google Cloud TTS           | ~$0.2–1 | ~minutes (cloud) |
+| Mac M4 Pro (local)         | $0      | ~60 min          |
+| **Razer RTX 5090 (local)** | **$0**  | **~15–30 min**   |
+
+---
+
+## Skills You'll Learn
+
+| Skill                     | What You'll Learn                                       | Career Value                   |
+| ------------------------- | ------------------------------------------------------- | ------------------------------ |
+| **Speech synthesis**      | How neural TTS works (text→tokens→mel→audio)            | Core speech/NLP skill          |
+| **Audio codecs**          | SNAC, Encodec, WAV format, sample rates                 | Audio engineering fundamentals |
+| **Voice cloning**         | Zero-shot voice cloning techniques                      | Cutting-edge ML research area  |
+| **Batch processing**      | Automating large-scale audio generation                 | Production engineering         |
+| **GPU memory**            | Managing VRAM for concurrent TTS + LLM workloads        | ML ops knowledge               |
+| **Audio post-processing** | ffmpeg: concatenation, normalization, format conversion | Multimedia engineering         |
+| **API design**            | Building REST APIs around TTS engines                   | Backend engineering            |
+| **Multilingual NLP**      | Cross-language text processing and pronunciation        | Global product development     |
+
+---
+
+## Next Steps
+
+- [ ] Generate test audio with both Orpheus and Qwen3-TTS on Razer
+- [ ] Create `batch-tts.sh` in `__LOCAL_LLMs/scripts/`
+- [ ] Build a pre-generated audio library for LysnrAI common phrases
+- [ ] Experiment with Qwen3-TTS voice cloning using your own voice
+- [ ] Test whether Qwen3-TTS can produce usable Tamil audio (Tamil is not among the 10 languages listed above)
+- [ ] Measure actual generation speed: words-per-second on each engine
--- a/__LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md
@@ -0,0 +1,322 @@
+# 4. Fine-Tuning / Training
+
+> **RTX 5090:** 24 GB VRAM enables LoRA fine-tuning of 7B–13B models locally
+> **Why it matters:** Customize models for your specific use cases — coding style, domain knowledge, voice commands
+
+---
+
+## What Is Fine-Tuning?
+
+Fine-tuning takes a pre-trained model (like Llama 3.1 8B) and trains it further on your own data to specialize its behavior. Instead of training from scratch (which costs millions), you adjust a small fraction of the model's weights.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Fine-Tuning vs Prompting                                             │
+│                                                                      │
+│  Prompting:    "You are a coding assistant for TypeScript..."       │
+│                Works OK, but limited by context window              │
+│                Model doesn't truly "learn" your preferences         │
+│                                                                      │
+│  Fine-Tuning:  Train on 1000s of your code examples                │
+│                Model internalizes your coding patterns              │
+│                Better quality, no prompt overhead, faster inference  │
+│                                                                      │
+│  LoRA:         Fine-tune only ~0.1–1% of parameters                │
+│                Needs 16–24 GB VRAM (fits RTX 5090)                  │
+│                Training time: hours, not days                        │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Why Not on Mac?
+
+| Aspect            | Mac M4 Pro (MPS)               | RTX 5090 (CUDA)    |
+| ----------------- | ------------------------------ | ------------------ |
+| Training support  | Limited MPS support            | Full CUDA + cuDNN  |
+| Framework support | PyTorch MPS (some ops missing) | Full PyTorch CUDA  |
+| VRAM for training | ~30 GB usable (shared)         | 24 GB dedicated    |
+| Memory bandwidth  | ~273 GB/s                      | ~896 GB/s          |
+| Training speed    | 5–10× slower than CUDA         | Baseline           |
+| LoRA libraries    | Partial compatibility          | Full compatibility |
+
+**Training is both compute- and memory-bandwidth-intensive.** The RTX 5090 laptop GPU's ~896 GB/s VRAM bandwidth is about 3× the M4 Pro's; the rest of the gap comes from Tensor cores and better-optimized CUDA kernels.
+
+---
+
+## Fine-Tuning Methods
+
+### LoRA (Low-Rank Adaptation) — Recommended
+
+Trains only small "adapter" matrices (typically ~0.1–1% of model parameters). The base model stays frozen.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ LoRA Architecture                                                    │
+│                                                                      │
+│  Base Model (frozen, ~16 GB)                                         │
+│  ┌─────────────────────────────────────────────┐                    │
+│  │ Layer 1: [Attention] [FFN]                   │                    │
+│  │ Layer 2: [Attention] [FFN]                   │ ← Not modified    │
+│  │ ...                                          │                    │
+│  │ Layer N: [Attention] [FFN]                   │                    │
+│  └─────────────────────────────────────────────┘                    │
+│       ↕ small adapters (rank 8–64)                                   │
+│  ┌─────────────────────────────────────────────┐                    │
+│  │ LoRA Adapter A (64 KB per layer)             │ ← Trainable       │
+│  │ LoRA Adapter B (64 KB per layer)             │ ← Trainable       │
+│  └─────────────────────────────────────────────┘                    │
+│                                                                      │
+│  Total adapter weights: ~10–50 MB (vs 8–16 GB base)                │
+│  VRAM needed: ~18–22 GB for 7B model                                │
+│  Training time: ~1–4 hours for 1000 examples                        │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### QLoRA (Quantized LoRA) — For Larger Models
+
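+Loads the base model in 4-bit quantization and trains the LoRA adapters in 16-bit precision. Weight memory drops to about a quarter of FP16, and total training VRAM falls by half or more (see the table).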
+
+| Method            | 7B Model VRAM | 13B Model VRAM | 32B Model VRAM |
+| ----------------- | ------------- | -------------- | -------------- |
+| Full fine-tune    | ~56 GB        | ~104 GB        | ~256 GB        |
+| LoRA (FP16)       | ~18 GB        | ~32 GB         | ~72 GB         |
+| **QLoRA (4-bit)** | **~8 GB**     | **~14 GB**     | **~22 GB**     |
+
+**QLoRA at 32B fits in 24 GB VRAM**, though only just (expect short sequences and batch size 1). That is remarkable for a laptop.
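
The table's figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates only the memory taken by the frozen base weights, assuming 2 bytes per parameter for FP16 and 0.5 bytes for 4-bit. Activations, adapter gradients, optimizer state, and quantization overhead come on top, which is why the table's totals are higher. The function name and byte sizes are illustrative assumptions, not measurements.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


for size in (7, 13, 32):
    fp16 = weight_memory_gb(size, 2.0)  # FP16/BF16: 2 bytes per parameter
    int4 = weight_memory_gb(size, 0.5)  # 4-bit: half a byte per parameter
    print(f"{size}B  FP16 weights: {fp16:.1f} GB   4-bit weights: {int4:.1f} GB")
```

The difference between these weight-only numbers and the table is the training overhead, which grows with sequence length and batch size.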
+
+---
+
+## How to Use It
+
+### Setup: Install Training Libraries
+
+```bash
+# Create a dedicated training venv (kept separate from the TTS venv)
+cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
+python3.12 -m venv .venv-training
+source .venv-training/bin/activate
+
+# Install training stack (RTX 50-series GPUs need the CUDA 12.8 wheels)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+pip install transformers datasets accelerate peft bitsandbytes trl
+pip install wandb  # optional: experiment tracking
+```
+
+### Example 1: Fine-Tune Llama 3.1 8B for Your Coding Style
+
+```python
+"""Fine-tune Llama 3.1 8B with QLoRA (4-bit base + LoRA adapters) on your coding examples."""
+from datasets import Dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
+from peft import LoraConfig, get_peft_model
+from trl import SFTTrainer
+
+# 1. Prepare your data
+# Format: instruction → response pairs from YOUR code
+training_data = [
+    {
+        "instruction": "Write a Fastify route that creates a user with Zod validation",
+        "response": """import { z } from 'zod';
+import type { FastifyInstance } from 'fastify';
+
+const CreateUserSchema = z.object({
+  email: z.string().email(),
+  name: z.string().min(1).max(100),
+  productId: z.string(),
+});
+
+export async function userRoutes(app: FastifyInstance) {
+  app.post('/users', async (req, reply) => {
+    const body = CreateUserSchema.parse(req.body);
+    const user = await req.server.userRepository.create(body);
+    return reply.status(201).send(user);
+  });
+}"""
+    },
+    # Add 100–1000 more examples from your actual codebase
+]
+
+dataset = Dataset.from_list([
+    {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
+    for d in training_data
+])
+
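+# 2. Load model in 4-bit (QLoRA)
+import torch
+from transformers import BitsAndBytesConfig
+from peft import prepare_model_for_kbit_training
+
+model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    quantization_config=BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_quant_type="nf4",
+        bnb_4bit_compute_dtype=torch.bfloat16,
+    ),
+    device_map="auto",
+)
+model = prepare_model_for_kbit_training(model)  # make the quantized model safe to train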
+
+# 3. Configure LoRA
+lora_config = LoraConfig(
+    r=16,                # Rank (higher = more capacity, more VRAM)
+    lora_alpha=32,       # Scaling factor
+    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM",
+)
+
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+# Expect roughly 13.6M trainable params, well under 1% of the ~8B total
+
+# 4. Train
+training_args = TrainingArguments(
+    output_dir="./lora-llama-coding",
+    num_train_epochs=3,
+    per_device_train_batch_size=2,
+    gradient_accumulation_steps=4,
+    learning_rate=2e-4,
+    warmup_steps=10,
+    logging_steps=10,
+    save_steps=100,
+    bf16=True,  # BF16 mixed precision; avoids FP16 loss-scaling issues
+)
+
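+# Note: newer TRL releases move dataset_text_field / max_seq_length into SFTConfig
+# and rename tokenizer= to processing_class=; adjust to your installed version.
+trainer = SFTTrainer(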
+    model=model,
+    train_dataset=dataset,
+    args=training_args,
+    tokenizer=tokenizer,
+    dataset_text_field="text",
+    max_seq_length=2048,
+)
+
+trainer.train()
+
+# 5. Save the adapter (small file, ~50 MB)
+model.save_pretrained("./lora-llama-coding")
+tokenizer.save_pretrained("./lora-llama-coding")
+```
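
The count printed by `print_trainable_parameters()` can be cross-checked by hand. The sketch below is plain arithmetic, assuming the published Llama 3.1 8B shapes (hidden size 4096, 8 key/value heads of 128 dimensions, 32 layers) and the `r=16` and `target_modules` from the `LoraConfig` above. The helper name is illustrative.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a d_out x d_in weight."""
    return rank * (d_in + d_out)


# Assumed Llama 3.1 8B shapes: hidden 4096, KV projection width 8 * 128 = 1024, 32 layers
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
print(f"Trainable LoRA params: {total:,}")
print(f"Adapter size at 2 bytes/param: {total * 2 / 1e6:.1f} MB")
```

Doubling the rank doubles both numbers, which is a quick way to budget VRAM before trying ranks 32 or 64.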
+
+### Example 2: Fine-Tune for LysnrAI Voice Commands
+
+```python
+# Train a model to understand your specific voice command patterns
+training_data = [
+    {"instruction": "Parse: remind me to call john tomorrow at 3pm",
+     "response": '{"action": "reminder", "contact": "john", "time": "tomorrow 3pm", "task": "call"}'},
+    {"instruction": "Parse: add milk to my grocery list",
+     "response": '{"action": "add_item", "list": "grocery", "item": "milk"}'},
+    {"instruction": "Parse: summarize my meeting notes from yesterday",
+     "response": '{"action": "summarize", "source": "meeting_notes", "date": "yesterday"}'},
+    # ... hundreds more examples
+]
+```
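
Because the target output here is JSON, a fine-tuned parser can be scored mechanically. The following is a minimal sketch, assuming model outputs are collected as strings; `exact_match_rate` is a hypothetical helper, not part of any library. It compares parsed objects, so key order does not matter and malformed output counts as a miss.

```python
import json


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that parse as JSON and equal the reference object."""
    hits = 0
    for pred, ref in zip(predictions, references):
        try:
            if json.loads(pred) == json.loads(ref):
                hits += 1
        except json.JSONDecodeError:
            pass  # malformed model output counts as a miss
    return hits / len(references) if references else 0.0


refs = ['{"action": "add_item", "list": "grocery", "item": "milk"}']
preds = ['{"item": "milk", "action": "add_item", "list": "grocery"}']  # key order differs
print(exact_match_rate(preds, refs))
```

Running the same held-out prompts through the base and fine-tuned models gives a direct before/after comparison.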
+
+### Example 3: Convert LoRA to Ollama Model
+
+After training, merge the LoRA adapter and convert to GGUF for Ollama:
+
+```bash
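+# Merge LoRA adapter back into base model (run from the training directory)
+python -c "
+import torch
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+
+# Load the base in FP16 (not 4-bit) so the merged weights are full precision
+model = AutoPeftModelForCausalLM.from_pretrained('./lora-llama-coding', torch_dtype=torch.float16)
+merged = model.merge_and_unload()
+merged.save_pretrained('./merged-llama-coding')
+# The GGUF converter also needs the tokenizer files
+AutoTokenizer.from_pretrained('./lora-llama-coding').save_pretrained('./merged-llama-coding')
+"
+
+# Convert to GGUF (requires a built llama.cpp checkout in ~/llama.cpp).
+# convert_hf_to_gguf.py writes f32/f16/bf16/q8_0 only; Q4_K_M is a separate quantize step.
+python ~/llama.cpp/convert_hf_to_gguf.py ./merged-llama-coding \
+    --outtype f16 --outfile merged-llama-coding-f16.gguf
+~/llama.cpp/build/bin/llama-quantize merged-llama-coding-f16.gguf merged-llama-coding.gguf Q4_K_M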
+
+# Create Ollama model
+cat > Modelfile <<EOF
+FROM ./merged-llama-coding.gguf
+SYSTEM "You are a TypeScript coding assistant specialized in Fastify, Zod, and Azure Cosmos DB."
+EOF
+
+ollama create my-coding-model -f Modelfile
+ollama run my-coding-model
+```
+
+---
+
+## What Models Can You Fine-Tune on 24 GB VRAM?
+
+| Model         | Method      | VRAM Usage | Feasible?      | Training Time (1K examples) |
+| ------------- | ----------- | ---------- | -------------- | --------------------------- |
+| Llama 3.1 8B  | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
+| Qwen 2.5 7B   | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
+| Llama 3.1 8B  | LoRA FP16   | ~18 GB     | ✅ Fits        | ~2–3 hours                  |
+| Mistral 7B    | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
+| Llama 2 13B   | QLoRA 4-bit | ~14 GB     | ✅ Fits        | ~3–5 hours                  |
+| CodeLlama 34B | QLoRA 4-bit | ~22 GB     | ⚠️ Tight       | ~6–10 hours                 |
+| Llama 3.1 70B | QLoRA 4-bit | ~40 GB     | ❌ Too large   | Need multi-GPU              |
+
+---
+
+## Benefits
+
+| Benefit              | Description                                                         |
+| -------------------- | ------------------------------------------------------------------- |
+| **Personalization**  | Model learns YOUR coding style, patterns, and conventions           |
+| **Domain expertise** | Train on your project's specific API patterns, schemas, terminology |
+| **Smaller + faster** | A fine-tuned 7B can outperform a generic 32B on your specific tasks |
+| **Privacy**          | Your training data never leaves your machine                        |
+| **Cost**             | $0 vs $100–1000s for cloud fine-tuning (OpenAI, Together AI)        |
+| **Iteration speed**  | Train → test → adjust in hours, not days                            |
+| **Portable output**  | Export to GGUF/Ollama — runs anywhere                               |
+
+---
+
+## Skills You'll Learn
+
+| Skill                      | What You'll Learn                                        | Career Value                                  |
+| -------------------------- | -------------------------------------------------------- | --------------------------------------------- |
+| **LoRA / QLoRA**           | Parameter-efficient fine-tuning techniques               | Top ML engineering skill (2024–2026 standard) |
+| **Hugging Face ecosystem** | transformers, datasets, peft, trl, accelerate            | Industry-standard ML tooling                  |
+| **Training loops**         | Loss functions, learning rates, gradient accumulation    | Core ML fundamentals                          |
+| **Data preparation**       | Curating, cleaning, formatting training datasets         | Critical for any ML project                   |
+| **Quantization**           | 4-bit, 8-bit, FP16 — trade-offs and techniques           | Essential for deployment                      |
+| **Model evaluation**       | Perplexity, human eval, A/B testing fine-tuned vs base   | ML product development                        |
+| **VRAM management**        | Gradient checkpointing, mixed precision, batch sizing    | GPU optimization                              |
+| **Model merging**          | Merging LoRA adapters, converting to GGUF                | ML deployment pipeline                        |
+| **Experiment tracking**    | Weights & Biases, training curves, hyperparameter tuning | Professional ML workflow                      |
+
+---
+
+## Training Data Sources for Your Projects
+
+| Source                 | Data Type              | Fine-Tune Goal           |
+| ---------------------- | ---------------------- | ------------------------ |
+| Your GitHub repos      | TypeScript/Python code | Coding style model       |
+| LysnrAI voice commands | Command → JSON pairs   | Voice command parser     |
+| MindLyst triage logs   | Input → categorization | Content triage model     |
+| Your commit messages   | Diff → message pairs   | Commit message generator |
+| Your code reviews      | Code → feedback pairs  | Code review assistant    |
+| Slack/Teams messages   | Conversations          | Writing style model      |
+
+### Extracting Training Data from Your Repos
+
+```bash
+# List TypeScript files that contain exports (candidate sources for training examples)
+find ~/code/mygh/learning_ai_common_plat -name "*.ts" -exec grep -l "export" {} \; | head -20
+
+# Pair each of the last 100 commit messages with a summary of its diff
+# (drop --stat to capture the full patch instead)
+cd ~/code/mygh/learning_ai_common_plat
+git log -100 --format="%H %s" | while read -r hash msg; do
+    echo "=== $msg ==="
+    git show --stat --format= "$hash"
+    echo ""
+done
+```
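
Once pairs are extracted, they need a stable on-disk format. The sketch below assumes the instruction/response shape used in the examples above and writes one JSON object per line (JSONL), which the `datasets` library can load directly. `to_jsonl` and the sample pairs are illustrative.

```python
import json


def to_jsonl(pairs: list[dict[str, str]]) -> str:
    """Serialize instruction/response pairs as one JSON object per line."""
    lines = []
    for pair in pairs:
        record = {
            "instruction": pair["instruction"].strip(),
            "response": pair["response"].strip(),
        }
        if record["instruction"] and record["response"]:  # skip empty pairs
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


pairs = [
    {"instruction": "Write a commit message for: 3 files changed in auth/",
     "response": "fix(auth): handle expired refresh tokens"},
    {"instruction": "  ", "response": "dropped because the instruction is empty"},
]
print(to_jsonl(pairs))
```

Filtering empty or whitespace-only pairs at this stage keeps obviously bad rows out of the training set.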
+
+---
+
+## Next Steps
+
+- [ ] Install training libraries in a dedicated venv
+- [ ] Collect 100+ instruction/response pairs from your codebase
+- [ ] Run a QLoRA fine-tune of Llama 3.1 8B on your coding examples
+- [ ] Evaluate: compare fine-tuned model vs base model on 20 test prompts
+- [ ] Convert to GGUF and serve through Ollama
+- [ ] Fine-tune a voice command parser for LysnrAI
+- [ ] Experiment with different LoRA ranks (8, 16, 32, 64) and measure quality
--- a/__LOCAL_LLMs/windows_specific/capabilities/05-cuda-tensorrt-ml-research.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/05-cuda-tensorrt-ml-research.md
@ -0,0 +1,382 @@
+# 5. CUDA / TensorRT / ML Research
+
+> **RTX 5090:** Full NVIDIA toolchain — CUDA 13.x, cuDNN, TensorRT, Triton
+> **Why it matters:** Most ML papers, frameworks, and production systems are CUDA-first. This is the industry-standard GPU compute platform.
+
+---
+
+## What Is the NVIDIA ML Toolchain?
+
+NVIDIA provides a layered stack of tools that turn your GPU into a general-purpose compute engine:
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ NVIDIA ML Toolchain (RTX 5090)                                       │
+│                                                                      │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │ Your Code (Python / C++ / TypeScript)                        │    │
+│  └────────────────────────┬────────────────────────────────────┘    │
+│                           ▼                                          │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │ ML Frameworks                                                │    │
+│  │ PyTorch · TensorFlow · JAX · ONNX Runtime                  │    │
+│  └────────────────────────┬────────────────────────────────────┘    │
+│                           ▼                                          │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │ NVIDIA Libraries                                             │    │
+│  │ TensorRT · Triton · cuDNN · cuBLAS · NCCL                  │    │
+│  └────────────────────────┬────────────────────────────────────┘    │
+│                           ▼                                          │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │ CUDA Runtime + Driver                                        │    │
+│  │ CUDA 13.x · GPU scheduling · memory management             │    │
+│  └────────────────────────┬────────────────────────────────────┘    │
+│                           ▼                                          │
+│  ┌─────────────────────────────────────────────────────────────┐    │
+│  │ Hardware                                                     │    │
+│  │ RTX 5090: ~10.5K CUDA cores · 24 GB GDDR7 · Tensor cores   │    │
+│  └─────────────────────────────────────────────────────────────┘    │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Component Breakdown
+
+| Component           | What It Does                                                | Why You Need It                  |
+| ------------------- | ----------------------------------------------------------- | -------------------------------- |
+| **CUDA**            | General-purpose GPU programming                             | Foundation for everything        |
+| **cuDNN**           | Optimized neural network primitives (conv, attention, etc.) | 2–5× faster training/inference   |
+| **TensorRT**        | Model optimization + inference engine                       | 2–4× faster deployment inference |
+| **Triton** (NVIDIA) | Inference server for serving models at scale                | Production model serving         |
+| **Triton** (OpenAI) | GPU kernel compiler (write custom GPU kernels in Python)    | Research + custom ops            |
+| **cuBLAS**          | Optimized matrix multiplication                             | Core of all neural network math  |
+| **NCCL**            | Multi-GPU communication                                     | Distributed training (future)    |
+
+---
+
+## How to Set Up
+
+### 1. CUDA Toolkit (Inside WSL2)
+
+```bash
+# Check if CUDA is already available (from Windows driver passthrough)
+nvidia-smi
+nvcc --version  # CUDA compiler
+
+# If nvcc is not found, install CUDA toolkit
+# (nvidia-smi works from driver passthrough, but nvcc needs the toolkit)
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
+sudo dpkg -i cuda-keyring_1.1-1_all.deb
+sudo apt update
+sudo apt install -y cuda-toolkit-12-8
+
+# Add to PATH
+echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
+echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
+source ~/.bashrc
+
+# Verify
+nvcc --version
+# Should show: CUDA 12.8+ (RTX 50-series GPUs need 12.8 or newer)
+```
+
+### 2. cuDNN
+
+```bash
+# cuDNN is usually bundled with PyTorch, but for custom builds
+# (RTX 50-series GPUs need cuDNN 9; the package comes from the NVIDIA apt repo added above):
+sudo apt install -y cudnn9-cuda-12
+
+# Verify in Python
+python3 -c "import torch; print(f'cuDNN: {torch.backends.cudnn.version()}')"
+```
+
+### 3. TensorRT
+
+```bash
+# Install TensorRT
+pip install tensorrt
+
+# Or via apt for system-wide
+sudo apt install -y tensorrt
+
+# Verify
+python3 -c "import tensorrt; print(f'TensorRT: {tensorrt.__version__}')"
+```
+
+### 4. PyTorch with Full CUDA Support
+
+```bash
+# Create a research environment
+python3.12 -m venv ~/.venv-ml-research
+source ~/.venv-ml-research/bin/activate
+
+# Install PyTorch with CUDA 12.8 (required for RTX 50-series GPUs)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+
+# Verify
+python3 -c "
+import torch
+print(f'PyTorch: {torch.__version__}')
+print(f'CUDA available: {torch.cuda.is_available()}')
+print(f'GPU: {torch.cuda.get_device_name(0)}')
+print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
+print(f'cuDNN: {torch.backends.cudnn.version()}')
+"
+```
+
+---
+
+## How to Use It
+
+### CUDA Programming Basics (Python)
+
+```python
+"""Your first CUDA program — matrix multiplication on the GPU."""
+import torch
+import time
+
+device = torch.device("cuda")
+
+# Create two large matrices on the GPU
+A = torch.randn(4096, 4096, device=device)
+B = torch.randn(4096, 4096, device=device)
+
+# Warm up
+_ = torch.mm(A, B)
+torch.cuda.synchronize()
+
+# Benchmark
+start = time.perf_counter()
+for _ in range(100):
+    C = torch.mm(A, B)
+torch.cuda.synchronize()
+elapsed = time.perf_counter() - start
+
+tflops = (2 * 4096**3 * 100) / elapsed / 1e12
+print(f"Matrix multiply: {elapsed:.3f}s for 100 iterations")
+print(f"Throughput: {tflops:.1f} TFLOPS")
+# This runs in FP32; results vary widely with clocks and laptop power limits.
+# Create A and B with dtype=torch.float16 to exercise the Tensor cores and compare.
+```
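
The throughput figure in the benchmark comes from a simple count: an n × n matrix multiply performs n³ multiply-add pairs, or 2n³ floating-point operations. Here is that formula as a standalone sketch with a made-up timing, so the arithmetic can be checked without a GPU. The function name and the 0.5 s figure are hypothetical.

```python
def matmul_tflops(n: int, iterations: int, seconds: float) -> float:
    """Throughput of `iterations` n x n matrix multiplies, in TFLOPS.

    One n x n matmul performs n**3 multiply-add pairs, i.e. 2 * n**3 FLOPs.
    """
    total_flops = 2 * n**3 * iterations
    return total_flops / seconds / 1e12


# Hypothetical timing: 100 multiplies of 4096 x 4096 matrices in 0.5 s
print(f"{matmul_tflops(4096, 100, 0.5):.1f} TFLOPS")
```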
+
+### TensorRT: Optimize a Model for Faster Inference
+
+```python
+"""Convert a PyTorch model to TensorRT and compare inference speed."""
+import time
+
+import torch
+import torch_tensorrt
+
+# Load the model in FP16 so both paths run at the same precision
+model = torch.hub.load('pytorch/vision', 'resnet50', weights='IMAGENET1K_V2').eval().half().cuda()
+
+# Compile with TensorRT
+trt_model = torch_tensorrt.compile(
+    model,
+    inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float16)],
+    enabled_precisions={torch.float16},
+)
+
+input_tensor = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)
+
+
+def bench(fn, iters=1000, warmup=20):
+    """Time `iters` forward passes after a short warm-up."""
+    with torch.no_grad():
+        for _ in range(warmup):
+            fn(input_tensor)
+        torch.cuda.synchronize()
+        start = time.perf_counter()
+        for _ in range(iters):
+            fn(input_tensor)
+        torch.cuda.synchronize()
+        return time.perf_counter() - start
+
+
+pytorch_time = bench(model)   # PyTorch baseline
+trt_time = bench(trt_model)   # TensorRT optimized
+
+print(f"PyTorch: {pytorch_time:.3f}s")
+print(f"TensorRT: {trt_time:.3f}s")
+print(f"Speedup: {pytorch_time/trt_time:.1f}×")
+
+### Reproducing ML Research Papers
+
+Many ML papers ship code that assumes CUDA. The RTX 5090 lets you run it directly:
+
+```bash
+# Example: Run a recent paper's code
+git clone https://github.com/some-researcher/cool-new-model.git
+cd cool-new-model
+
+# Typical requirements
+pip install -r requirements.txt
+
+# Run training/evaluation
+python train.py --device cuda --epochs 10 --batch-size 16
+
+# This would NOT work on Mac (CUDA-only dependencies)
+```
+
+### Custom CUDA Kernels with Triton (OpenAI)
+
+```python
+"""Write a custom GPU kernel in Python using Triton."""
+import triton
+import triton.language as tl
+import torch
+
+@triton.jit
+def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
+    """Simple vector addition kernel running on GPU."""
+    pid = tl.program_id(axis=0)
+    block_start = pid * BLOCK_SIZE
+    offsets = block_start + tl.arange(0, BLOCK_SIZE)
+    mask = offsets < n_elements
+
+    x = tl.load(x_ptr + offsets, mask=mask)
+    y = tl.load(y_ptr + offsets, mask=mask)
+    output = x + y
+    tl.store(output_ptr + offsets, output, mask=mask)
+
+# Run the kernel
+n = 1_000_000
+x = torch.randn(n, device="cuda")
+y = torch.randn(n, device="cuda")
+output = torch.empty_like(x)
+
+grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
+add_kernel[grid](x, y, output, n, BLOCK_SIZE=1024)
+
+print(f"Result correct: {torch.allclose(output, x + y)}")
+```
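
The grid and mask logic in the kernel above is easier to follow when run step by step on the CPU. The sketch below is a pure-Python emulation for illustration only (it is not Triton and runs sequentially): each loop turn plays the role of one program instance, and the `i < n` check plays the role of `mask`. Both function names are hypothetical.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same rounding triton.cdiv performs."""
    return (a + b - 1) // b


def emulate_add_kernel(x: list[float], y: list[float], block_size: int) -> list[float]:
    """Run the vector-add logic block by block on the CPU, for illustration only."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):  # one loop turn per program instance
        offsets = range(pid * block_size, (pid + 1) * block_size)
        for i in offsets:
            if i < n:  # the mask: skip lanes past the end of the data
                out[i] = x[i] + y[i]
    return out


print(cdiv(1_000_000, 1024))  # number of program instances launched above
print(emulate_add_kernel([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], block_size=2))
```

With three elements and a block size of two, the second block has one lane out of range, which is exactly the case the mask protects against.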
+
+---
+
+## Real-World Use Cases for Your Projects
+
+### 1. LysnrAI Inference Optimization
+
+Convert your Whisper or TTS models to TensorRT for even faster inference:
+
+```python
+# Optimize Whisper encoder with TensorRT
+# This can give another 2× speedup on top of whisper.cpp CUDA
+```
+
+### 2. Custom Embedding Models
+
+Run embedding models on the GPU for LysnrAI's semantic search (a starting point before training domain-specific ones):
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
+
+# Encode your documents
+documents = ["meeting notes about Q1 budget", "grocery list for weekend"]  # add your own
+embeddings = model.encode(documents, batch_size=64, show_progress_bar=True)
+# Throughput depends on document length and batch size; measure on your own corpus
+```
+
+### 3. Reproduce and Experiment with New Models
+
+When a new paper drops (GPT-5 alternatives, new TTS models, new architectures):
+
+```bash
+# Clone → install → run (check that the pinned PyTorch build supports your GPU)
+git clone https://github.com/new-cool-model
+cd new-cool-model
+pip install -r requirements.txt
+python evaluate.py --device cuda
+```
+
+### 4. Benchmarking and Profiling
+
+```python
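+# Profile GPU usage during inference
+# (assumes `model` and `input_tensor` are already on the GPU, as in the TensorRT example)
+import torch
+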
+with torch.profiler.profile(
+    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
+    with_stack=True,
+) as prof:
+    model(input_tensor)
+
+print(prof.key_averages().table(sort_by="cuda_time_total"))
+```
+
+---
+
+## Benefits
+
+| Benefit                   | Description                                               |
+| ------------------------- | --------------------------------------------------------- |
+| **Industry standard**     | Most ML research code and production serving targets CUDA |
+| **Framework support**     | PyTorch, TensorFlow, JAX — all CUDA-first                 |
+| **Paper reproduction**    | Run most papers' CUDA code without porting it first       |
+| **TensorRT optimization** | 2–4× faster inference on optimized models                 |
+| **Custom kernels**        | Write GPU code in Python (Triton) or C++ (CUDA)           |
+| **Profiling tools**       | nvidia-smi, Nsight, PyTorch profiler — rich debugging     |
+| **Production parity**     | Same CUDA stack as cloud GPU instances (A100, H100)       |
+
+---
+
+## Skills You'll Learn
+
+| Skill                     | What You'll Learn                                     | Career Value                       |
+| ------------------------- | ----------------------------------------------------- | ---------------------------------- |
+| **CUDA fundamentals**     | GPU memory model, kernel launches, synchronization    | Core ML infrastructure skill       |
+| **TensorRT**              | Model optimization, quantization, graph fusion        | Production ML deployment           |
+| **Triton kernels**        | Custom GPU programming in Python                      | Research + performance engineering |
+| **PyTorch profiling**     | Identifying bottlenecks, optimizing training loops    | Essential ML engineering           |
+| **cuDNN**                 | Optimized neural network operations                   | Framework-level understanding      |
+| **Mixed precision**       | FP16/BF16 training, loss scaling, numerical stability | Modern training standard           |
+| **GPU memory management** | Memory pools, caching allocator, OOM debugging        | Practical ML engineering           |
+| **Model optimization**    | Graph optimization, operator fusion, quantization     | ML deployment pipeline             |
+| **Benchmark design**      | Fair comparison methodology, statistical significance | Research methodology               |
+
+### Career Impact
+
+CUDA proficiency is one of the most valuable ML engineering skills. Here's where it maps:
+
+| Role                  | How CUDA Skills Apply                                 |
+| --------------------- | ----------------------------------------------------- |
+| **ML Engineer**       | Optimize training pipelines, reduce inference latency |
+| **AI Infrastructure** | Build and maintain GPU clusters, optimize throughput  |
+| **Research Engineer** | Implement custom operations for novel architectures   |
+| **MLOps**             | TensorRT deployment, GPU monitoring, autoscaling      |
+| **Full-Stack AI**     | End-to-end: train → optimize → serve → monitor        |
+
+---
+
+## Monitoring and Debugging
+
+```bash
+# Real-time GPU monitoring
+watch -n 0.5 nvidia-smi
+
+# Detailed GPU info
+nvidia-smi -q
+
+# GPU process list
+nvidia-smi pmon
+
+# Install nvtop (interactive GPU monitor)
+sudo apt install nvtop
+nvtop
+
+# PyTorch memory debugging
+python3 -c "
+import torch
+print(torch.cuda.memory_summary(device=0, abbreviated=True))
+"
+```
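
For logging VRAM use from a script, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints one comma-separated line per GPU, in MiB. The sketch below parses such a line. The sample string is illustrative rather than a real reading, and `parse_gpu_memory` is a hypothetical helper.

```python
def parse_gpu_memory(csv_line: str) -> dict[str, float]:
    """Parse one 'used, total' line (MiB) into used/total/percent figures."""
    used_str, total_str = (field.strip() for field in csv_line.split(","))
    used, total = float(used_str), float(total_str)
    return {"used_mib": used, "total_mib": total, "percent": 100.0 * used / total}


# Illustrative sample line, not a real measurement
sample = "18432, 24576"
stats = parse_gpu_memory(sample)
print(f"{stats['used_mib']:.0f} / {stats['total_mib']:.0f} MiB ({stats['percent']:.0f}%)")
```

In practice the line would come from `subprocess.run(...)` with the command above; that call is left out so the sketch runs without a GPU.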
+
+---
+
+## Next Steps
+
+- [ ] Install CUDA toolkit + cuDNN in WSL2
+- [ ] Verify PyTorch CUDA with a matrix multiply benchmark
+- [ ] Run a TensorRT optimization on a simple model
+- [ ] Write a Triton kernel (vector add → custom attention)
+- [ ] Profile an Ollama inference request with nvidia-smi
+- [ ] Try reproducing a recent ML paper from GitHub
+- [ ] Benchmark: PyTorch vs TensorRT inference speed for Whisper
--- a/__LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md
@ -0,0 +1,325 @@
+# 6. Stable Diffusion / Image Generation
+
+> **RTX 5090:** 5–8 seconds per image (SDXL) vs ~30 seconds on Mac
+> **Why it matters:** Generate UI mockups, app icons, marketing assets, concept art — all locally and free
+
+---
+
+## What Is Stable Diffusion?
+
+Stable Diffusion is an open-source text-to-image AI model. You describe what you want in plain English, and it generates a high-quality image in seconds. It runs entirely on your GPU.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Stable Diffusion Pipeline                                            │
+│                                                                      │
+│  "A modern app dashboard with dark theme and blue accents"          │
+│       │                                                              │
+│       ▼                                                              │
+│  [CLIP Text Encoder] → text embeddings                              │
+│       │                                                              │
+│       ▼                                                              │
+│  [U-Net: iterative denoising × 20–50 steps]  ← GPU-intensive       │
+│       │                                                              │
+│       ▼                                                              │
+│  [VAE Decoder] → pixel image                                        │
+│       │                                                              │
+│       ▼                                                              │
+│  1024×1024 PNG image                                                 │
+│                                                                      │
+│  RTX 5090: ~5–8s per image (SDXL)                                   │
+│  Mac M4 Pro: ~25–35s per image (SDXL via MPS)                       │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Performance: Mac vs Razer
+
+| Model            | Resolution | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
+| ---------------- | ---------- | ---------------- | --------------------- | ------- |
+| SD 1.5           | 512×512    | ~8–12s           | ~1–2s                 | ~5×     |
+| SDXL             | 1024×1024  | ~25–35s          | ~5–8s                 | ~4×     |
+| SDXL Turbo       | 1024×1024  | ~8–12s           | ~1–3s                 | ~4×     |
+| FLUX.1 [dev]     | 1024×1024  | ~60–90s          | ~10–20s               | ~5×     |
+| FLUX.1 [schnell] | 1024×1024  | ~15–25s          | ~3–6s                 | ~4×     |
+
+### VRAM Requirements
+
+| Model             | VRAM Needed | Fits in 24 GB? |
+| ----------------- | ----------- | -------------- |
+| SD 1.5            | ~4 GB       | ✅ Easily      |
+| SDXL              | ~7 GB       | ✅ Easily      |
+| SDXL + ControlNet | ~10 GB      | ✅ Yes         |
+| FLUX.1 [dev]      | ~12 GB      | ✅ Yes         |
+| FLUX.1 + LoRA     | ~14 GB      | ✅ Yes         |
+| SD3 Medium        | ~12 GB      | ✅ Yes         |
+
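+**24 GB VRAM fits every model in this table.** FLUX.1 [dev] at full BF16 precision is larger than the figure shown and needs FP8 or other quantized weights, or CPU offload.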
+
+---
+
+## How to Set Up
+
+### Option A: ComfyUI (Node-Based — Recommended)
+
+ComfyUI is a powerful node-based UI for Stable Diffusion. It's flexible, efficient, and well-suited for automation.
+
+```bash
+# Clone ComfyUI
+cd ~
+git clone https://github.com/comfyanonymous/ComfyUI.git
+cd ComfyUI
+
+# Create venv and install
+python3.12 -m venv venv
+source venv/bin/activate
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+pip install -r requirements.txt
+
+# Download SDXL model (~6.5 GB)
+mkdir -p models/checkpoints
+cd models/checkpoints
+wget "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
+
+# Start ComfyUI
+cd ~/ComfyUI
+python main.py
+
+# Open in browser: http://localhost:8188
+# Accessible from Windows browser via WSL2 port forwarding
+```
+
+### Option B: Automatic1111 (Classic Web UI)
+
+```bash
+# Clone A1111
+cd ~
+git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
+cd stable-diffusion-webui
+
+# Launch (auto-installs deps on first run)
+bash webui.sh --listen --xformers
+
+# Open: http://localhost:7860
+```
+
+### Option C: Python Script (No UI)
+
+```python
+"""Generate images with Stable Diffusion from Python."""
+import torch
+from diffusers import StableDiffusionXLPipeline
+
+# Load SDXL
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
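+# Optional memory optimization: requires `pip install xformers`.
+# PyTorch 2.x already uses its built-in scaled-dot-product attention by default.
+pipe.enable_xformers_memory_efficient_attention()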
+
+# Generate
+image = pipe(
+    prompt="A sleek dark-themed mobile app dashboard showing AI brain categories, "
+           "neon blue and teal accents, glassmorphism cards, modern UI design",
+    negative_prompt="blurry, low quality, text, watermark",
+    num_inference_steps=30,
+    guidance_scale=7.5,
+    width=1024,
+    height=1024,
+).images[0]
+
+image.save("dashboard_concept.png")
+print("Generated: dashboard_concept.png")
+```
+
+### Batch Generation Script
+
+```python
+"""Generate multiple variations of an image concept."""
+import torch
+from diffusers import StableDiffusionXLPipeline
+import time
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+pipe.enable_xformers_memory_efficient_attention()
+
+prompts = [
+    "App icon for LysnrAI, microphone with sound waves, dark background, modern gradient",
+    "App icon for MindLyst, brain with neural connections, dark background, blue teal gradient",
+    "Feature graphic for LysnrAI, voice waveform visualization, dark theme, 1024x500",
+    "Splash screen, abstract sound waves, dark navy background, teal highlights",
+    "Dashboard mockup, dark theme, cards with charts, sidebar navigation, modern UI",
+]
+
+for i, prompt in enumerate(prompts):
+    start = time.time()
+    image = pipe(
+        prompt=prompt,
+        negative_prompt="blurry, low quality, text, watermark, ugly",
+        num_inference_steps=30,
+        guidance_scale=7.5,
+        width=1024,
+        height=1024,
+        generator=torch.Generator("cuda").manual_seed(42 + i),
+    ).images[0]
+
+    elapsed = time.time() - start
+    filename = f"generated_{i:02d}.png"
+    image.save(filename)
+    print(f"[{i+1}/{len(prompts)}] {filename} ({elapsed:.1f}s)")
+```
+
+---
+
+## Real-World Use Cases for Your Projects
+
+### 1. App Store Assets
+
+Generate icon concepts, feature graphics, and screenshot backgrounds:
+
+```python
+# LysnrAI app icon variations
+prompts = [
+    "Minimal app icon, single microphone, dark navy background, glowing teal outline, flat design",
+    "App icon, sound wave forming a brain shape, dark background, blue to teal gradient",
+    "App icon, headphones with AI sparkles, dark background, modern glassmorphism",
+]
+```
+
+### 2. UI/UX Mockup Exploration
+
+Rapidly prototype visual ideas before coding:
+
+```python
+# Generate dashboard layout concepts
+prompt = """
+Web dashboard design, dark theme (#06070A background), sidebar navigation,
+main content area with 3 cards showing brain categories,
+teal and blue accent colors, modern glassmorphism,
+high fidelity UI design, clean typography
+"""
+```
+
+### 3. Marketing and Social Media
+
+```python
+# Blog post hero images
+prompt = "Abstract AI visualization, neural network nodes connected by light beams, dark background, blue and teal colors, cinematic lighting"
+
+# Social media posts
+prompt = "Infographic style, voice AI assistant concept, microphone with sound waves transforming into text, dark modern design"
+```
+
+### 4. Concept Art for Features
+
+```python
+# Visualize MindLyst "Brains" concept
+prompts = [
+    "Digital brain labeled 'Work', organized files and charts floating around it, dark theme, blue glow",
+    "Digital brain labeled 'Health', fitness and medical icons floating around it, dark theme, green glow",
+    "Digital brain labeled 'Finance', money and graphs floating around it, dark theme, gold glow",
+]
+```
+
+### 5. ControlNet (Image-Guided Generation)
+
+Use an existing image as a structural guide:
+
+```python
+import torch
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
+
+# Load ControlNet for edge-guided generation
+controlnet = ControlNetModel.from_pretrained(
+    "diffusers/controlnet-canny-sdxl-1.0",
+    torch_dtype=torch.float16,
+).to("cuda")
+
+# Use your existing dashboard screenshot as a structural guide
+# Generate a redesigned version with new visual style
+```
+
+---
+
+## Benefits
+
+| Benefit                     | Description                                                      |
+| --------------------------- | ---------------------------------------------------------------- |
+| **Speed**                   | 5–8s per image (vs 30s on Mac or waiting for cloud APIs)         |
+| **Cost**                    | $0 per image (vs $0.02–0.08 per image for DALL-E 3 / Midjourney) |
+| **Privacy**                 | Generated locally — no images sent to cloud                      |
+| **Control**                 | Full parameter control: steps, guidance, seed, resolution        |
+| **Reproducibility**         | Same seed, settings, and software stack = same image             |
+| **Customization**           | LoRA fine-tuning for brand-specific styles                       |
+| **Batch capability**        | Generate hundreds of variations overnight                        |
+| **No content restrictions** | No cloud content policies limiting your output                   |
+
+### Cost Comparison (100 images)
+
+| Method                           | Cost         | Time           |
+| -------------------------------- | ------------ | -------------- |
+| DALL-E 3 API                     | $4–8         | ~5 min (cloud) |
+| Midjourney                       | $10–30/month | ~5 min (cloud) |
+| Mac M4 Pro (SDXL, local)         | $0           | ~40–55 min     |
+| **Razer RTX 5090 (SDXL, local)** | **$0**       | **~8–13 min**  |
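+
+The time column is simply per-image latency multiplied by the batch size. A small sketch of that arithmetic using the per-image ranges from the performance table; the function names are illustrative and the API price is whatever per-image rate you are quoted:
+
+```python
+def batch_minutes(num_images: int, seconds_per_image: float) -> float:
+    """Wall-clock minutes for a sequential batch."""
+    return num_images * seconds_per_image / 60
+
+
+def api_cost(num_images: int, price_per_image: float) -> float:
+    """Total cost of a batch at a flat per-image price."""
+    return num_images * price_per_image
+
+
+if __name__ == "__main__":
+    n = 100
+    print(f"RTX 5090 (5-8 s/image): {batch_minutes(n, 5):.0f}-{batch_minutes(n, 8):.0f} min")
+    print(f"Mac M4 Pro (25-35 s/image): {batch_minutes(n, 25):.0f}-{batch_minutes(n, 35):.0f} min")
+    print(f"Cloud API at $0.04-0.08/image: ${api_cost(n, 0.04):.0f}-${api_cost(n, 0.08):.0f}")
+```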
+
+---
+
+## Skills You'll Learn
+
+| Skill                     | What You'll Learn                                   | Career Value                 |
+| ------------------------- | --------------------------------------------------- | ---------------------------- |
+| **Diffusion models**      | How iterative denoising generates images from noise | Core generative AI knowledge |
+| **Prompt engineering**    | Crafting effective text prompts for visual output   | Universal AI skill           |
+| **ControlNet**            | Structural guidance for image generation            | Advanced image AI            |
+| **LoRA training**         | Fine-tuning image models for specific styles        | Generative AI customization  |
+| **ComfyUI workflows**     | Node-based visual programming for AI pipelines      | Production image generation  |
+| **Image post-processing** | Upscaling, inpainting, outpainting techniques       | Digital content creation     |
+| **VRAM optimization**     | Model offloading, attention optimization, tiling    | GPU memory management        |
+| **Batch automation**      | Scripting large-scale image generation              | Production engineering       |
+| **Model selection**       | SD 1.5 vs SDXL vs FLUX — trade-offs                 | Practical AI judgment        |
+
+---
+
+## Advanced: FLUX.1 (Latest Generation)
+
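+FLUX.1 is a newer open-weight image model family from Black Forest Labs (founded by Stability AI alumni). It generally follows prompts more closely and renders finer detail than SDXL. FLUX.1 [schnell] is Apache 2.0 licensed; FLUX.1 [dev] is released under a non-commercial license.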
+
+```bash
+# FLUX.1 [schnell] — fast, 4 steps
+# FLUX.1 [dev] — high quality, 50 steps
+
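+# The full 16-bit pipeline is larger than 24 GB, so offload idle components to CPU RAM
+python3 -c "
+from diffusers import FluxPipeline
+import torch
+
+pipe = FluxPipeline.from_pretrained(
+    'black-forest-labs/FLUX.1-schnell',
+    torch_dtype=torch.bfloat16,  # dtype used in the diffusers FLUX examples
+)
+pipe.enable_model_cpu_offload()  # needs accelerate; keeps only the active component on the GPU
+
+image = pipe(
+    'A futuristic AI assistant interface, holographic UI, dark theme',
+    num_inference_steps=4,  # schnell is distilled for few-step sampling
+    guidance_scale=0.0,
+    max_sequence_length=256,
+).images[0]
+image.save('flux_test.png')
+"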
+```
+
+---
+
+## Next Steps
+
+- [ ] Install ComfyUI in WSL2 and verify CUDA acceleration
+- [ ] Download SDXL base model and generate first test image
+- [ ] Generate app icon concepts for LysnrAI and MindLyst
+- [ ] Try ControlNet with an existing dashboard screenshot
+- [ ] Experiment with FLUX.1 [schnell] for higher quality
+- [ ] Create a batch script for generating marketing assets
+- [ ] Build a style LoRA trained on your brand colors/aesthetic
--- a/__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md
@ -0,0 +1,373 @@
+# 7. Multi-GPU Workloads (Future Path)
+
+> **RTX 5090:** Your first serious CUDA GPU — a stepping stone to multi-GPU and cloud GPU workflows
+> **Why it matters:** The skills, code, and workflows you build on one GPU translate directly to multi-GPU and cloud infrastructure
+
+---
+
+## Why Think About Multi-GPU Now?
+
+You don't need multiple GPUs today. But learning on a single RTX 5090 builds skills that scale directly:
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ GPU Scaling Path                                                     │
+│                                                                      │
+│  TODAY                                                               │
+│  ┌─────────────────────────────────┐                                │
+│  │ RTX 5090 Laptop (24 GB VRAM)   │ ← You are here                 │
+│  │ Single GPU, local inference     │                                │
+│  │ Fine-tuning up to 13B models   │                                │
+│  └────────────────┬────────────────┘                                │
+│                   │                                                  │
+│  NEAR FUTURE      ▼                                                  │
+│  ┌─────────────────────────────────┐                                │
+│  │ Desktop + eGPU or 2× GPU tower │                                │
+│  │ 48–96 GB total VRAM            │                                │
+│  │ Fine-tune 70B models           │                                │
+│  │ Run 2–3 models simultaneously  │                                │
+│  └────────────────┬────────────────┘                                │
+│                   │                                                  │
+│  CLOUD / PROD     ▼                                                  │
+│  ┌─────────────────────────────────┐                                │
+│  │ Cloud GPU instances             │                                │
+│  │ A100/H100 × 2–8 (80–640 GB)   │                                │
+│  │ Train large models              │                                │
+│  │ Serve at scale                  │                                │
+│  └─────────────────────────────────┘                                │
+│                                                                      │
+│  SAME CUDA code works at every level ↑                              │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## What Multi-GPU Enables
+
+| Capability            | Single GPU (24 GB) | 2× GPU (48 GB)  | 4× GPU (96 GB)  | Cloud (640 GB) |
+| --------------------- | ------------------ | --------------- | --------------- | -------------- |
+| 7B inference          | ✅ Very fast       | ✅ + concurrent | ✅ + concurrent | ✅ at scale    |
+| 32B inference         | ✅ Fits in VRAM    | ✅ Very fast    | ✅ Very fast    | ✅ at scale    |
+| 70B inference         | ⚠️ Partial GPU     | ✅ Full GPU     | ✅ Very fast    | ✅ at scale    |
+| 7B fine-tune (QLoRA)  | ✅ Comfortable     | ✅ Faster       | ✅ Faster       | ✅ Fastest     |
+| 13B fine-tune (QLoRA) | ✅ Fits            | ✅ Comfortable  | ✅ Fast         | ✅ Fastest     |
+| 70B fine-tune (QLoRA) | ❌ OOM             | ⚠️ Tight        | ✅ Fits         | ✅ Comfortable |
+| 7B full fine-tune     | ❌ OOM             | ⚠️ Tight        | ✅ Fits         | ✅ Comfortable |
+| Serve 3+ models       | ❌ VRAM limit      | ✅ Yes          | ✅ Yes          | ✅ Yes         |
+| SDXL + LLM concurrent | ⚠️ Tight           | ✅ Yes          | ✅ Yes          | ✅ Yes         |
+
+---
+
+## Expansion Options
+
+### Option 1: eGPU (Thunderbolt/USB4)
+
+Connect an external GPU to your Razer Blade via Thunderbolt 4:
+
+```
+┌──────────────────────────────────┐     Thunderbolt 4     ┌──────────────────────┐
+│ Razer Blade 18                   │◄═══(~40 Gbps)════════►│ eGPU Enclosure       │
+│ RTX 5090 (24 GB) — internal     │                        │ RTX 4090 (24 GB)     │
+│                                  │                        │ or RTX 5090 (32 GB)  │
+└──────────────────────────────────┘                        └──────────────────────┘
+
+Total VRAM: 48 GB (24 + 24) or 56 GB (24 + 32)
+Limitation: Thunderbolt 4 (~40 Gbps, about 5 GB/s) is far slower than a PCIe 5.0 x16 slot (~64 GB/s per direction)
+Best for: Model serving (latency-tolerant), not training (bandwidth-sensitive)
+```
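+
+To make the bandwidth limitation concrete, here is a rough sketch of how long it takes to move model weights over each link. It assumes theoretical peak rates (about 5 GB/s for Thunderbolt 4, about 64 GB/s per direction for PCIe 5.0 x16); real-world throughput is lower, and the names are illustrative:
+
+```python
+# Theoretical peak link rates in GB/s (assumptions; sustained throughput is lower)
+LINKS_GB_PER_S = {
+    "Thunderbolt 4 (40 Gbps)": 40 / 8,
+    "PCIe 5.0 x16": 64.0,
+}
+
+
+def transfer_seconds(size_gb: float, link_gb_per_s: float) -> float:
+    """Seconds to move `size_gb` gigabytes over a link at its peak rate."""
+    return size_gb / link_gb_per_s
+
+
+if __name__ == "__main__":
+    model_gb = 19  # e.g. a 32B model at 4-bit quantization
+    for name, rate in LINKS_GB_PER_S.items():
+        print(f"{name}: {transfer_seconds(model_gb, rate):.1f} s to load {model_gb} GB")
+```
+
+Loading a model once is tolerable over either link. Training, which exchanges gradients on every step, is where the slower link hurts.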
+
+**Recommended eGPU enclosures:**
+
+| Enclosure        | Price | GPU Support              |
+| ---------------- | ----- | ------------------------ |
+| Razer Core X     | ~$300 | Full-length desktop GPUs |
+| Sonnet Breakaway | ~$250 | Most desktop GPUs        |
+| Akitio Node      | ~$200 | Compact form factor      |
+
+### Option 2: Desktop Build (Maximum Performance)
+
+Build a dedicated GPU workstation:
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Desktop GPU Workstation                                              │
+│                                                                      │
+│  Motherboard: ASUS/MSI with 2–4× PCIe 5.0 x16 slots               │
+│  CPU: Intel i9 or AMD Ryzen 9 (enough PCIe lanes)                  │
+│  RAM: 128 GB DDR5                                                    │
+│  GPU 1: RTX 5090 (32 GB GDDR7)                                     │
+│  GPU 2: RTX 5090 (32 GB GDDR7)                                     │
+│  Total VRAM: 64 GB                                                   │
+│  PSU: 1600W+ (two 5090s draw ~1150W under load)                    │
+│                                                                      │
+│  Cost: ~$5,000–7,000                                                │
+│  Performance: Near-datacenter for most workloads                    │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Option 3: Cloud GPU (On-Demand)
+
+Rent GPU time when you need it:
+
+| Provider    | GPU       | VRAM   | Price/Hour | Best For             |
+| ----------- | --------- | ------ | ---------- | -------------------- |
+| Lambda Labs | A100 80GB | 80 GB  | ~$1.10     | Training             |
+| RunPod      | A100 80GB | 80 GB  | ~$1.64     | Training + inference |
+| Vast.ai     | RTX 4090  | 24 GB  | ~$0.30     | Budget inference     |
+| AWS (p4d)   | A100 ×8   | 640 GB | ~$32       | Large-scale training |
+| Together AI | H100      | 80 GB  | ~$2.50     | Fine-tuning API      |
+
+**Code written for your RTX 5090 generally runs unchanged on cloud GPUs**, since both use the same PyTorch and CUDA stack.
+
+---
+
+## How to Prepare Now (Single GPU)
+
+### 1. Write GPU-Agnostic Code
+
+Structure your code so it works with any number of GPUs:
+
+```python
+"""GPU-agnostic model loading — works with 1 or N GPUs."""
+import torch
+
+def get_device() -> str:
+    """Select the best available device."""
+    if torch.cuda.is_available():
+        # "cuda" is the current CUDA device (cuda:0 by default).
+        # PyTorch does not spread work across GPUs on its own.
+        return "cuda"
+    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+
+device = get_device()
+model = torch.nn.Linear(1024, 1024).to(device)  # placeholder; substitute your own model
+
+# Replicate across GPUs only when more than one is visible.
+# DataParallel is the simplest option; DistributedDataParallel scales better for training.
+if torch.cuda.device_count() > 1:
+    print(f"Using {torch.cuda.device_count()} GPUs")
+    model = torch.nn.DataParallel(model)
+```
+
+### 2. Learn Model Parallelism Concepts
+
+```python
+"""Pipeline parallelism — split model layers across GPUs."""
+# Example: split a large model across 2 GPUs
+
+# GPU 0: layers 0–15
+# GPU 1: layers 16–31
+
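+# With Hugging Face Accelerate:
+from accelerate import init_empty_weights, load_checkpoint_and_dispatch
+from transformers import AutoConfig, AutoModelForCausalLM
+
+# Assumes ./model-weights holds a Hugging Face checkpoint (config.json plus weights)
+config = AutoConfig.from_pretrained("./model-weights")
+
+with init_empty_weights():
+    model = AutoModelForCausalLM.from_config(config)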
+
+model = load_checkpoint_and_dispatch(
+    model,
+    checkpoint="./model-weights",
+    device_map="auto",  # Automatically distributes across available GPUs
+)
+```
+
+### 3. Ollama Multi-GPU (Already Supported)
+
+Ollama can split a single large model across multiple GPUs:
+
+```bash
+# When you have 2 GPUs, Ollama auto-detects them and splits the model's layers
+# A 4-bit 70B model (~40 GB) is spread across both GPUs; layers that don't fit in VRAM fall back to CPU RAM
+
+# Or manually control GPU assignment
+CUDA_VISIBLE_DEVICES=0,1 ollama serve
+
+# Check GPU allocation
+nvidia-smi  # Shows VRAM usage per GPU
+```
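+
+For scripted checks, per-GPU memory can be read from `nvidia-smi`'s CSV query mode rather than its full table. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are illustrative, and the demo parses a made-up sample string so it runs without a GPU:
+
+```python
+import subprocess
+
+# CSV query mode: one line per GPU; memory values are in MiB when nounits is set
+QUERY = [
+    "nvidia-smi",
+    "--query-gpu=index,memory.used,memory.total",
+    "--format=csv,noheader,nounits",
+]
+
+
+def parse_gpu_memory(csv_text: str) -> list[dict]:
+    """Parse 'index, used, total' lines into a list of dicts."""
+    gpus = []
+    for line in csv_text.strip().splitlines():
+        index, used, total = (field.strip() for field in line.split(","))
+        gpus.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
+    return gpus
+
+
+def read_gpu_memory() -> list[dict]:
+    """Run nvidia-smi and return per-GPU memory usage."""
+    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
+    return parse_gpu_memory(out)
+
+
+if __name__ == "__main__":
+    sample = "0, 19000, 24564\n1, 16000, 24564\n"  # illustrative values, not a real reading
+    for gpu in parse_gpu_memory(sample):
+        print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
+```
+
+Call `read_gpu_memory()` on the machine itself to confirm how Ollama actually split the model.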
+
+### 4. vLLM (High-Throughput Inference Server)
+
+```bash
+# vLLM supports tensor parallelism across GPUs
+pip install vllm
+
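+# Serve a model across 2 GPUs
+# Note: the 16-bit 70B weights are ~140 GB, far more than 2× 24 GB;
+# on consumer cards, use a 4-bit (AWQ/GPTQ) checkpoint or a smaller model
+python -m vllm.entrypoints.openai.api_server \
+    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
+    --tensor-parallel-size 2 \
+    --gpu-memory-utilization 0.9
+
+# API compatible with OpenAI format
+curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{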
+    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
+    "prompt": "Hello",
+    "max_tokens": 100
+}'
+```
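+
+The same endpoint can be called from Python with only the standard library. A minimal sketch, assuming the server above is listening on `localhost:8000` and returns the usual OpenAI-style completions response; the helper names `build_completion_request` and `complete` are illustrative:
+
+```python
+import json
+import urllib.request
+
+
+def build_completion_request(prompt: str, model: str,
+                             base_url: str = "http://localhost:8000",
+                             max_tokens: int = 100) -> urllib.request.Request:
+    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
+    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
+    return urllib.request.Request(
+        f"{base_url}/v1/completions",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def complete(prompt: str, model: str, **kwargs) -> str:
+    """Send the request and return the generated text (requires a running server)."""
+    request = build_completion_request(prompt, model, **kwargs)
+    with urllib.request.urlopen(request) as response:
+        return json.load(response)["choices"][0]["text"]
+```
+
+Because only the base URL changes, the same client works against a local single-GPU server and a cloud deployment.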
+
+---
+
+## Scaling Scenarios for Your Projects
+
+### Scenario 1: LysnrAI at Scale
+
+```
+TODAY (1× RTX 5090):
+  - 1 user, local inference
+  - Whisper + Ollama + TTS, one at a time
+
+FUTURE (2× GPU desktop):
+  - GPU 0: Ollama (always-on coding model)
+  - GPU 1: Whisper + TTS (dedicated)
+  - Run all 3 workloads concurrently
+
+PRODUCTION (Cloud):
+  - vLLM serving on A100
+  - Whisper on dedicated GPU
+  - TTS on dedicated GPU
+  - Handles 100+ concurrent users
+```
+
+### Scenario 2: Fine-Tuning Larger Models
+
+```
+TODAY (24 GB VRAM):
+  - QLoRA 7B–13B models
+  - Training time: 1–4 hours
+
+FUTURE (48 GB VRAM):
+  - QLoRA 70B models
+  - LoRA FP16 32B models
+  - Training time: 4–12 hours
+
+CLOUD (80+ GB VRAM):
+  - Full fine-tune 7B–13B models
+  - QLoRA 70B+ models
+  - Training time: hours with H100
+```
+
+### Scenario 3: Image + Text Generation Pipeline
+
+```
+TODAY (1× GPU):
+  - SDXL OR LLM, not both at once (VRAM constraint)
+
+FUTURE (2× GPU):
+  - GPU 0: LLM (Ollama, 32B model, ~19 GB)
+  - GPU 1: SDXL + ControlNet (~10 GB)
+  - Generate images guided by LLM descriptions
+
+PRODUCTION:
+  - Automated content pipeline:
+    LLM writes description → SDXL generates image → TTS adds voiceover
+```
+
+---
+
+## Benefits of Starting Single-GPU
+
+| Benefit                  | Description                                                         |
+| ------------------------ | ------------------------------------------------------------------- |
+| **Code portability**     | Single-GPU code scales out with wrappers (DDP), not rewrites        |
+| **Skill foundation**     | Memory management, profiling, optimization skills transfer directly |
+| **Cost efficiency**      | Learn on local hardware ($0) before renting cloud ($$$)             |
+| **Workflow development** | Build training pipelines, inference servers, batch scripts now      |
+| **Hardware literacy**    | Understand VRAM limits, bandwidth, PCIe bottlenecks                 |
+
+---
+
+## Skills You'll Build Toward
+
+| Skill                    | Single GPU (Now)   | Multi-GPU (Future)           | Career Impact           |
+| ------------------------ | ------------------ | ---------------------------- | ----------------------- |
+| **CUDA programming**     | Kernels, memory    | NCCL, all-reduce             | ML Infrastructure       |
+| **Model parallelism**    | Understand concept | Implement tensor/pipeline    | Senior ML Engineer      |
+| **Distributed training** | Data loading       | Multi-node coordination      | ML Platform Engineer    |
+| **Inference serving**    | Ollama, local API  | vLLM, Triton, load balancing | MLOps / Production      |
+| **GPU monitoring**       | nvidia-smi, nvtop  | Cluster monitoring           | DevOps / SRE            |
+| **Cost optimization**    | VRAM budget        | Spot instances, right-sizing | FinOps / ML Engineering |
+| **Batch scheduling**     | Cron jobs          | Kubernetes, Slurm            | ML Platform             |
+
+### Learning Path
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ Skills Progression                                                   │
+│                                                                      │
+│  Level 1 (Now — RTX 5090 Single GPU)                                │
+│  ├── PyTorch + CUDA basics                                          │
+│  ├── Ollama model serving                                           │
+│  ├── QLoRA fine-tuning 7B models                                    │
+│  ├── nvidia-smi monitoring                                          │
+│  └── TensorRT basic optimization                                    │
+│                                                                      │
+│  Level 2 (6 months — Same GPU, deeper skills)                       │
+│  ├── Custom Triton kernels                                          │
+│  ├── vLLM inference server                                          │
+│  ├── Advanced quantization (AWQ, GPTQ)                              │
+│  ├── Profiling and optimization                                     │
+│  └── Model merging and adapter stacking                             │
+│                                                                      │
+│  Level 3 (12 months — Multi-GPU or Cloud)                           │
+│  ├── Multi-GPU inference (tensor parallelism)                       │
+│  ├── Distributed training (DDP, FSDP)                               │
+│  ├── Cloud GPU workflow (Lambda, RunPod)                             │
+│  ├── Production serving with autoscaling                            │
+│  └── NCCL and multi-node communication                              │
+│                                                                      │
+│  Level 4 (Future — Production ML)                                    │
+│  ├── Kubernetes + GPU scheduling                                    │
+│  ├── Model serving at scale (thousands of requests/sec)             │
+│  ├── Training pipelines with experiment tracking                    │
+│  ├── Custom model architectures                                     │
+│  └── Open-source ML contributions                                   │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Cost Planning
+
+### Single GPU (Current)
+
+| Item                               | Cost        | Status    |
+| ---------------------------------- | ----------- | --------- |
+| Razer Blade 18 RTX 5090            | $5,200      | Purchased |
+| Electricity (~200W avg, 8 hrs/day) | ~$15/month  | Ongoing   |
+| **Total first year**               | **~$5,380** |           |
+
+### Desktop Upgrade (Future)
+
+| Item                       | Estimated Cost |
+| -------------------------- | -------------- |
+| Desktop tower + PSU + RAM  | ~$1,500        |
+| RTX 5090 desktop GPU       | ~$2,000        |
+| Second RTX 5090 (optional) | ~$2,000        |
+| **Total (2× GPU desktop)** | **~$5,500**    |
+
+### Cloud Alternative (Per-Use)
+
+| Usage Pattern           | Monthly Cost |
+| ----------------------- | ------------ |
+| 10 hours/month on A100  | ~$11         |
+| 40 hours/month on A100  | ~$44         |
+| 160 hours/month on A100 | ~$176        |
+| Always-on A100          | ~$792        |
+
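+**Break-even vs. a ~$5,500 desktop:** about 7 months if the A100 would otherwise run around the clock, about 31 months at 160 hours/month, and over 10 years at 40 hours/month (hardware cost only, excluding electricity).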
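+
+Break-even depends heavily on monthly usage, so it is worth computing for your own pattern. A short sketch using the figures from the tables above (~$5,500 desktop, ~$1.10/hour A100); it ignores electricity, resale value, and performance differences between the GPUs:
+
+```python
+A100_PRICE_PER_HOUR = 1.10  # from the cloud pricing table
+DESKTOP_COST = 5500         # 2x GPU desktop estimate
+
+
+def break_even_months(hardware_cost: float, hours_per_month: float,
+                      price_per_hour: float = A100_PRICE_PER_HOUR) -> float:
+    """Months of cloud rental that add up to the hardware purchase price."""
+    return hardware_cost / (hours_per_month * price_per_hour)
+
+
+if __name__ == "__main__":
+    for hours in (10, 40, 160, 720):
+        months = break_even_months(DESKTOP_COST, hours)
+        print(f"{hours:>3} h/month -> break-even after ~{months:.0f} months")
+```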
+
+---
+
+## Next Steps
+
+- [ ] Write all training and inference scripts to be GPU-count agnostic
+- [ ] Install and test vLLM on single GPU with Llama 3.1 8B
+- [ ] Practice monitoring GPU memory and compute utilization
+- [ ] Try model offloading: run a 70B model with partial CPU/GPU split
+- [ ] Explore Lambda Labs or RunPod for a cloud GPU test ($5–10 experiment)
+- [ ] Benchmark single GPU throughput to establish a baseline for comparison
--- a/__LOCAL_LLMs/windows_specific/capabilities/README.md
+++ b/__LOCAL_LLMs/windows_specific/capabilities/README.md
@ -0,0 +1,52 @@
+# RTX 5090 Capabilities — Deep Dive Guides
+
+> What you can do with the Razer Blade 18's RTX 5090 (24 GB GDDR7) that you can't (or can't do well) on the Mac.
+
+Each guide covers: **what it is → how to use it → real-world use cases → benefits → skills you'll learn.**
+
+---
+
+## Guides
+
+| #                                       | Capability                        | Key Benefit                          | Skill Level  |
+| --------------------------------------- | --------------------------------- | ------------------------------------ | ------------ |
+| [01](01-gpu-inference-speed.md)         | **GPU Inference Speed**           | 2–4× faster LLM responses            | Beginner     |
+| [02](02-whisper-batch-transcription.md) | **Whisper Batch Transcription**   | Hours of audio in minutes            | Beginner     |
+| [03](03-tts-generation-at-scale.md)     | **TTS Generation at Scale**       | Faster-than-realtime voice synthesis | Beginner     |
+| [04](04-fine-tuning-training.md)        | **Fine-Tuning / Training**        | Customize models on your own data    | Intermediate |
+| [05](05-cuda-tensorrt-ml-research.md)   | **CUDA / TensorRT / ML Research** | Full NVIDIA ML toolchain             | Intermediate |
+| [06](06-stable-diffusion-image-gen.md)  | **Stable Diffusion / Image Gen**  | 5–8s per image, unlimited free       | Beginner     |
+| [07](07-multi-gpu-workloads.md)         | **Multi-GPU Workloads (Future)**  | Scaling path to production           | Advanced     |
+
+---
+
+## Suggested Learning Order
+
+```
+Week 1:  01 (Inference) → 02 (Whisper) → 03 (TTS)
+         Get familiar with the GPU, benchmark your models
+
+Week 2:  06 (Stable Diffusion)
+         Set up ComfyUI, generate app assets
+
+Week 3:  04 (Fine-Tuning)
+         QLoRA your first 7B model on your own code
+
+Week 4:  05 (CUDA / TensorRT)
+         Deeper GPU programming, profiling, optimization
+
+Ongoing: 07 (Multi-GPU)
+         Reference as you plan scaling
+```
+
+---
+
+## Prerequisites
+
+All guides assume you've completed the [Windows setup](../setup-guide.md):
+
+- NVIDIA drivers installed (Windows)
+- Ollama installed and running (Windows)
+- WSL2 Ubuntu 24.04 set up
+- Repo cloned, `setup-tts.sh` completed
+- Dashboard running at `http://localhost:3000`