docs(local-llms): add 7 RTX 5090 capability deep-dive guides

New capabilities/ subfolder with detailed guides:
- 01: GPU inference speed (benchmarks, Ollama tuning, API usage)
- 02: Whisper batch transcription (scripts, Python integration, use cases)
- 03: TTS generation at scale (Orpheus + Qwen3, batch scripts, voice cloning)
- 04: Fine-tuning / training (LoRA, QLoRA, data prep, Ollama export)
- 05: CUDA / TensorRT / ML research (toolchain setup, Triton kernels, profiling)
- 06: Stable Diffusion / image gen (ComfyUI, SDXL, FLUX, batch generation)
- 07: Multi-GPU workloads (scaling path, eGPU, cloud, cost planning)
- README: index with learning order and prerequisites

Each guide covers: what it is, how to use it, benefits, skills to learn
saravanakumardb1 2026-02-21 20:36:21 -08:00
parent 1650e0da6c
commit 6d18344fe0
8 changed files with 2237 additions and 0 deletions


@ -0,0 +1,174 @@
# 1. Raw GPU Inference Speed
> **RTX 5090:** 2–4× faster inference on all models ≤32B compared to Mac M4 Pro
> **Why it matters:** Faster coding suggestions, faster conversations, faster iteration
---
## What Is GPU Inference?
When you run a model like `qwen2.5-coder:32b` through Ollama, the GPU does the heavy lifting — multiplying billions of numbers (matrix operations) to generate each token of the response. The speed of this process depends on:
1. **VRAM bandwidth** — how fast data moves within the GPU
2. **Compute cores** — how many operations run in parallel
3. **VRAM capacity** — whether the full model fits without spilling to CPU RAM
```
┌──────────────────────────────────────────────────────────────────────┐
│  Token Generation Pipeline                                           │
│                                                                      │
│  Prompt → [Tokenize] → [GPU: Matrix Multiply] → [Sample] → Token     │
│                                   ▲                                  │
│                                   │                                  │
│                        This is the bottleneck.                       │
│                    RTX 5090 does this 2–4× faster.                   │
└──────────────────────────────────────────────────────────────────────┘
```
---
## Performance: Mac vs Razer
| Model | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
| ------------------------- | ---------------- | --------------------- | ------- |
| llama3.1:8b (4.9 GB) | ~50–70 tok/s | ~100–150 tok/s | ~2× |
| qwen2.5-coder:7b (4.7 GB) | ~40–60 tok/s | ~80–120 tok/s | ~2× |
| qwen2.5-coder:32b (19 GB) | ~15–25 tok/s | ~40–60 tok/s | ~2.5× |
| deepseek-r1:32b (19 GB) | ~15–25 tok/s | ~40–60 tok/s | ~2.5× |
| sematre/orpheus:en (4 GB) | ~realtime | ~2–3× realtime | ~2.5× |
### Why the RTX 5090 Is Faster
```
┌─────────────────────────┬──────────────────────┬──────────────────────────┐
│ Metric │ Mac M4 Pro │ RTX 5090 │
├─────────────────────────┼──────────────────────┼──────────────────────────┤
│ GPU memory bandwidth │ ~273 GB/s (shared) │ ~1,000+ GB/s (GDDR7) │
│ Compute cores │ 20 Metal cores │ ~18,000 CUDA cores │
│ Tensor cores │ None (Neural Engine) │ 5th/6th gen Tensor cores │
│ FP16 throughput │ ~25 TFLOPS │ ~200+ TFLOPS │
│ Model in memory? │ Yes (unified 48 GB) │ Yes (24 GB VRAM) │
└─────────────────────────┴──────────────────────┴──────────────────────────┘
```
The RTX 5090's GDDR7 bandwidth is ~4× higher, and it has ~8× the raw FP16 compute throughput. The actual speedup is 2–4× (not 8×) because inference is mostly **memory-bandwidth bound**, not compute-bound: the GPU spends most of its time reading model weights, not computing.
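Because decoding is bandwidth bound, a rough ceiling on generation speed can be estimated by dividing memory bandwidth by the bytes read per token, which is approximately the model's size in memory. The sketch below is a back-of-envelope estimate, not a benchmark: the function name is illustrative, the bandwidth figures are the approximate ones from the table above, and it ignores KV-cache reads, kernel overhead, and batching.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token needs one full read of the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Approximate bandwidth figures from the comparison table (assumptions, not measurements)
for name, bandwidth in [("Mac M4 Pro", 273.0), ("RTX 5090", 1000.0)]:
    ceiling = max_tokens_per_second(bandwidth, model_size_gb=19.0)
    print(f"{name}: at most ~{ceiling:.0f} tok/s for a 19 GB model")
```

The ratio between the two ceilings is the bandwidth ratio (about 3.7×), which is why the observed speedup sits well below the 8× compute gap.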
---
## How to Use It
### Basic: Ollama (Already Set Up)
Ollama runs natively on Windows and uses CUDA automatically. No extra config needed.
```bash
# From WSL2 or Windows terminal
ollama run qwen2.5-coder:32b "Write a Fastify route that validates input with Zod"
```
### Interactive Coding Assistant
```bash
# Start a conversation with the 32B coding model
ollama run qwen2.5-coder:32b
# Or use the 7B model for quick questions (faster response start)
ollama run qwen2.5-coder:7b
```
### From the Dashboard
```bash
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
bash start-dashboard.sh
# Open http://localhost:3000 — model status + inference visible
```
### Benchmark Your Actual Speed
```bash
# Quick benchmark — measure tokens per second
time ollama run qwen2.5-coder:7b "Write a Python function that implements binary search" --verbose 2>&1 | tail -5
# Compare models
for model in llama3.1:8b qwen2.5-coder:7b qwen2.5-coder:32b; do
  echo "=== $model ==="
  ollama run "$model" "Hello world" --verbose 2>&1 | grep "eval rate"
done
```
### API Access (for Scripts/Apps)
```bash
# Ollama exposes a REST API at localhost:11434
curl -s http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:32b",
"prompt": "Explain CUDA memory hierarchy in 3 sentences",
"stream": false
}' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
```
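The same endpoint can be called from Python, and the non-streaming response carries timing fields that make it easy to compute tokens per second. The sketch below uses only the standard library; `eval_count` and `eval_duration` (nanoseconds) are fields of Ollama's generate response, while the helper names and the default timeout are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Decode speed from eval_count and eval_duration (reported in nanoseconds)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


def generate(model: str, prompt: str, timeout: float = 300.0) -> dict:
    """POST a non-streaming request to a local Ollama server and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())


# Example (requires a running Ollama server, so it is left commented out):
# result = generate("qwen2.5-coder:7b", "Explain CUDA memory hierarchy in 3 sentences")
# print(result["response"])
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Keeping the speed calculation separate from the network call means it can also be applied to responses saved from earlier runs.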
---
## Benefits
### For Your LysnrAI / MindLyst Projects
- **Faster code generation** — 32B model responses in ~0.5–1.5s vs ~2–4s on Mac
- **More context in less time** — can process longer prompts without waiting
- **Better for agentic workflows** — LangGraph agents that call LLMs multiple times per step run 2–4× faster end-to-end
- **Batch processing** — generate embeddings, summaries, or classifications for hundreds of items quickly
### For Daily Coding
- **Near-instant small model responses** — 7B at 80–120 tok/s feels like reading speed
- **Viable 32B coding assistant** — 40–60 tok/s is fast enough for real-time pair programming
- **DeepSeek-R1 reasoning** — chain-of-thought at 40–60 tok/s makes complex reasoning practical
---
## Skills You'll Learn
| Skill | What You'll Learn | Why It Matters |
| -------------------------- | -------------------------------------------------------------------- | -------------------------------------- |
| **GPU memory management** | How VRAM allocation works, model offloading, quantization trade-offs | Core ML engineering skill |
| **CUDA profiling** | Using `nvidia-smi`, `nvtop`, watching GPU utilization | Essential for optimizing AI workloads |
| **Quantization** | Q4 vs Q8 vs FP16 — speed/quality trade-offs | Industry-standard model deployment |
| **Inference optimization** | Batch size, context length, KV cache tuning | Key for production AI systems |
| **Model selection** | When to use 7B vs 32B vs 70B for different tasks | Practical AI engineering judgment |
| **REST API integration** | Building apps that call local LLM APIs | Directly applicable to LysnrAI backend |
---
## Advanced: Tuning Ollama for Performance
```bash
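# Set number of GPU layers (default: as many as fit in VRAM)
# Useful if you want to run 2 models with partial GPU offload
# Note: num_gpu is a per-request option (or a Modelfile PARAMETER), not a server environment variable
curl -s http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "Hello", "stream": false, "options": {"num_gpu": 999}}'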
# Monitor GPU during inference
watch -n 0.5 nvidia-smi
# Or install nvtop for a richer GPU monitor
sudo apt install nvtop
nvtop
```
### Ollama Environment Variables
| Variable | Default | Description |
| -------------------------- | ------- | --------------------------------------- |
| `OLLAMA_NUM_PARALLEL` | 1 | Concurrent request slots |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Models kept in VRAM simultaneously |
| `OLLAMA_FLASH_ATTENTION` | true | Use flash attention (faster, less VRAM) |
| `OLLAMA_GPU_OVERHEAD` | 0 | Reserve VRAM (bytes) for other apps |
---
## Next Steps
- [ ] Benchmark all 5 models on the Razer and record actual tok/s
- [ ] Try `OLLAMA_NUM_PARALLEL=4` for concurrent requests
- [ ] Experiment with `OLLAMA_MAX_LOADED_MODELS=2` to keep 7B + 32B hot
- [ ] Build a simple script that compares Mac vs Razer inference times


@ -0,0 +1,306 @@
# 2. Whisper Batch Transcription
> **RTX 5090:** 8–15× realtime transcription vs 2–4× on Mac
> **Why it matters:** Hours of audio transcribed in minutes — unlocks bulk audio workflows
---
## What Is Whisper?
[Whisper](https://github.com/openai/whisper) is OpenAI's open-source speech-to-text model. We use [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — a high-performance C/C++ implementation that supports CUDA GPU acceleration.
The `large-v3-turbo` model (~1.5 GB) delivers near-human accuracy across 99 languages.
```
┌──────────────────────────────────────────────────────────────────────┐
│  Whisper Transcription Pipeline                                      │
│                                                                      │
│  Audio File (.wav/.mp3)                                              │
│      │                                                               │
│      ▼                                                               │
│  [FFmpeg: decode + resample to 16kHz mono]                           │
│      │                                                               │
│      ▼                                                               │
│  [Whisper: Mel spectrogram → Encoder → Decoder → Text]               │
│      │              ▲                                                │
│      │              │ GPU accelerates this (CUDA or Metal)           │
│      ▼                                                               │
│  Transcript (.txt / .srt / .vtt / .json)                             │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
---
## Performance: Mac vs Razer
| Audio Length | Mac M4 Pro (Metal) | Razer RTX 5090 (CUDA) | Speedup |
| ------------ | ------------------ | --------------------- | ------- |
| 1 minute | ~15–30s | ~4–8s | ~3× |
| 10 minutes | ~2.5–5 min | ~40–80s | ~3× |
| 1 hour | ~15–30 min | ~4–8 min | ~3× |
| 10 hours | ~2.5–5 hours | ~40–80 min | ~3–4× |
| 100 hours | ~1–2 days | ~7–13 hours | ~3–4× |
### Realtime Multiplier
| Machine | Speed | Meaning |
| -------------- | --------------- | ------------------------ |
| Mac M4 Pro | ~2–4× realtime | 1 hour audio → 15–30 min |
| Razer RTX 5090 | ~8–15× realtime | 1 hour audio → 4–8 min |
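The realtime multiplier converts directly into wall-clock time: processing time is the audio duration divided by the multiplier. A small sketch of that arithmetic follows; the function name is illustrative, and the multipliers passed in are approximate figures rather than measurements.

```python
def processing_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Wall-clock minutes needed to transcribe audio at a given realtime multiplier."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes / realtime_factor


# One hour of audio at the slow and fast ends of an 8x to 15x range
for factor in (8, 15):
    print(f"{factor}x realtime: {processing_minutes(60, factor):.1f} min")
```

The same function scales to batch planning: pass the total minutes of audio in a folder to estimate how long an overnight run will take.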
---
## How to Use It
### Single File Transcription
```bash
# Basic transcription
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
# With timestamps (SRT format for subtitles)
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -osrt
# JSON output with word-level timestamps
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -ojf
# Specify language (skip auto-detect for speed)
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l en
```
### Batch Transcription Script
Create this script to transcribe an entire folder of audio files:
```bash
#!/bin/bash
# batch-transcribe.sh: Transcribe all audio files in a directory
# Usage: bash batch-transcribe.sh /path/to/audio/files

INPUT_DIR="${1:-.}"
MODEL="$HOME/whisper-models/ggml-large-v3-turbo.bin"
OUTPUT_DIR="${INPUT_DIR}/transcripts"

mkdir -p "$OUTPUT_DIR"

echo "=== Batch Whisper Transcription ==="
echo "Input:  $INPUT_DIR"
echo "Output: $OUTPUT_DIR"
echo "Model:  ggml-large-v3-turbo"
echo ""

# Collect matching files. nullglob drops patterns that match nothing, so no
# stderr redirect is needed (a redirect inside `for ... in` is a syntax error).
shopt -s nullglob
FILES=("$INPUT_DIR"/*.{wav,mp3,m4a,flac,ogg,webm})
shopt -u nullglob

TOTAL=${#FILES[@]}
DONE=0
START_TIME=$(date +%s)

echo "Found $TOTAL audio files"
echo ""

for f in "${FILES[@]}"; do
  [ -f "$f" ] || continue
  DONE=$((DONE + 1))
  BASENAME=$(basename "$f")
  BASENAME="${BASENAME%.*}"
  echo "[$DONE/$TOTAL] Transcribing: $(basename "$f")"

  # One pass writes both the plain-text transcript and the SRT subtitles
  whisper-cli \
    -m "$MODEL" \
    -f "$f" \
    -l en \
    -otxt \
    -osrt \
    -of "$OUTPUT_DIR/$BASENAME" \
    2>/dev/null

  echo "  -> $OUTPUT_DIR/$BASENAME.txt"
  echo "  -> $OUTPUT_DIR/$BASENAME.srt"
  echo ""
done

END_TIME=$(date +%s)
ELAPSED=$((END_TIME - START_TIME))
echo "=== Done! $DONE files in ${ELAPSED}s ==="
### Convert Non-WAV Audio First
```bash
# Whisper works best with 16kHz mono WAV
# Convert any audio format with ffmpeg
# Single file
ffmpeg -i podcast.mp3 -ar 16000 -ac 1 podcast.wav
# Batch convert all MP3s in a folder
for f in *.mp3; do
  ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mp3}.wav"
done
```
### Python Integration
```python
"""Transcribe audio using whisper.cpp via subprocess."""
import json
import subprocess
from pathlib import Path


def transcribe(audio_path: str, language: str = "en") -> dict:
    """Transcribe an audio file and return structured result."""
    model = Path.home() / "whisper-models" / "ggml-large-v3-turbo.bin"
    # whisper-cli appends ".json" to this base name (written to the current directory)
    output_base = Path(audio_path).stem

    result = subprocess.run(
        [
            "whisper-cli",
            "-m", str(model),
            "-f", audio_path,
            "-l", language,
            "-ojf",  # JSON with full metadata
            "-of", output_base,
        ],
        capture_output=True, text=True, timeout=600,
    )

    # Read the JSON output
    json_path = Path(f"{output_base}.json")
    if result.returncode == 0 and json_path.exists():
        with open(json_path) as f:
            return json.load(f)
    return {"error": result.stderr, "text": result.stdout}


# Usage
result = transcribe("meeting-recording.wav")
if "error" in result:
    print(f"Transcription failed: {result['error']}")
else:
    print(result["transcription"][0]["text"])
```
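The usage line above prints only the first segment. A minimal sketch for joining every segment into one transcript follows. It assumes the JSON layout implied by `result["transcription"][0]["text"]`, that is, a `transcription` list of objects that each carry a `text` field; the helper name `full_text` and the sample data are illustrative.

```python
def full_text(result: dict) -> str:
    """Join the text of every segment in a whisper.cpp-style JSON result."""
    segments = result.get("transcription", [])
    parts = [seg.get("text", "").strip() for seg in segments]
    return " ".join(part for part in parts if part)


# Hand-written sample in the assumed layout (not real whisper output)
sample = {
    "transcription": [
        {"text": " Hello everyone."},
        {"text": " Let's get started."},
    ]
}
print(full_text(sample))
```

Returning an empty string when the `transcription` key is missing lets callers treat a failed run and a silent recording the same way.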
---
## Real-World Use Cases
### 1. LysnrAI Voice Dictation Pipeline
Your LysnrAI desktop app captures voice → sends to Whisper for transcription. On the Razer, this becomes near-instant:
```
Voice input (5 seconds) → Whisper (CUDA) → Text in <1 second
```
### 2. Meeting Transcription
```bash
# Record a 1-hour Zoom meeting → get full transcript in ~5 minutes
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
-f zoom-meeting.wav -l en -otxt -osrt
```
### 3. Podcast / YouTube Processing
```bash
# Download YouTube audio
yt-dlp -x --audio-format wav "https://youtube.com/watch?v=..." -o "video.wav"
# Transcribe
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f video.wav -otxt -osrt
```
### 4. Subtitle Generation
```bash
# Generate SRT subtitles for video
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
-f movie.wav -l en -osrt
# Output: movie.srt — ready to import into video editors
```
### 5. Multi-Language Transcription
```bash
# Auto-detect language
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
# Force specific language
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ja # Japanese
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ta # Tamil
```
---
## Benefits
| Benefit | Description |
| --------------- | ------------------------------------------------------------ |
| **Speed** | Process a full day of meetings in under an hour |
| **Privacy** | All transcription runs locally — no data leaves your machine |
| **Cost** | Zero API costs (vs $0.006/min for cloud Whisper API) |
| **Accuracy** | large-v3-turbo is near-human accuracy for English |
| **Offline** | Works without internet — useful on flights, trains |
| **Batch scale** | Process hundreds of files overnight |
### Cost Comparison (100 hours of audio)
| Method | Cost | Time |
| -------------------------- | ------ | ---------------- |
| OpenAI Whisper API | ~$36 | ~minutes (cloud) |
| Mac M4 Pro (local) | $0 | ~25–50 hours |
| **Razer RTX 5090 (local)** | **$0** | **~7–13 hours** |
---
## Skills You'll Learn
| Skill | What You'll Learn | Career Value |
| ---------------------------- | ------------------------------------------------------ | -------------------------------------- |
| **Audio processing** | Sample rates, codecs, mono/stereo, WAV vs compressed | Foundational for any audio/speech work |
| **Speech-to-text pipelines** | Mel spectrograms, encoder-decoder models, beam search | Core ML/NLP skill |
| **CUDA acceleration** | How GPU parallelism speeds up neural network inference | Top ML engineering skill |
| **Batch processing** | Shell scripting for processing thousands of files | DevOps / data engineering |
| **Subtitle formats** | SRT, VTT, JSON — standards for timed text | Media tech / accessibility |
| **Model quantization** | Understanding why ggml models are smaller and faster | ML deployment knowledge |
| **ffmpeg mastery** | Audio/video conversion, resampling, format detection | Essential multimedia tool |
| **Python subprocess** | Integrating CLI tools into Python applications | Backend engineering pattern |
---
## Monitoring GPU During Transcription
```bash
# Watch GPU utilization in real-time
watch -n 0.5 nvidia-smi
# Or use nvtop for a richer view
sudo apt install nvtop
nvtop
# Expected during Whisper:
# GPU Utilization: 80–95%
# VRAM Usage: ~2–3 GB (model + buffers)
# Power: ~150–200 W
```
---
## Next Steps
- [ ] Transcribe a test audio file and verify output quality
- [ ] Create `batch-transcribe.sh` in `__LOCAL_LLMs/scripts/`
- [ ] Benchmark: time a 10-minute file on Razer vs Mac
- [ ] Try multi-language transcription (English + Tamil)
- [ ] Integrate Whisper output into LysnrAI transcription pipeline
- [ ] Experiment with `whisper-cli --translate` for translation mode


@ -0,0 +1,303 @@
# 3. TTS Generation at Scale
> **RTX 5090:** Qwen3-TTS at 2–4× realtime, Orpheus at 2–3× realtime
> **Why it matters:** Pre-generate audio libraries, build voice features, create content — all faster than real-time playback
---
## What Is Local TTS?
Text-to-Speech (TTS) converts written text into natural-sounding speech. Our stack has two engines:
| Engine | Architecture | Size | Voices | Quality |
| ------------- | ------------------------------- | ----------- | ----------------------- | ------------------------------------- |
| **Orpheus** | Ollama-served, SNAC decoder | 4 GB | 8 English voices | Extremely natural, emotional |
| **Qwen3-TTS** | PyTorch model, direct inference | 0.6B params | 10 languages, cloneable | Multilingual, zero-shot voice cloning |
```
┌──────────────────────────────────────────────────────────────────────┐
│  TTS Pipeline                                                        │
│                                                                      │
│  Orpheus:                                                            │
│    Text → [Ollama: generate audio tokens] → [SNAC: decode to WAV]    │
│                   ▲ GPU (CUDA)                  ▲ GPU (CUDA)         │
│                                                                      │
│  Qwen3-TTS:                                                          │
│    Text → [PyTorch model: text→mel→audio] → WAV file                 │
│                   ▲ GPU (CUDA / MPS)                                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
---
## Performance: Mac vs Razer
| Engine | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
| ---------------------------- | ---------------- | --------------------- | ------- |
| Orpheus (per sentence) | ~realtime | ~2–3× realtime | ~2.5× |
| Qwen3-TTS (per sentence) | ~realtime | ~2–4× realtime | ~3× |
| Orpheus (10 min narration) | ~10 min | ~3–5 min | ~2.5× |
| Qwen3-TTS (10 min narration) | ~10 min | ~2.5–5 min | ~3× |
| Batch: 100 sentences | ~5–8 min | ~2–3 min | ~3× |
| Batch: 1000 sentences | ~50–80 min | ~15–25 min | ~3× |
**"2–4× realtime" means:** A 10-second sentence generates in 2.5–5 seconds. The audio is produced faster than you could listen to it.
---
## How to Use It
### Orpheus TTS (8 Natural Voices)
Orpheus runs through Ollama + SNAC decoder. Already set up by `setup-tts.sh`.
```bash
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
# Generate speech with default voice (tara)
.venv-qwen-tts/bin/python test_orpheus_tts.py
# Output: test_orpheus_tara.wav, test_orpheus_leah.wav, etc.
# Play: aplay test_orpheus_tara.wav
```
#### Available Orpheus Voices
| Voice | Character | Best For |
| ------ | ------------------ | --------------------- |
| `tara` | Young female, warm | Narration, assistants |
| `leah` | Female, clear | Tutorials, guides |
| `jess` | Female, energetic | Announcements |
| `leo` | Male, calm | Narration, podcasts |
| `dan` | Male, professional | Business content |
| `mia` | Female, friendly | Conversational |
| `zac` | Male, young | Casual content |
| `zoe` | Female, neutral | General purpose |
#### Custom Text with Orpheus (Python)
```python
"""Generate speech from custom text using Orpheus TTS."""
import json
import struct
import wave
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate_speech(text: str, voice: str = "tara", output_path: str = "output.wav"):
    """Generate a WAV file from text using Orpheus via Ollama."""
    prompt = f"<|audio|>{voice}: {text}"
    payload = json.dumps({
        "model": "sematre/orpheus:en",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.6, "top_p": 0.9}
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        result = json.loads(resp.read())
    # Decode audio tokens → SNAC → WAV
    # (Full implementation in test_orpheus_tts.py)
    print(f"Generated: {output_path}")


# Example
generate_speech(
    "Welcome to LysnrAI. Your voice-first productivity assistant.",
    voice="tara",
    output_path="welcome.wav"
)
```
### Qwen3-TTS (Multilingual + Voice Cloning)
```bash
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
# Generate speech with Qwen3-TTS
.venv-qwen-tts/bin/python test_qwen_tts.py
# Output: test_qwen3_tts_output.wav
# Play: aplay test_qwen3_tts_output.wav
```
#### Qwen3-TTS Features
| Feature | Description |
| ------------------- | ----------------------------------------------------------------------------------------- |
| **10 languages** | English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian |
| **Voice cloning** | Provide a reference audio clip → model clones the voice |
| **Emotion control** | Adjust speaking style via prompt engineering |
| **0.6B parameters** | Small enough to run fast, large enough for quality |
### Batch TTS Generation Script
```bash
#!/bin/bash
# batch-tts.sh — Generate audio for a list of sentences
# Usage: bash batch-tts.sh sentences.txt output_dir/
INPUT_FILE="${1:-sentences.txt}"
OUTPUT_DIR="${2:-tts_output}"
VOICE="${3:-tara}"
mkdir -p "$OUTPUT_DIR"
echo "=== Batch Orpheus TTS ==="
echo "Input: $INPUT_FILE"
echo "Output: $OUTPUT_DIR"
echo "Voice: $VOICE"
echo ""
LINE_NUM=0
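while IFS= read -r line; do
  [ -z "$line" ] && continue
  ((LINE_NUM++))
  OUT_FILE="$OUTPUT_DIR/sentence_${LINE_NUM}_${VOICE}.wav"
  echo "[$LINE_NUM] Generating: ${line:0:60}..."
  # Pass text, voice, and path as arguments instead of splicing them into the
  # Python source, so sentences containing quotes cannot break the call.
  # stdin is redirected so Python cannot consume the remaining input lines.
  .venv-qwen-tts/bin/python -c '
import sys
import test_orpheus_tts as tts
tts.generate_and_save(sys.argv[1], sys.argv[2], sys.argv[3])
' "$line" "$VOICE" "$OUT_FILE" 2>/dev/null < /dev/null
  echo " -> $OUT_FILE"
done < "$INPUT_FILE"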
echo ""
echo "=== Done! $LINE_NUM files generated ==="
```
---
## Real-World Use Cases
### 1. LysnrAI Voice Responses
Generate spoken responses from LLM output — the Razer can produce audio faster than the user can listen:
```
User asks question → LLM generates text → TTS converts to speech → User hears answer
                                          2–4× realtime on RTX 5090
                                          Feels instant for short responses
```
### 2. Pre-Generated Audio Libraries
Build a library of common phrases, greetings, and responses:
```bash
# sentences.txt
Welcome to LysnrAI.
Your daily brief is ready.
You have three new memories to review.
Recording started.
Recording saved successfully.
Transcription complete.
```
```bash
# Generate all phrases in multiple voices
for voice in tara leo dan; do
bash batch-tts.sh sentences.txt audio_library/ "$voice"
done
```
### 3. Audiobook / Podcast Generation
```python
# Split a document into paragraphs and generate audio for each
paragraphs = open("article.txt").read().split("\n\n")
for i, para in enumerate(paragraphs):
    generate_speech(para, voice="leo", output_path=f"chapter_{i:03d}.wav")
# Concatenate with ffmpeg
# ffmpeg -f concat -i filelist.txt -c copy full_audiobook.wav
```
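The commented ffmpeg command above expects a `filelist.txt` in the concat demuxer's format: one `file '<path>'` line per input, in playback order. A small sketch that builds that text follows, assuming the `chapter_NNN.wav` naming used in the loop; the helper name is illustrative.

```python
def build_concat_list(wav_paths: list[str]) -> str:
    """Return concat-demuxer list text: one file '<path>' line per input, in order."""
    lines = []
    for path in wav_paths:
        escaped = path.replace("'", "'\\''")  # close quote, escaped quote, reopen
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


chapters = [f"chapter_{i:03d}.wav" for i in range(3)]
listing = build_concat_list(chapters)
print(listing, end="")
# To use it: write `listing` to filelist.txt, then run the ffmpeg command above
```

Zero-padded names keep the chapters in the right order when the list is built from a sorted directory listing instead of a counter.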
### 4. Multilingual Content (Qwen3-TTS)
```python
# Generate the same message in multiple languages
messages = {
"en": "Welcome to MindLyst, your AI-powered life organizer.",
"ja": "MindLystへようこそ。AIライフオーガナイザーです。",
"es": "Bienvenido a MindLyst, tu organizador de vida con IA.",
}
for lang, text in messages.items():
    generate_qwen_tts(text, output_path=f"welcome_{lang}.wav")
```
### 5. Voice Cloning (Qwen3-TTS)
```python
# Clone a voice from a reference audio sample
# Provide a 5–15 second reference clip of the target voice
reference_audio = "my_voice_sample.wav"
text = "This is my cloned voice saying something new."
# Qwen3-TTS can reproduce the voice characteristics
generate_qwen_tts(text, reference=reference_audio, output_path="cloned.wav")
```
---
## Benefits
| Benefit | Description |
| ----------------- | -------------------------------------------------------- |
| **Speed** | Generate audio faster than playback speed |
| **Privacy** | All voice generation runs locally — no cloud APIs |
| **Cost** | $0 vs $0.015/1K chars for cloud TTS (Google, ElevenLabs) |
| **Voice variety** | 8 Orpheus voices + unlimited Qwen3-TTS voice cloning |
| **Multilingual** | Qwen3-TTS supports 10 languages natively |
| **Offline** | Works without internet |
| **Customizable** | Control emotion, speed, voice characteristics |
### Cost Comparison (Generate 1 hour of audio)
| Method | Cost | Time |
| -------------------------- | ------- | ---------------- |
| ElevenLabs API | ~$15–30 | ~minutes (cloud) |
| Google Cloud TTS | ~$4–16 | ~minutes (cloud) |
| Mac M4 Pro (local) | $0 | ~60 min |
| **Razer RTX 5090 (local)** | **$0** | **~15–30 min** |
---
## Skills You'll Learn
| Skill | What You'll Learn | Career Value |
| ------------------------- | ------------------------------------------------------- | ------------------------------ |
| **Speech synthesis** | How neural TTS works (text→tokens→mel→audio) | Core speech/NLP skill |
| **Audio codecs** | SNAC, Encodec, WAV format, sample rates | Audio engineering fundamentals |
| **Voice cloning** | Zero-shot voice cloning techniques | Cutting-edge ML research area |
| **Batch processing** | Automating large-scale audio generation | Production engineering |
| **GPU memory** | Managing VRAM for concurrent TTS + LLM workloads | ML ops knowledge |
| **Audio post-processing** | ffmpeg: concatenation, normalization, format conversion | Multimedia engineering |
| **API design** | Building REST APIs around TTS engines | Backend engineering |
| **Multilingual NLP** | Cross-language text processing and pronunciation | Global product development |
---
## Next Steps
- [ ] Generate test audio with both Orpheus and Qwen3-TTS on Razer
- [ ] Create `batch-tts.sh` in `__LOCAL_LLMs/scripts/`
- [ ] Build a pre-generated audio library for LysnrAI common phrases
- [ ] Experiment with Qwen3-TTS voice cloning using your own voice
- [ ] Try generating audio in Tamil (Qwen3-TTS multilingual)
- [ ] Measure actual generation speed: words-per-second on each engine


@ -0,0 +1,322 @@
# 4. Fine-Tuning / Training
> **RTX 5090:** 24 GB VRAM enables LoRA fine-tuning of 7B–13B models locally
> **Why it matters:** Customize models for your specific use cases — coding style, domain knowledge, voice commands
---
## What Is Fine-Tuning?
Fine-tuning takes a pre-trained model (like Llama 3.1 8B) and trains it further on your own data to specialize its behavior. Instead of training from scratch (which costs millions), you adjust a small fraction of the model's weights.
```
┌──────────────────────────────────────────────────────────────────────┐
│  Fine-Tuning vs Prompting                                            │
│                                                                      │
│  Prompting:    "You are a coding assistant for TypeScript..."        │
│                Works OK, but limited by context window               │
│                Model doesn't truly "learn" your preferences          │
│                                                                      │
│  Fine-Tuning:  Train on 1000s of your code examples                  │
│                Model internalizes your coding patterns               │
│                Better quality, no prompt overhead, faster inference  │
│                                                                      │
│  LoRA:         Fine-tune only ~1–5% of parameters                    │
│                Needs 16–24 GB VRAM (fits RTX 5090)                   │
│                Training time: hours, not days                        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### Why Not on Mac?
| Aspect | Mac M4 Pro (MPS) | RTX 5090 (CUDA) |
| ----------------- | ------------------------------ | ------------------ |
| Training support | Limited MPS support | Full CUDA + cuDNN |
| Framework support | PyTorch MPS (some ops missing) | Full PyTorch CUDA |
| VRAM for training | ~30 GB usable (shared) | 24 GB dedicated |
| Memory bandwidth | ~273 GB/s | ~1,000+ GB/s |
| Training speed | 5–10× slower than CUDA | Baseline |
| LoRA libraries | Partial compatibility | Full compatibility |
**Training is compute AND memory bandwidth intensive:** the RTX 5090's ~1 TB/s VRAM bandwidth makes it 5–10× faster than MPS for gradient computation.
---
## Fine-Tuning Methods
### LoRA (Low-Rank Adaptation) — Recommended
Trains only small "adapter" matrices (~1–5% of model parameters). The base model stays frozen.
```
┌──────────────────────────────────────────────────────────────────────┐
│  LoRA Architecture                                                   │
│                                                                      │
│  Base Model (frozen, ~16 GB)                                         │
│  ┌─────────────────────────────────────────────┐                     │
│  │ Layer 1: [Attention] [FFN]                  │                     │
│  │ Layer 2: [Attention] [FFN]                  │  ← Not modified     │
│  │ ...                                         │                     │
│  │ Layer N: [Attention] [FFN]                  │                     │
│  └─────────────────────────────────────────────┘                     │
│        ↕ small adapters (rank 8–64)                                  │
│  ┌─────────────────────────────────────────────┐                     │
│  │ LoRA Adapter A (64 KB per layer)            │  ← Trainable        │
│  │ LoRA Adapter B (64 KB per layer)            │  ← Trainable        │
│  └─────────────────────────────────────────────┘                     │
│                                                                      │
│  Total trainable params: ~10–50 MB (vs 8–16 GB base)                 │
│  VRAM needed: ~18–22 GB for 7B model                                 │
│  Training time: ~1–4 hours for 1000 examples                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
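The adapter sizes follow from simple arithmetic: for a weight matrix of shape `d_out × d_in`, LoRA adds two matrices of shape `r × d_in` and `d_out × r`, so `r × (d_in + d_out)` trainable parameters. The sketch below applies this to an 8B Llama-style model with rank 16 on the four attention projections. The dimensions (hidden size 4096, key/value projection width 1024, 32 layers) are assumptions taken from the published Llama 3.1 8B configuration, and the helper name is illustrative.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


# Assumed dimensions for an 8B Llama-style model (see the paragraph above)
HIDDEN, KV_WIDTH, LAYERS, RANK = 4096, 1024, 32, 16

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)      # q_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)  # k_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)    # o_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable parameters (~{total * 2 / 1e6:.0f} MB in FP16)")
```

Doubling the rank doubles the adapter size, so the rank is the main knob for trading capacity against VRAM.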
### QLoRA (Quantized LoRA) — For Larger Models
Loads the base model in 4-bit quantization, trains LoRA adapters in FP16. Halves memory requirements.
| Method | 7B Model VRAM | 13B Model VRAM | 32B Model VRAM |
| ----------------- | ------------- | -------------- | -------------- |
| Full fine-tune | ~56 GB | ~104 GB | ~256 GB |
| LoRA (FP16) | ~18 GB | ~32 GB | ~72 GB |
| **QLoRA (4-bit)** | **~8 GB** | **~14 GB** | **~22 GB** |
**QLoRA at 32B fits in 24 GB VRAM** — this is remarkable for a laptop.
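The gap between the rows comes mostly from how many bytes each base-model weight occupies. A back-of-envelope sketch for the weights alone is below. It deliberately ignores activations, gradients, optimizer state, and adapter weights, which is why the table's totals are higher; the function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for size in (7, 13, 32):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 weights ~{fp16:.1f} GB, 4-bit weights ~{four_bit:.1f} GB")
```

For a 32B model the 4-bit weights alone take about 16 GB, which leaves only a few gigabytes of a 24 GB card for everything else, so short sequences and small batches are the practical limit at that size.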
---
## How to Use It
### Setup: Install Training Libraries
```bash
# Create a dedicated training venv (kept separate from the TTS venv)
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
python3.12 -m venv .venv-training
source .venv-training/bin/activate
# Install training stack
# RTX 50-series (Blackwell) GPUs need a PyTorch build compiled against CUDA 12.8 or newer
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install transformers datasets accelerate peft bitsandbytes trl
pip install wandb # optional: experiment tracking
```
### Example 1: Fine-Tune Llama 3.1 8B for Your Coding Style
```python
"""Fine-tune Llama 3.1 8B with LoRA on your coding examples."""
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
# 1. Prepare your data
# Format: instruction → response pairs from YOUR code
training_data = [
{
"instruction": "Write a Fastify route that creates a user with Zod validation",
"response": """import { z } from 'zod';
import type { FastifyInstance } from 'fastify';
const CreateUserSchema = z.object({
email: z.string().email(),
name: z.string().min(1).max(100),
productId: z.string(),
});
export async function userRoutes(app: FastifyInstance) {
app.post('/users', async (req, reply) => {
const body = CreateUserSchema.parse(req.body);
const user = await req.server.userRepository.create(body);
return reply.status(201).send(user);
});
}"""
},
# Add 100–1000 more examples from your actual codebase
]
dataset = Dataset.from_list([
{"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
for d in training_data
])
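# 2. Load model in 4-bit (QLoRA)
import torch
from transformers import BitsAndBytesConfig
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Passing load_in_4bit directly is deprecated in recent transformers releases;
# the quantization settings go through BitsAndBytesConfig instead
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # matches fp16=True in the training args
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)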
# 3. Configure LoRA
lora_config = LoraConfig(
r=16, # Rank (higher = more capacity, more VRAM)
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~10M / total: ~8B (0.13%)
# 4. Train
training_args = TrainingArguments(
output_dir="./lora-llama-coding",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=10,
logging_steps=10,
save_steps=100,
fp16=True,
)
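# Note: these keyword names follow older TRL releases. Newer releases take the
# tokenizer as processing_class and the text-field and length settings via SFTConfig.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)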
trainer.train()
# 5. Save the adapter (small file, ~50 MB)
model.save_pretrained("./lora-llama-coding")
tokenizer.save_pretrained("./lora-llama-coding")
```
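The training arguments above imply an effective batch size and a total number of optimizer steps, which helps when judging whether `warmup_steps`, `logging_steps`, and `save_steps` are sensible. A rough sketch of that arithmetic follows, assuming a single GPU. The helper name is illustrative, and the exact rounding inside the Hugging Face trainer can differ slightly between versions.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Values from the TrainingArguments above, with a 1000-example dataset
total_steps = optimizer_steps(num_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)
print(f"effective batch = {2 * 4}, total optimizer steps = {total_steps}")
```

With 1000 examples this gives 375 updates, so `save_steps=100` writes three checkpoints and `warmup_steps=10` covers under 3% of training.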
### Example 2: Fine-Tune for LysnrAI Voice Commands
```python
# Train a model to understand your specific voice command patterns
training_data = [
{"instruction": "Parse: remind me to call john tomorrow at 3pm",
"response": '{"action": "reminder", "contact": "john", "time": "tomorrow 3pm", "task": "call"}'},
{"instruction": "Parse: add milk to my grocery list",
"response": '{"action": "add_item", "list": "grocery", "item": "milk"}'},
{"instruction": "Parse: summarize my meeting notes from yesterday",
"response": '{"action": "summarize", "source": "meeting_notes", "date": "yesterday"}'},
# ... hundreds more examples
]
```
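Because every response in this dataset must be machine-parseable, it is worth checking the examples before training. A minimal sketch follows, assuming each example is a dict with `instruction` and `response` keys as shown above and that every response should decode to a JSON object with an `action` field; the function name and the sample data are illustrative.

```python
import json


def invalid_examples(examples: list[dict]) -> list[int]:
    """Return indices of examples whose response is not a JSON object with an 'action' key."""
    bad = []
    for index, example in enumerate(examples):
        try:
            parsed = json.loads(example["response"])
        except (KeyError, TypeError, json.JSONDecodeError):
            bad.append(index)
            continue
        if not isinstance(parsed, dict) or "action" not in parsed:
            bad.append(index)
    return bad


samples = [
    {"instruction": "Parse: add milk to my grocery list",
     "response": '{"action": "add_item", "list": "grocery", "item": "milk"}'},
    {"instruction": "Parse: call john", "response": "{action: call}"},  # not valid JSON
    {"instruction": "Parse: play music", "response": '{"intent": "play"}'},  # no action key
]
print(invalid_examples(samples))
```

Catching malformed targets early matters here because the model will faithfully learn to reproduce whatever formatting mistakes the training set contains.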
### Example 3: Convert LoRA to Ollama Model
After training, merge the LoRA adapter and convert to GGUF for Ollama:
```bash
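# Merge LoRA adapter back into base model (the tokenizer is saved too, since the GGUF converter needs it)
python -c "
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained('./lora-llama-coding')
merged = model.merge_and_unload()
merged.save_pretrained('./merged-llama-coding')
AutoTokenizer.from_pretrained('./lora-llama-coding').save_pretrained('./merged-llama-coding')
"
# Convert to GGUF (requires llama.cpp). convert_hf_to_gguf.py writes f32/f16/bf16/q8_0 only,
# so export f16 first, then quantize to Q4_K_M with llama-quantize.
MERGED="$PWD/merged-llama-coding"
cd ~/llama.cpp
python convert_hf_to_gguf.py "$MERGED" --outtype f16 --outfile "$MERGED-f16.gguf"
./build/bin/llama-quantize "$MERGED-f16.gguf" "$MERGED.gguf" Q4_K_M  # path assumes a CMake build
# Create Ollama model
cd "$(dirname "$MERGED")"
cat > Modelfile <<EOF
FROM ./merged-llama-coding.gguf
SYSTEM "You are a TypeScript coding assistant specialized in Fastify, Zod, and Azure Cosmos DB."
EOF
ollama create my-coding-model -f Modelfile
ollama run my-coding-model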
```
---
## What Models Can You Fine-Tune on 24 GB VRAM?
| Model | Method | VRAM Usage | Feasible? | Training Time (1K examples) |
| ------------- | ----------- | ---------- | -------------- | --------------------------- |
| Llama 3.1 8B | QLoRA 4-bit | ~8 GB | ✅ Comfortable | ~1–2 hours |
| Qwen 2.5 7B | QLoRA 4-bit | ~8 GB | ✅ Comfortable | ~1–2 hours |
| Llama 3.1 8B | LoRA FP16 | ~18 GB | ✅ Fits | ~2–3 hours |
| Mistral 7B | QLoRA 4-bit | ~8 GB | ✅ Comfortable | ~1–2 hours |
| Llama 2 13B | QLoRA 4-bit | ~14 GB | ✅ Fits | ~3–5 hours |
| CodeLlama 34B | QLoRA 4-bit | ~22 GB | ⚠️ Tight | ~6–10 hours |
| Llama 3.1 70B | QLoRA 4-bit | ~40 GB | ❌ Too large | Need multi-GPU |
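
The weight portion of the VRAM column follows from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below computes only that floor. The remaining usage (LoRA weights, optimizer state, activations, CUDA context) varies with sequence length and batch size, so it is left out; the function name is hypothetical.

```python
"""Lower bound on VRAM: the memory needed just to hold the model weights."""

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Billions of parameters x bits per weight / 8 bits per byte = gigabytes."""
    return params_billions * bits_per_weight / 8

for name, size_b in [("8B", 8), ("13B", 13), ("34B", 34), ("70B", 70)]:
    q4 = weight_memory_gb(size_b, 4)
    fp16 = weight_memory_gb(size_b, 16)
    print(f"{name}: 4-bit weights ~{q4:.1f} GB, FP16 weights ~{fp16:.1f} GB")
```

A 70B model needs about 35 GB for its 4-bit weights alone, which is why it cannot fit in 24 GB regardless of batch size.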
---
## Benefits
| Benefit | Description |
| -------------------- | ------------------------------------------------------------------- |
| **Personalization** | Model learns YOUR coding style, patterns, and conventions |
| **Domain expertise** | Train on your project's specific API patterns, schemas, terminology |
| **Smaller + faster** | A fine-tuned 7B can outperform a generic 32B on your specific tasks |
| **Privacy** | Your training data never leaves your machine |
| **Cost** | $0 vs $100–1000s for cloud fine-tuning (OpenAI, Together AI) |
| **Iteration speed** | Train → test → adjust in hours, not days |
| **Portable output** | Export to GGUF/Ollama — runs anywhere |
---
## Skills You'll Learn
| Skill | What You'll Learn | Career Value |
| -------------------------- | -------------------------------------------------------- | --------------------------------------------- |
| **LoRA / QLoRA** | Parameter-efficient fine-tuning techniques | Top ML engineering skill (2024–2026 standard) |
| **Hugging Face ecosystem** | transformers, datasets, peft, trl, accelerate | Industry-standard ML tooling |
| **Training loops** | Loss functions, learning rates, gradient accumulation | Core ML fundamentals |
| **Data preparation** | Curating, cleaning, formatting training datasets | Critical for any ML project |
| **Quantization** | 4-bit, 8-bit, FP16 — trade-offs and techniques | Essential for deployment |
| **Model evaluation** | Perplexity, human eval, A/B testing fine-tuned vs base | ML product development |
| **VRAM management** | Gradient checkpointing, mixed precision, batch sizing | GPU optimization |
| **Model merging** | Merging LoRA adapters, converting to GGUF | ML deployment pipeline |
| **Experiment tracking** | Weights & Biases, training curves, hyperparameter tuning | Professional ML workflow |
---
## Training Data Sources for Your Projects
| Source | Data Type | Fine-Tune Goal |
| ---------------------- | ---------------------- | ------------------------ |
| Your GitHub repos | TypeScript/Python code | Coding style model |
| LysnrAI voice commands | Command → JSON pairs | Voice command parser |
| MindLyst triage logs | Input → categorization | Content triage model |
| Your commit messages | Diff → message pairs | Commit message generator |
| Your code reviews | Code → feedback pairs | Code review assistant |
| Slack/Teams messages | Conversations | Writing style model |
### Extracting Training Data from Your Repos
```bash
# Extract all TypeScript functions as training examples
find ~/code/mygh/learning_ai_common_plat -name "*.ts" -exec grep -l "export" {} \; | head -20
# Extract commit messages paired with diffs
cd ~/code/mygh/learning_ai_common_plat
git log -100 --format="%H %s" | while read -r hash msg; do
  echo "=== $msg ==="
  git diff --stat "$hash^" "$hash"  # fails on the root commit, which has no parent
  echo ""
done
```
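Once raw commit messages and diff summaries are collected, they still need to be shaped into instruction/response records. A minimal sketch, assuming the pairs are already in memory as `(diff_stat, message)` tuples and that one JSON object per line (JSONL) is the storage format; both the tuple layout and the helper name `to_jsonl` are hypothetical.

```python
"""Turn (diff summary, commit message) pairs into JSONL training records."""
import json

def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize each pair as one JSON object per line, skipping empty entries."""
    lines = []
    for diff_stat, message in pairs:
        if not diff_stat.strip() or not message.strip():
            continue  # an empty diff or message carries no training signal
        record = {
            "instruction": f"Write a commit message for this change:\n{diff_stat.strip()}",
            "response": message.strip(),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

pairs = [
    (" src/api.ts | 12 ++++++------\n 1 file changed", "fix(api): validate request body"),
    ("", "empty commit"),
]
print(to_jsonl(pairs))
```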
---
## Next Steps
- [ ] Install training libraries in a dedicated venv
- [ ] Collect 100+ instruction/response pairs from your codebase
- [ ] Run a QLoRA fine-tune of Llama 3.1 8B on your coding examples
- [ ] Evaluate: compare fine-tuned model vs base model on 20 test prompts
- [ ] Convert to GGUF and serve through Ollama
- [ ] Fine-tune a voice command parser for LysnrAI
- [ ] Experiment with different LoRA ranks (8, 16, 32, 64) and measure quality

# 5. CUDA / TensorRT / ML Research
> **RTX 5090:** Full NVIDIA toolchain — CUDA 13.x, cuDNN, TensorRT, Triton
> **Why it matters:** Most ML papers, frameworks, and production systems are CUDA-first. This is the industry-standard GPU compute platform.
---
## What Is the NVIDIA ML Toolchain?
NVIDIA provides a layered stack of tools that turn your GPU into a general-purpose compute engine:
```
┌──────────────────────────────────────────────────────────────────────┐
│                    NVIDIA ML Toolchain (RTX 5090)                    │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Your Code (Python / C++ / TypeScript)                      │     │
│  └────────────────────────┬────────────────────────────────────┘     │
│                           ▼                                          │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  ML Frameworks                                              │     │
│  │  PyTorch · TensorFlow · JAX · ONNX Runtime                  │     │
│  └────────────────────────┬────────────────────────────────────┘     │
│                           ▼                                          │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  NVIDIA Libraries                                           │     │
│  │  TensorRT · Triton · cuDNN · cuBLAS · NCCL                  │     │
│  └────────────────────────┬────────────────────────────────────┘     │
│                           ▼                                          │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  CUDA Runtime + Driver                                      │     │
│  │  CUDA 13.x · GPU scheduling · memory management             │     │
│  └────────────────────────┬────────────────────────────────────┘     │
│                           ▼                                          │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │  Hardware                                                   │     │
│  │  RTX 5090: ~10.5K CUDA cores · 24 GB GDDR7 · Tensor cores   │     │
│  └─────────────────────────────────────────────────────────────┘     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### Component Breakdown
| Component | What It Does | Why You Need It |
| ------------------- | ----------------------------------------------------------- | -------------------------------- |
| **CUDA** | General-purpose GPU programming | Foundation for everything |
| **cuDNN** | Optimized neural network primitives (conv, attention, etc.) | 2–5× faster training/inference |
| **TensorRT** | Model optimization + inference engine | 2–4× faster deployment inference |
| **Triton** (NVIDIA) | Inference server for serving models at scale | Production model serving |
| **Triton** (OpenAI) | GPU kernel compiler (write custom GPU kernels in Python) | Research + custom ops |
| **cuBLAS** | Optimized matrix multiplication | Core of all neural network math |
| **NCCL** | Multi-GPU communication | Distributed training (future) |
---
## How to Set Up
### 1. CUDA Toolkit (Inside WSL2)
```bash
# Check if CUDA is already available (from Windows driver passthrough)
nvidia-smi
nvcc --version # CUDA compiler
# If nvcc is not found, install CUDA toolkit
# (nvidia-smi works from driver passthrough, but nvcc needs the toolkit)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8  # Blackwell (RTX 50-series) requires CUDA 12.8 or newer
# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify
nvcc --version
# Should show: CUDA 12.8+
```
### 2. cuDNN
```bash
# cuDNN is usually bundled with PyTorch, but for custom builds:
sudo apt install -y cudnn9-cuda-12  # cuDNN 9 for CUDA 12.x (cuDNN 8 predates Blackwell)
# Verify in Python
python3 -c "import torch; print(f'cuDNN: {torch.backends.cudnn.version()}')"
```
### 3. TensorRT
```bash
# Install TensorRT
pip install tensorrt
# Or via apt for system-wide
sudo apt install -y tensorrt
# Verify
python3 -c "import tensorrt; print(f'TensorRT: {tensorrt.__version__}')"
```
### 4. PyTorch with Full CUDA Support
```bash
# Create a research environment
python3.12 -m venv ~/.venv-ml-research
source ~/.venv-ml-research/bin/activate
# Install PyTorch with CUDA 12.8 (the cu124 wheels do not include RTX 50-series kernels)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Verify
python3 -c "
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
print(f'cuDNN: {torch.backends.cudnn.version()}')
"
```
---
## How to Use It
### CUDA Programming Basics (Python)
```python
"""Your first CUDA program — matrix multiplication on the GPU."""
import torch
import time
device = torch.device("cuda")
# Create two large matrices on the GPU
A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)
# Warm up
_ = torch.mm(A, B)
torch.cuda.synchronize()
# Benchmark
start = time.perf_counter()
for _ in range(100):
    C = torch.mm(A, B)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
tflops = (2 * 4096**3 * 100) / elapsed / 1e12
print(f"Matrix multiply: {elapsed:.3f}s for 100 iterations")
print(f"Throughput: {tflops:.1f} TFLOPS")
# Upper-bound estimate for RTX 5090: ~100–200 TFLOPS (FP32) or ~200–400 TFLOPS (FP16);
# a power-limited laptop GPU will usually measure well below this
```
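The throughput formula in the benchmark counts floating-point operations: multiplying two n×n matrices takes about 2·n³ operations (one multiply and one add per inner-product term). A standalone version of that arithmetic, with a hypothetical helper name, makes it easy to convert any timing into TFLOPS:

```python
"""Convert a matmul timing into TFLOPS (same arithmetic as the benchmark)."""

def matmul_tflops(n: int, iterations: int, seconds: float) -> float:
    """2 * n^3 floating-point operations per n x n matrix multiply."""
    total_flops = 2 * n**3 * iterations
    return total_flops / seconds / 1e12

# Example: 100 multiplies of 4096 x 4096 matrices finishing in 0.5 s.
print(f"{matmul_tflops(4096, 100, 0.5):.1f} TFLOPS")
```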
### TensorRT: Optimize a Model for Faster Inference
```python
"""Convert a PyTorch model to TensorRT for 2–4× faster inference."""
import time
import torch
import torch_tensorrt
# Load a model in FP16 so the baseline and the TensorRT engine use the same precision
model = torch.hub.load('pytorch/vision', 'resnet50', weights='IMAGENET1K_V2').eval().half().cuda()
# Compile with TensorRT
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float16)],
    enabled_precisions={torch.float16},
)
# Benchmark
input_tensor = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)
# PyTorch baseline
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(1000):
        _ = model(input_tensor)
    torch.cuda.synchronize()
    pytorch_time = time.perf_counter() - start
# TensorRT optimized
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(1000):
        _ = trt_model(input_tensor)
    torch.cuda.synchronize()
    trt_time = time.perf_counter() - start
print(f"PyTorch: {pytorch_time:.3f}s")
print(f"TensorRT: {trt_time:.3f}s")
print(f"Speedup: {pytorch_time/trt_time:.1f}×")
```
### Reproducing ML Research Papers
Most ML papers provide CUDA-only code. The RTX 5090 lets you run them directly:
```bash
# Example: Run a recent paper's code
git clone https://github.com/some-researcher/cool-new-model.git
cd cool-new-model
# Typical requirements
pip install -r requirements.txt
# Run training/evaluation
python train.py --device cuda --epochs 10 --batch-size 16
# This would NOT work on Mac (CUDA-only dependencies)
```
### Custom CUDA Kernels with Triton (OpenAI)
```python
"""Write a custom GPU kernel in Python using Triton."""
import triton
import triton.language as tl
import torch
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    """Simple vector addition kernel running on GPU."""
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
# Run the kernel
n = 1_000_000
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
output = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, output, n, BLOCK_SIZE=1024)
print(f"Result correct: {torch.allclose(output, x + y)}")
```
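The launch grid and the mask are what keep the kernel from reading past the end of the tensors. That arithmetic can be checked on the CPU without Triton installed. In the sketch below, `cdiv` is a plain-Python stand-in for `triton.cdiv`, written here for illustration only.

```python
"""CPU-side check of the launch geometry used by add_kernel."""

def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same rounding triton.cdiv performs."""
    return (a + b - 1) // b

n_elements = 1_000_000
BLOCK_SIZE = 1024

num_programs = cdiv(n_elements, BLOCK_SIZE)   # one program instance per block
lanes_launched = num_programs * BLOCK_SIZE    # total offsets generated across the grid
masked_lanes = lanes_launched - n_elements    # offsets >= n_elements, disabled by the mask

print(f"programs: {num_programs}, masked-off lanes in last block: {masked_lanes}")
```

Without the mask, those trailing lanes in the final block would load and store out of bounds.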
---
## Real-World Use Cases for Your Projects
### 1. LysnrAI Inference Optimization
Convert your Whisper or TTS models to TensorRT for even faster inference:
```python
# Optimize Whisper encoder with TensorRT
# This can give another 2× speedup on top of whisper.cpp CUDA
```
### 2. Custom Embedding Models
Train domain-specific embedding models for LysnrAI's semantic search:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# Encode your documents
documents = ["meeting notes about Q1 budget", "grocery list for weekend"]  # add your own documents here
embeddings = model.encode(documents, batch_size=64, show_progress_bar=True)
# ~1000 documents/second on RTX 5090
```
### 3. Reproduce and Experiment with New Models
When a new paper drops (GPT-5 alternatives, new TTS models, new architectures):
```bash
# Clone → install → run — no CUDA compatibility issues
git clone https://github.com/new-cool-model
pip install -r requirements.txt
python evaluate.py --device cuda
```
### 4. Benchmarking and Profiling
```python
# Profile GPU usage during inference
# (assumes `model` and `input_tensor` from the TensorRT example above)
import torch
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    model(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total"))
```
---
## Benefits
| Benefit | Description |
| ------------------------- | --------------------------------------------------------- |
| **Industry standard** | The large majority of ML research code and production GPU inference targets CUDA |
| **Framework support** | PyTorch, TensorFlow, JAX — all CUDA-first |
| **Paper reproduction** | Run most ML papers' code as published, without porting CUDA-only dependencies |
| **TensorRT optimization** | 2–4× faster inference on optimized models |
| **Custom kernels** | Write GPU code in Python (Triton) or C++ (CUDA) |
| **Profiling tools** | nvidia-smi, Nsight, PyTorch profiler — rich debugging |
| **Production parity** | Code runs identically on cloud GPU instances (A100, H100) |
---
## Skills You'll Learn
| Skill | What You'll Learn | Career Value |
| ------------------------- | ----------------------------------------------------- | ---------------------------------- |
| **CUDA fundamentals** | GPU memory model, kernel launches, synchronization | Core ML infrastructure skill |
| **TensorRT** | Model optimization, quantization, graph fusion | Production ML deployment |
| **Triton kernels** | Custom GPU programming in Python | Research + performance engineering |
| **PyTorch profiling** | Identifying bottlenecks, optimizing training loops | Essential ML engineering |
| **cuDNN** | Optimized neural network operations | Framework-level understanding |
| **Mixed precision** | FP16/BF16 training, loss scaling, numerical stability | Modern training standard |
| **GPU memory management** | Memory pools, caching allocator, OOM debugging | Practical ML engineering |
| **Model optimization** | Graph optimization, operator fusion, quantization | ML deployment pipeline |
| **Benchmark design** | Fair comparison methodology, statistical significance | Research methodology |
### Career Impact
CUDA proficiency is one of the most valuable ML engineering skills. Here's where it maps:
| Role | How CUDA Skills Apply |
| --------------------- | ----------------------------------------------------- |
| **ML Engineer** | Optimize training pipelines, reduce inference latency |
| **AI Infrastructure** | Build and maintain GPU clusters, optimize throughput |
| **Research Engineer** | Implement custom operations for novel architectures |
| **MLOps** | TensorRT deployment, GPU monitoring, autoscaling |
| **Full-Stack AI** | End-to-end: train → optimize → serve → monitor |
---
## Monitoring and Debugging
```bash
# Real-time GPU monitoring
watch -n 0.5 nvidia-smi
# Detailed GPU info
nvidia-smi -q
# GPU process list
nvidia-smi pmon
# Install nvtop (interactive GPU monitor)
sudo apt install nvtop
nvtop
# PyTorch memory debugging
python3 -c "
import torch
print(torch.cuda.memory_summary(device=0, abbreviated=True))
"
```
---
## Next Steps
- [ ] Install CUDA toolkit + cuDNN in WSL2
- [ ] Verify PyTorch CUDA with a matrix multiply benchmark
- [ ] Run a TensorRT optimization on a simple model
- [ ] Write a Triton kernel (vector add → custom attention)
- [ ] Profile an Ollama inference request with nvidia-smi
- [ ] Try reproducing a recent ML paper from GitHub
- [ ] Benchmark: PyTorch vs TensorRT inference speed for Whisper

# 6. Stable Diffusion / Image Generation
> **RTX 5090:** 5–8 seconds per image (SDXL) vs ~30 seconds on Mac
> **Why it matters:** Generate UI mockups, app icons, marketing assets, concept art — all locally and free
---
## What Is Stable Diffusion?
Stable Diffusion is an open-source text-to-image AI model. You describe what you want in plain English, and it generates a high-quality image in seconds. It runs entirely on your GPU.
```
┌──────────────────────────────────────────────────────────────────────┐
│                      Stable Diffusion Pipeline                       │
│                                                                      │
│  "A modern app dashboard with dark theme and blue accents"           │
│          │                                                           │
│          ▼                                                           │
│  [CLIP Text Encoder] → text embeddings                               │
│          │                                                           │
│          ▼                                                           │
│  [U-Net: iterative denoising × 20–50 steps] ← GPU-intensive          │
│          │                                                           │
│          ▼                                                           │
│  [VAE Decoder] → pixel image                                         │
│          │                                                           │
│          ▼                                                           │
│  1024×1024 PNG image                                                 │
│                                                                      │
│  RTX 5090: ~5–8s per image (SDXL)                                    │
│  Mac M4 Pro: ~25–35s per image (SDXL via MPS)                        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
---
## Performance: Mac vs Razer
| Model | Resolution | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
| ---------------- | ---------- | ---------------- | --------------------- | ------- |
| SD 1.5 | 512×512 | ~8–12s | ~1–2s | ~5× |
| SDXL | 1024×1024 | ~25–35s | ~5–8s | ~4× |
| SDXL Turbo | 1024×1024 | ~8–12s | ~1–3s | ~4× |
| FLUX.1 [dev] | 1024×1024 | ~60–90s | ~10–20s | ~5× |
| FLUX.1 [schnell] | 1024×1024 | ~15–25s | ~3–6s | ~4× |
### VRAM Requirements
| Model | VRAM Needed | Fits in 24 GB? |
| ----------------- | ----------- | -------------- |
| SD 1.5 | ~4 GB | ✅ Easily |
| SDXL | ~7 GB | ✅ Easily |
| SDXL + ControlNet | ~10 GB | ✅ Yes |
| FLUX.1 [dev] (FP8 weights) | ~12 GB | ✅ Yes |
| FLUX.1 + LoRA (FP8 weights) | ~14 GB | ✅ Yes |
| SD3 Medium | ~12 GB | ✅ Yes |
**24 GB VRAM fits every model in this table, although FLUX.1 at full BF16 precision needs CPU offload or quantization.**
---
## How to Set Up
### Option A: ComfyUI (Node-Based — Recommended)
ComfyUI is a powerful node-based UI for Stable Diffusion. It's flexible, efficient, and well-suited for automation.
```bash
# Clone ComfyUI
cd ~
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
# Create venv and install
python3.12 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
# Download SDXL model (~6.5 GB)
mkdir -p models/checkpoints
cd models/checkpoints
wget "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
# Start ComfyUI
cd ~/ComfyUI
python main.py
# Open in browser: http://localhost:8188
# Accessible from Windows browser via WSL2 port forwarding
```
### Option B: Automatic1111 (Classic Web UI)
```bash
# Clone A1111
cd ~
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
# Launch (auto-installs deps on first run)
bash webui.sh --listen --xformers
# Open: http://localhost:7860
```
### Option C: Python Script (No UI)
```python
"""Generate images with Stable Diffusion from Python."""
import torch
from diffusers import StableDiffusionXLPipeline
# Load SDXL
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
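# Optional memory optimization: requires `pip install xformers`. PyTorch 2.x already uses
# memory-efficient attention by default, so drop this line if xformers is not installed.
pipe.enable_xformers_memory_efficient_attention()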
# Generate
image = pipe(
prompt="A sleek dark-themed mobile app dashboard showing AI brain categories, "
"neon blue and teal accents, glassmorphism cards, modern UI design",
negative_prompt="blurry, low quality, text, watermark",
num_inference_steps=30,
guidance_scale=7.5,
width=1024,
height=1024,
).images[0]
image.save("dashboard_concept.png")
print("Generated: dashboard_concept.png")
```
### Batch Generation Script
```python
"""Generate multiple variations of an image concept."""
import torch
from diffusers import StableDiffusionXLPipeline
import time
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
prompts = [
"App icon for LysnrAI, microphone with sound waves, dark background, modern gradient",
"App icon for MindLyst, brain with neural connections, dark background, blue teal gradient",
"Feature graphic for LysnrAI, voice waveform visualization, dark theme, 1024x500",
"Splash screen, abstract sound waves, dark navy background, teal highlights",
"Dashboard mockup, dark theme, cards with charts, sidebar navigation, modern UI",
]
for i, prompt in enumerate(prompts):
    start = time.time()
    image = pipe(
        prompt=prompt,
        negative_prompt="blurry, low quality, text, watermark, ugly",
        num_inference_steps=30,
        guidance_scale=7.5,
        width=1024,
        height=1024,
        generator=torch.Generator("cuda").manual_seed(42 + i),
    ).images[0]
    elapsed = time.time() - start
    filename = f"generated_{i:02d}.png"
    image.save(filename)
    print(f"[{i+1}/{len(prompts)}] {filename} ({elapsed:.1f}s)")
```
---
## Real-World Use Cases for Your Projects
### 1. App Store Assets
Generate icon concepts, feature graphics, and screenshot backgrounds:
```python
# LysnrAI app icon variations
prompts = [
"Minimal app icon, single microphone, dark navy background, glowing teal outline, flat design",
"App icon, sound wave forming a brain shape, dark background, blue to teal gradient",
"App icon, headphones with AI sparkles, dark background, modern glassmorphism",
]
```
### 2. UI/UX Mockup Exploration
Rapidly prototype visual ideas before coding:
```python
# Generate dashboard layout concepts
prompt = """
Web dashboard design, dark theme (#06070A background), sidebar navigation,
main content area with 3 cards showing brain categories,
teal and blue accent colors, modern glassmorphism,
high fidelity UI design, clean typography
"""
```
### 3. Marketing and Social Media
```python
# Blog post hero images
prompt = "Abstract AI visualization, neural network nodes connected by light beams, dark background, blue and teal colors, cinematic lighting"
# Social media posts
prompt = "Infographic style, voice AI assistant concept, microphone with sound waves transforming into text, dark modern design"
```
### 4. Concept Art for Features
```python
# Visualize MindLyst "Brains" concept
prompts = [
"Digital brain labeled 'Work', organized files and charts floating around it, dark theme, blue glow",
"Digital brain labeled 'Health', fitness and medical icons floating around it, dark theme, green glow",
"Digital brain labeled 'Finance', money and graphs floating around it, dark theme, gold glow",
]
```
### 5. ControlNet (Image-Guided Generation)
Use an existing image as a structural guide:
```python
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
# Load ControlNet for edge-guided generation
controlnet = ControlNetModel.from_pretrained(
"diffusers/controlnet-canny-sdxl-1.0",
torch_dtype=torch.float16,
).to("cuda")
# Use your existing dashboard screenshot as a structural guide
# Generate a redesigned version with new visual style
```
---
## Benefits
| Benefit | Description |
| --------------------------- | ---------------------------------------------------------------- |
| **Speed** | 5–8s per image (vs ~30s on Mac or waiting for cloud APIs) |
| **Cost** | $0 per image (vs $0.02–0.08 per image for DALL-E 3 / Midjourney) |
| **Privacy** | Generated locally — no images sent to cloud |
| **Control** | Full parameter control: steps, guidance, seed, resolution |
| **Reproducibility** | Same seed = same image every time |
| **Customization** | LoRA fine-tuning for brand-specific styles |
| **Batch capability** | Generate hundreds of variations overnight |
| **No content restrictions** | No cloud content policies limiting your output |
### Cost Comparison (100 images)
| Method | Cost | Time |
| -------------------------------- | ------------ | -------------- |
| DALL-E 3 API | $4–8 | ~5 min (cloud) |
| Midjourney | $10–30/month | ~5 min (cloud) |
| Mac M4 Pro (SDXL, local) | $0 | ~40–55 min |
| **Razer RTX 5090 (SDXL, local)** | **$0** | **~8–13 min** |
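
The time column is per-image latency multiplied by the image count. A small calculator, using SDXL latencies of 5 to 8 seconds (RTX 5090) and 25 to 35 seconds (Mac M4 Pro) as sample inputs; the helper name `batch_minutes` is hypothetical.

```python
"""Estimate wall-clock time for a batch of images from per-image latency."""

def batch_minutes(images: int, seconds_low: float, seconds_high: float) -> tuple[float, float]:
    """Return the (best, worst) batch duration in minutes."""
    return images * seconds_low / 60, images * seconds_high / 60

for label, low, high in [("RTX 5090 (SDXL)", 5, 8), ("Mac M4 Pro (SDXL)", 25, 35)]:
    best, worst = batch_minutes(100, low, high)
    print(f"{label}: {best:.0f}-{worst:.0f} min for 100 images")
```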
---
## Skills You'll Learn
| Skill | What You'll Learn | Career Value |
| ------------------------- | --------------------------------------------------- | ---------------------------- |
| **Diffusion models** | How iterative denoising generates images from noise | Core generative AI knowledge |
| **Prompt engineering** | Crafting effective text prompts for visual output | Universal AI skill |
| **ControlNet** | Structural guidance for image generation | Advanced image AI |
| **LoRA training** | Fine-tuning image models for specific styles | Generative AI customization |
| **ComfyUI workflows** | Node-based visual programming for AI pipelines | Production image generation |
| **Image post-processing** | Upscaling, inpainting, outpainting techniques | Digital content creation |
| **VRAM optimization** | Model offloading, attention optimization, tiling | GPU memory management |
| **Batch automation** | Scripting large-scale image generation | Production engineering |
| **Model selection** | SD 1.5 vs SDXL vs FLUX — trade-offs | Practical AI judgment |
---
## Advanced: FLUX.1 (Latest Generation)
FLUX.1 is a newer open-weights image model family from Black Forest Labs (founded by Stability AI alumni). It generally produces higher-quality images than SDXL.
```bash
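# FLUX.1 [schnell]: fast, 4 steps
# FLUX.1 [dev]: high quality, 50 steps
# The 12B transformer plus text encoders exceed 24 GB at full BF16 precision, so enable CPU offload.
# If that still runs out of memory, try enable_sequential_cpu_offload() or a quantized checkpoint.
python3 -c "
from diffusers import FluxPipeline
import torch
pipe = FluxPipeline.from_pretrained(
    'black-forest-labs/FLUX.1-schnell',
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keeps each component on the GPU only while it runs
image = pipe(
    'A futuristic AI assistant interface, holographic UI, dark theme',
    num_inference_steps=4,
    guidance_scale=0.0,
    max_sequence_length=256,
).images[0]
image.save('flux_test.png')
"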
```
---
## Next Steps
- [ ] Install ComfyUI in WSL2 and verify CUDA acceleration
- [ ] Download SDXL base model and generate first test image
- [ ] Generate app icon concepts for LysnrAI and MindLyst
- [ ] Try ControlNet with an existing dashboard screenshot
- [ ] Experiment with FLUX.1 [schnell] for higher quality
- [ ] Create a batch script for generating marketing assets
- [ ] Build a style LoRA trained on your brand colors/aesthetic

# 7. Multi-GPU Workloads (Future Path)
> **RTX 5090:** Your first serious CUDA GPU — a stepping stone to multi-GPU and cloud GPU workflows
> **Why it matters:** The skills, code, and workflows you build on one GPU translate directly to multi-GPU and cloud infrastructure
---
## Why Think About Multi-GPU Now?
You don't need multiple GPUs today. But learning on a single RTX 5090 builds skills that scale directly:
```
┌──────────────────────────────────────────────────────────────────────┐
│                           GPU Scaling Path                           │
│                                                                      │
│  TODAY                                                               │
│  ┌─────────────────────────────────┐                                 │
│  │  RTX 5090 Laptop (24 GB VRAM)   │  ← You are here                 │
│  │  Single GPU, local inference    │                                 │
│  │  Fine-tuning up to 13B models   │                                 │
│  └────────────────┬────────────────┘                                 │
│                   │                                                  │
│  NEAR FUTURE      ▼                                                  │
│  ┌─────────────────────────────────┐                                 │
│  │  Desktop + eGPU or 2× GPU tower │                                 │
│  │  48–96 GB total VRAM            │                                 │
│  │  Fine-tune 70B models           │                                 │
│  │  Run 2–3 models simultaneously  │                                 │
│  └────────────────┬────────────────┘                                 │
│                   │                                                  │
│  CLOUD / PROD     ▼                                                  │
│  ┌─────────────────────────────────┐                                 │
│  │  Cloud GPU instances            │                                 │
│  │  A100/H100 × 2–8 (80–640 GB)    │                                 │
│  │  Train large models             │                                 │
│  │  Serve at scale                 │                                 │
│  └─────────────────────────────────┘                                 │
│                                                                      │
│  SAME CUDA code works at every level ↑                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
---
## What Multi-GPU Enables
| Capability | Single GPU (24 GB) | 2× GPU (48 GB) | 4× GPU (96 GB) | Cloud (640 GB) |
| --------------------- | ------------------ | --------------- | --------------- | -------------- |
| 7B inference | ✅ Very fast | ✅ + concurrent | ✅ + concurrent | ✅ at scale |
| 32B inference | ✅ Fits in VRAM | ✅ Very fast | ✅ Very fast | ✅ at scale |
| 70B inference | ⚠️ Partial GPU | ✅ Full GPU | ✅ Very fast | ✅ at scale |
| 7B fine-tune (QLoRA) | ✅ Comfortable | ✅ Faster | ✅ Faster | ✅ Fastest |
| 13B fine-tune (QLoRA) | ✅ Fits | ✅ Comfortable | ✅ Fast | ✅ Fastest |
| 70B fine-tune (QLoRA) | ❌ OOM | ⚠️ Tight | ✅ Fits | ✅ Comfortable |
| 7B full fine-tune | ❌ OOM | ⚠️ Tight | ✅ Fits | ✅ Comfortable |
| Serve 3+ models | ❌ VRAM limit | ✅ Yes | ✅ Yes | ✅ Yes |
| SDXL + LLM concurrent | ⚠️ Tight | ✅ Yes | ✅ Yes | ✅ Yes |
---
## Expansion Options
### Option 1: eGPU (Thunderbolt/USB4)
Connect an external GPU to your Razer Blade via Thunderbolt 4:
```
┌──────────────────────────────────┐     Thunderbolt 4     ┌──────────────────────┐
│ Razer Blade 18                   │◄═══(~40 Gbps)════════►│ eGPU Enclosure       │
│ RTX 5090 (24 GB), internal       │                       │ RTX 4090 (24 GB)     │
│                                  │                       │ or RTX 5090 (32 GB)  │
└──────────────────────────────────┘                       └──────────────────────┘
Total VRAM: 48 GB (24 + 24) with an RTX 4090, or 56 GB (24 + 32) with a desktop RTX 5090
Limitation: Thunderbolt bandwidth (~40 Gbps, roughly 5 GB/s) is far slower than PCIe 5.0 x16 (~64 GB/s per direction)
Best for: Model serving (latency-tolerant), not training (bandwidth-sensitive)
```
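To see why link bandwidth matters, it helps to convert link speed into the time needed to move a model across it. The sketch assumes the link sustains its full nominal rate, which real Thunderbolt connections do not, so treat the result as a best case; the helper name `transfer_seconds` is hypothetical.

```python
"""Best-case time to move data over a link with a given nominal speed."""

def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Gigabytes -> gigabits (x8), divided by the link rate in gigabits per second."""
    return size_gb * 8 / link_gbps

# Loading ~19 GB of 4-bit 32B model weights over a 40 Gbps Thunderbolt link.
print(f"{transfer_seconds(19, 40):.1f} s at best")
```

A one-time model load tolerates this delay; training, which exchanges data between GPUs on every step, does not.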
**Recommended eGPU enclosures:**
| Enclosure | Price | GPU Support |
|-----------|-------|-------------|
| Razer Core X | ~$300 | Full-length desktop GPUs |
| Sonnet Breakaway | ~$250 | Most desktop GPUs |
| Akitio Node | ~$200 | Compact form factor |
### Option 2: Desktop Build (Maximum Performance)
Build a dedicated GPU workstation:
```
┌──────────────────────────────────────────────────────────────────────┐
│                       Desktop GPU Workstation                        │
│                                                                      │
│  Motherboard: ASUS/MSI with 2–4× PCIe 5.0 x16 slots                  │
│  CPU:         Intel i9 or AMD Ryzen 9 (enough PCIe lanes)            │
│  RAM:         128 GB DDR5                                            │
│  GPU 1:       RTX 5090 (32 GB GDDR7)                                 │
│  GPU 2:       RTX 5090 (32 GB GDDR7)                                 │
│  Total VRAM:  64 GB                                                  │
│  PSU:         1600W+ (two 5090s draw ~1150W under load)              │
│                                                                      │
│  Cost:        ~$5,000–7,000                                          │
│  Performance: Near-datacenter for most workloads                     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### Option 3: Cloud GPU (On-Demand)
Rent GPU time when you need it:
| Provider | GPU | VRAM | Price/Hour | Best For |
| ----------- | --------- | ------ | ---------- | -------------------- |
| Lambda Labs | A100 80GB | 80 GB | ~$1.10 | Training |
| RunPod | A100 80GB | 80 GB | ~$1.64 | Training + inference |
| Vast.ai | RTX 4090 | 24 GB | ~$0.30 | Budget inference |
| AWS (p4d) | A100 ×8 | 640 GB | ~$32 | Large-scale training |
| Together AI | H100 | 80 GB | ~$2.50 | Fine-tuning API |
**Your RTX 5090 code runs identically on cloud GPUs** — same PyTorch, same CUDA.
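
Whether renting makes sense comes down to hours × hourly price × GPU count. A short calculator, using the approximate prices from the table above as sample inputs; cloud prices change often, and the function name `job_cost` is hypothetical.

```python
"""Estimate the cost of a cloud GPU job."""

def job_cost(hours: float, price_per_hour: float, gpus: int = 1) -> float:
    """Total cost in dollars for `gpus` instances running for `hours`."""
    return hours * price_per_hour * gpus

# A 6-hour fine-tune on one A100 at two of the sample prices.
for provider, price in [("Lambda Labs", 1.10), ("RunPod", 1.64)]:
    print(f"{provider}: ${job_cost(6, price):.2f}")
```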
---
## How to Prepare Now (Single GPU)
### 1. Write GPU-Agnostic Code
Structure your code so it works with any number of GPUs:
```python
"""GPU-agnostic model loading — works with 1 or N GPUs."""
import torch
def get_device():
    """Select the best available device."""
    if torch.cuda.is_available():
        # Multi-GPU: wrap the model in DataParallel or DistributedDataParallel (see below)
        if torch.cuda.device_count() > 1:
            print(f"Using {torch.cuda.device_count()} GPUs")
            return "cuda"  # default CUDA device; the wrapper spreads work across the others
        return "cuda:0"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
device = get_device()
model = MyModel().to(device)  # MyModel is a placeholder for your own nn.Module
# Wrap for multi-GPU (skipped on a single GPU)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
### 2. Learn Model Parallelism Concepts
```python
"""Layer-wise model parallelism: split model layers across GPUs (layers run in sequence)."""
# Example: split a large model across 2 GPUs
# GPU 0: layers 0–15
# GPU 1: layers 16–31
# With Hugging Face Accelerate:
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained("./model-weights")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./model-weights",
    device_map="auto",  # Automatically distributes across available GPUs
)
```
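
To make the "layers 0–15 on GPU 0, layers 16–31 on GPU 1" idea concrete, the sketch below builds that kind of mapping by hand in plain Python. The module names (`model.layers.N`) are an assumption for illustration, since real names depend on the architecture. `device_map="auto"` also weighs layer sizes and free memory rather than splitting purely by count.

```python
"""Split N transformer layers into contiguous blocks, one block per GPU."""


def split_layers(num_layers: int, num_gpus: int) -> dict[str, int]:
    """Map hypothetical layer names to GPU indices in near-equal contiguous blocks."""
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")
    base, extra = divmod(num_layers, num_gpus)
    mapping: dict[str, int] = {}
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        block = base + (1 if gpu < extra else 0)
        for _ in range(block):
            mapping[f"model.layers.{layer}"] = gpu
            layer += 1
    return mapping


if __name__ == "__main__":
    device_map = split_layers(num_layers=32, num_gpus=2)
    for gpu in (0, 1):
        names = [name for name, dev in device_map.items() if dev == gpu]
        print(f"GPU {gpu}: {names[0]} .. {names[-1]} ({len(names)} layers)")
```

Each GPU only stores its own block of layers, but a single request still passes through the blocks one after another, so this layout buys capacity rather than speed.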
### 3. Ollama Multi-GPU (Already Supported)
Ollama can split a single large model across multiple GPUs:
```bash
# When you have 2 GPUs, Ollama auto-detects them and splits the model
# e.g. a ~40 GB 70B Q4 model is spread across both GPUs;
# layers that do not fit in VRAM fall back to system RAM

# Or restrict which GPUs Ollama may use
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Check GPU allocation
nvidia-smi # Shows VRAM usage per GPU
```
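
For scripted checks, `nvidia-smi` can emit the same per-GPU memory figures as CSV. The sketch below parses that output; the `parse_gpu_memory` helper is an illustrative name, and the fallback sample line is made up so the script still runs on a machine without an NVIDIA driver.

```python
"""Report per-GPU VRAM usage by parsing nvidia-smi's CSV output."""
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse lines such as '0, NVIDIA GeForce RTX 5090, 19000, 24000' (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
            "used_pct": round(100 * int(used) / int(total), 1),
        })
    return gpus


if __name__ == "__main__":
    try:
        output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No NVIDIA driver here: fall back to made-up sample output.
        output = "0, NVIDIA GeForce RTX 5090, 19000, 24000\n"
    for gpu in parse_gpu_memory(output):
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['used_mib']}/{gpu['total_mib']} MiB ({gpu['used_pct']}%)")
```

Running it while a model is loaded shows how the weights were divided between GPUs.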
### 4. vLLM (High-Throughput Inference Server)
```bash
# vLLM supports tensor parallelism across GPUs
pip install vllm

# Serve a model across 2 GPUs.
# Note: the unquantized 70B weights are ~140 GB in FP16/BF16, so on two
# consumer GPUs substitute a quantized (e.g. AWQ or GPTQ) checkpoint.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9

# API compatible with OpenAI format (send the body as JSON)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "prompt": "Hello",
    "max_tokens": 100
  }'
```
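
The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server above is listening on `localhost:8000`; the helper names `build_completion_request` and `complete` are illustrative.

```python
"""Call a local vLLM server's OpenAI-compatible completions endpoint."""
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the server started above


def build_completion_request(model: str, prompt: str,
                             max_tokens: int = 100) -> urllib.request.Request:
    """Build the POST request with a JSON body and JSON content type."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, max_tokens: int = 100) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    try:
        print(complete("meta-llama/Meta-Llama-3.1-70B-Instruct", "Hello"))
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Could not reach the server: {exc}")
```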
---
## Scaling Scenarios for Your Projects
### Scenario 1: LysnrAI at Scale
```
TODAY (1× RTX 5090):
- 1 user, local inference
- Whisper + Ollama + TTS, one at a time
FUTURE (2× GPU desktop):
- GPU 0: Ollama (always-on coding model)
- GPU 1: Whisper + TTS (dedicated)
- Run all 3 workloads concurrently
PRODUCTION (Cloud):
- vLLM serving on A100
- Whisper on dedicated GPU
- TTS on dedicated GPU
- Handles 100+ concurrent users
```
### Scenario 2: Fine-Tuning Larger Models
```
TODAY (24 GB VRAM):
- QLoRA 7B–13B models
- Training time: 1–4 hours
FUTURE (2× desktop RTX 5090, 32 GB each = 64 GB VRAM):
- QLoRA 70B models
- LoRA FP16 on models up to ~13B (32B needs ~64 GB for FP16 weights alone)
- Training time: 4–12 hours
CLOUD (80+ GB VRAM):
- Full fine-tune 7B–13B models
- QLoRA 70B+ models
- Training time: hours with H100
```
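
The model sizes in these tiers follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below covers weights only. Activations, LoRA adapters, optimizer state, and the KV cache add more, so treat the results as lower bounds. The function name is illustrative.

```python
"""Lower-bound VRAM estimate: model weights only."""

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size, precision in [(7, "int4"), (13, "int4"), (32, "fp16"), (70, "int4")]:
        gb = weight_memory_gb(size, precision)
        print(f"{size}B @ {precision}: ~{gb:.1f} GB of weights")
```

For example, a 70B model quantized to 4 bits needs about 35 GB for weights alone, which is why it is out of reach on a single 24 GB GPU.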
### Scenario 3: Image + Text Generation Pipeline
```
TODAY (1× GPU):
- SDXL OR LLM, not both at once (VRAM constraint)
FUTURE (2× GPU):
- GPU 0: LLM (Ollama, 32B model, ~19 GB)
- GPU 1: SDXL + ControlNet (~10 GB)
- Generate images guided by LLM descriptions
PRODUCTION:
- Automated content pipeline:
LLM writes description → SDXL generates image → TTS adds voiceover
```
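
The tiers above amount to a small bin-packing problem: each workload needs a fixed amount of VRAM and each GPU has a fixed capacity. Below is a first-fit sketch using the approximate figures from this scenario (~19 GB for the 32B LLM, ~10 GB for SDXL + ControlNet). The function and workload names are illustrative, and the two-GPU case assumes a second 24 GB card.

```python
"""First-fit placement of workloads onto GPUs by VRAM requirement."""
from typing import Optional


def place_workloads(workloads: dict[str, float],
                    gpu_capacity_gb: list[float]) -> dict[str, Optional[int]]:
    """Assign each workload to the first GPU with enough free VRAM, else None."""
    free = list(gpu_capacity_gb)
    placement: dict[str, Optional[int]] = {}
    for name, need_gb in workloads.items():
        placement[name] = None
        for gpu, available in enumerate(free):
            if need_gb <= available:
                free[gpu] -= need_gb
                placement[name] = gpu
                break
    return placement


if __name__ == "__main__":
    workloads = {"llm_32b": 19.0, "sdxl_controlnet": 10.0}
    print("one 24 GB GPU: ", place_workloads(workloads, [24.0]))
    print("two 24 GB GPUs:", place_workloads(workloads, [24.0, 24.0]))
```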
---
## Benefits of Starting Single-GPU
| Benefit | Description |
| ------------------------ | ------------------------------------------------------------------- |
| **Code portability**     | Device-agnostic code scales from 1 GPU to many with minor changes   |
| **Skill foundation** | Memory management, profiling, optimization skills transfer directly |
| **Cost efficiency** | Learn on local hardware ($0) before renting cloud ($$$) |
| **Workflow development** | Build training pipelines, inference servers, batch scripts now |
| **Hardware literacy** | Understand VRAM limits, bandwidth, PCIe bottlenecks |
---
## Skills You'll Build Toward
| Skill | Single GPU (Now) | Multi-GPU (Future) | Career Impact |
| ------------------------ | ------------------ | ---------------------------- | ----------------------- |
| **CUDA programming** | Kernels, memory | NCCL, all-reduce | ML Infrastructure |
| **Model parallelism** | Understand concept | Implement tensor/pipeline | Senior ML Engineer |
| **Distributed training** | Data loading | Multi-node coordination | ML Platform Engineer |
| **Inference serving** | Ollama, local API | vLLM, Triton, load balancing | MLOps / Production |
| **GPU monitoring** | nvidia-smi, nvtop | Cluster monitoring | DevOps / SRE |
| **Cost optimization** | VRAM budget | Spot instances, right-sizing | FinOps / ML Engineering |
| **Batch scheduling** | Cron jobs | Kubernetes, Slurm | ML Platform |
### Learning Path
```
┌──────────────────────────────────────────────────────────────────────┐
│ Skills Progression │
│ │
│ Level 1 (Now — RTX 5090 Single GPU) │
│ ├── PyTorch + CUDA basics │
│ ├── Ollama model serving │
│ ├── QLoRA fine-tuning 7B models │
│ ├── nvidia-smi monitoring │
│ └── TensorRT basic optimization │
│ │
│ Level 2 (6 months — Same GPU, deeper skills) │
│ ├── Custom Triton kernels │
│ ├── vLLM inference server │
│ ├── Advanced quantization (AWQ, GPTQ) │
│ ├── Profiling and optimization │
│ └── Model merging and adapter stacking │
│ │
│ Level 3 (12 months — Multi-GPU or Cloud) │
│ ├── Multi-GPU inference (tensor parallelism) │
│ ├── Distributed training (DDP, FSDP) │
│ ├── Cloud GPU workflow (Lambda, RunPod) │
│ ├── Production serving with autoscaling │
│ └── NCCL and multi-node communication │
│ │
│ Level 4 (Future — Production ML) │
│ ├── Kubernetes + GPU scheduling │
│ ├── Model serving at scale (thousands of requests/sec) │
│ ├── Training pipelines with experiment tracking │
│ ├── Custom model architectures │
│ └── Open-source ML contributions │
│ │
└──────────────────────────────────────────────────────────────────────┘
```
---
## Cost Planning
### Single GPU (Current)
| Item | Cost | Status |
| ---------------------------------- | ----------- | --------- |
| Razer Blade 18 RTX 5090 | $5,200 | Purchased |
| Electricity (~200W avg, 8 hrs/day) | ~$15/month | Ongoing |
| **Total first year** | **~$5,380** | |
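
The electricity line is average power times hours times the price per kWh. The table does not state the rate, so the sketch below assumes about $0.31/kWh, which is what makes ~200 W for 8 hours/day come out near $15/month. Substitute your own tariff.

```python
"""Estimate monthly electricity cost and first-year total cost."""


def monthly_electricity_usd(avg_watts: float, hours_per_day: float,
                            usd_per_kwh: float, days: int = 30) -> float:
    """Energy (kWh) = watts / 1000 * hours; cost = energy * price per kWh."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return round(kwh * usd_per_kwh, 2)


if __name__ == "__main__":
    rate = 0.31  # assumed $/kWh; not stated in the table above
    monthly = monthly_electricity_usd(200, 8, rate)
    first_year = 5200 + 12 * monthly  # laptop price from the table
    print(f"~${monthly:.2f}/month, ~${first_year:,.0f} in the first year")
```

At half that rate the electricity line roughly halves, which barely moves the first-year total because the purchase price dominates.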
### Desktop Upgrade (Future)
| Item | Estimated Cost |
| -------------------------- | -------------- |
| Desktop tower + PSU + RAM | ~$1,500 |
| RTX 5090 desktop GPU | ~$2,000 |
| Second RTX 5090 (optional) | ~$2,000 |
| **Total (2× GPU desktop)** | **~$5,500** |
### Cloud Alternative (Per-Use)
| Usage Pattern | Monthly Cost |
| ----------------------- | ------------ |
| 10 hours/month on A100 | ~$11 |
| 40 hours/month on A100 | ~$44 |
| 160 hours/month on A100 | ~$176 |
| Always-on A100 | ~$792 |
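**Break-even vs desktop:** at ~$1.10/hour, the ~$5,500 desktop equals about 5,000 A100-hours: roughly 7 months if always-on, ~31 months at 160 hours/month, and over 10 years at 40 hours/month.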
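
Break-even depends heavily on hours used per month, so it is worth computing for your own usage. This sketch uses the figures from the tables above (~$5,500 desktop, ~$1.10/hour A100). It ignores electricity, resale value, and the performance difference between an A100 and a desktop GPU.

```python
"""Months until a one-time hardware purchase costs less than hourly rental."""


def break_even_months(hardware_usd: float, cloud_usd_per_hour: float,
                      hours_per_month: float) -> float:
    """Return the hardware cost divided by the monthly cloud bill."""
    if hours_per_month <= 0 or cloud_usd_per_hour <= 0:
        raise ValueError("hours and hourly price must be positive")
    return hardware_usd / (cloud_usd_per_hour * hours_per_month)


if __name__ == "__main__":
    for hours in (40, 160, 720):  # 720 h is roughly always-on
        months = break_even_months(5500, 1.10, hours)
        print(f"{hours:>3} h/month: break-even after ~{months:.0f} months")
```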
---
## Next Steps
- [ ] Write all training and inference scripts to be GPU-count agnostic
- [ ] Install and test vLLM on single GPU with Llama 3.1 8B
- [ ] Practice monitoring GPU memory and compute utilization
- [ ] Try model offloading: run a 70B model with partial CPU/GPU split
- [ ] Explore Lambda Labs or RunPod for a cloud GPU test ($5–10 experiment)
- [ ] Benchmark single GPU throughput to establish a baseline for comparison


@ -0,0 +1,52 @@
# RTX 5090 Capabilities — Deep Dive Guides
> What you can do with the Razer Blade 18's RTX 5090 (24 GB GDDR7) that you can't (or can't do well) on the Mac.
Each guide covers: **what it is → how to use it → real-world use cases → benefits → skills you'll learn.**
---
## Guides
| # | Capability | Key Benefit | Skill Level |
| --------------------------------------- | --------------------------------- | ------------------------------------ | ------------ |
| [01](01-gpu-inference-speed.md)         | **GPU Inference Speed**           | 2–4× faster LLM responses            | Beginner     |
| [02](02-whisper-batch-transcription.md) | **Whisper Batch Transcription** | Hours of audio in minutes | Beginner |
| [03](03-tts-generation-at-scale.md) | **TTS Generation at Scale** | Faster-than-realtime voice synthesis | Beginner |
| [04](04-fine-tuning-training.md) | **Fine-Tuning / Training** | Customize models on your own data | Intermediate |
| [05](05-cuda-tensorrt-ml-research.md) | **CUDA / TensorRT / ML Research** | Full NVIDIA ML toolchain | Intermediate |
| [06](06-stable-diffusion-image-gen.md)  | **Stable Diffusion / Image Gen**  | 5–8s per image, unlimited free       | Beginner     |
| [07](07-multi-gpu-workloads.md) | **Multi-GPU Workloads (Future)** | Scaling path to production | Advanced |
---
## Suggested Learning Order
```
Week 1: 01 (Inference) → 02 (Whisper) → 03 (TTS)
Get familiar with the GPU, benchmark your models
Week 2: 06 (Stable Diffusion)
Set up ComfyUI, generate app assets
Week 3: 04 (Fine-Tuning)
QLoRA your first 7B model on your own code
Week 4: 05 (CUDA / TensorRT)
Deeper GPU programming, profiling, optimization
Ongoing: 07 (Multi-GPU)
Reference as you plan scaling
```
---
## Prerequisites
All guides assume you've completed the [Windows setup](../setup-guide.md):
- NVIDIA drivers installed (Windows)
- Ollama installed and running (Windows)
- WSL2 Ubuntu 24.04 set up
- Repo cloned, `setup-tts.sh` completed
- Dashboard running at `http://localhost:3000`
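
A quick way to confirm part of this list is a small check script. This sketch only tests whether the named executables are on `PATH` and whether the dashboard URL answers. The helper names are illustrative, and whether `ollama` is visible from WSL2 depends on how your PATH is set up.

```python
"""Check some of the prerequisites listed above."""
import shutil
import urllib.error
import urllib.request


def commands_on_path(names: list[str]) -> dict[str, bool]:
    """Report which executables can be found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def url_responds(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, just not with a 2xx status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


if __name__ == "__main__":
    for name, found in commands_on_path(["nvidia-smi", "ollama"]).items():
        print(f"{name}: {'found' if found else 'NOT found on PATH'}")
    dashboard = "http://localhost:3000"
    print(f"{dashboard}: {'up' if url_responds(dashboard) else 'not reachable'}")
```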