docs(local-llms): add 7 RTX 5090 capability deep-dive guides
New capabilities/ subfolder with detailed guides:

- 01: GPU inference speed (benchmarks, Ollama tuning, API usage)
- 02: Whisper batch transcription (scripts, Python integration, use cases)
- 03: TTS generation at scale (Orpheus + Qwen3, batch scripts, voice cloning)
- 04: Fine-tuning / training (LoRA, QLoRA, data prep, Ollama export)
- 05: CUDA / TensorRT / ML research (toolchain setup, Triton kernels, profiling)
- 06: Stable Diffusion / image gen (ComfyUI, SDXL, FLUX, batch generation)
- 07: Multi-GPU workloads (scaling path, eGPU, cloud, cost planning)
- README: index with learning order and prerequisites

Each guide covers: what it is, how to use it, benefits, skills to learn
parent 1650e0da6c
commit 6d18344fe0
@ -0,0 +1,174 @@
|
||||
# 1. Raw GPU Inference Speed
|
||||
|
||||
> **RTX 5090:** 2–4× faster inference on all models ≤32B compared to Mac M4 Pro
|
||||
> **Why it matters:** Faster coding suggestions, faster conversations, faster iteration
|
||||
|
||||
---
|
||||
|
||||
## What Is GPU Inference?
|
||||
|
||||
When you run a model like `qwen2.5-coder:32b` through Ollama, the GPU does the heavy lifting — multiplying billions of numbers (matrix operations) to generate each token of the response. The speed of this process depends on:
|
||||
|
||||
1. **VRAM bandwidth** — how fast data moves within the GPU
|
||||
2. **Compute cores** — how many operations run in parallel
|
||||
3. **VRAM capacity** — whether the full model fits without spilling to CPU RAM
|
||||
|
||||
```
┌──────────────────────────────────────────────────────────────────────┐
│                      Token Generation Pipeline                       │
│                                                                      │
│  Prompt → [Tokenize] → [GPU: Matrix Multiply] → [Sample] → Token     │
│                                  ▲                                   │
│                                  │                                   │
│                       This is the bottleneck.                        │
│                   RTX 5090 does this 2–4× faster.                    │
└──────────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
---
|
||||
|
||||
## Performance: Mac vs Razer
|
||||
|
||||
| Model                     | Mac M4 Pro (Metal) | Razer RTX 5090 (CUDA) | Speedup |
| ------------------------- | ------------------ | --------------------- | ------- |
| llama3.1:8b (4.9 GB)      | ~50–70 tok/s       | ~100–150 tok/s        | ~2×     |
| qwen2.5-coder:7b (4.7 GB) | ~40–60 tok/s       | ~80–120 tok/s         | ~2×     |
| qwen2.5-coder:32b (19 GB) | ~15–25 tok/s       | ~40–60 tok/s          | ~2.5×   |
| deepseek-r1:32b (19 GB)   | ~15–25 tok/s       | ~40–60 tok/s          | ~2.5×   |
| sematre/orpheus:en (4 GB) | ~realtime          | ~2–3× realtime        | ~2.5×   |
|
||||
|
||||
### Why the RTX 5090 Is Faster
|
||||
|
||||
```
┌─────────────────────────┬──────────────────────┬──────────────────────────┐
│ Metric                  │ Mac M4 Pro           │ RTX 5090 (laptop GPU)    │
├─────────────────────────┼──────────────────────┼──────────────────────────┤
│ GPU memory bandwidth    │ ~273 GB/s (shared)   │ ~896 GB/s (GDDR7)        │
│ Compute cores           │ 20 GPU cores (Metal) │ ~10,500 CUDA cores       │
│ Tensor cores            │ None (Neural Engine) │ 5th-gen Tensor cores     │
│ FP16 throughput         │ ~25 TFLOPS           │ ~200+ TFLOPS             │
│ Model in memory?        │ Yes (unified 48 GB)  │ Yes (24 GB VRAM)         │
└─────────────────────────┴──────────────────────┴──────────────────────────┘
```

The RTX 5090 laptop GPU's GDDR7 bandwidth is roughly 3× that of the M4 Pro, and its raw FP16 compute throughput is ~8× higher. The actual speedup is 2–4× (not 8×) because token generation is mostly **memory-bandwidth bound**, not compute-bound: for every generated token the GPU has to read essentially all of the model weights, so it spends most of its time moving data, not computing.
|
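To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes that generating one token reads every model weight exactly once, so the ceiling on tokens per second is bandwidth divided by model size. The function name and the bandwidth figures (273 GB/s and 896 GB/s) are illustrative assumptions, not measurements, and the sketch ignores KV-cache reads and kernel overhead.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed when each token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed inputs: a 4.7 GB quantized 7B model and approximate peak bandwidths.
for name, bandwidth in [("Mac M4 Pro", 273.0), ("RTX 5090 laptop", 896.0)]:
    ceiling = max_tokens_per_second(bandwidth, model_size_gb=4.7)
    print(f"{name}: ceiling of about {ceiling:.0f} tok/s for a 4.7 GB model")
```

Measured tok/s should land at or below this ceiling, which is why the bandwidth ratio predicts the speedup better than the TFLOPS ratio does.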
||||
|
||||
---
|
||||
|
||||
## How to Use It
|
||||
|
||||
### Basic: Ollama (Already Set Up)
|
||||
|
||||
Ollama runs natively on Windows and uses CUDA automatically. No extra config needed.
|
||||
|
||||
```bash
|
||||
# From WSL2 or Windows terminal
|
||||
ollama run qwen2.5-coder:32b "Write a Fastify route that validates input with Zod"
|
||||
```
|
||||
|
||||
### Interactive Coding Assistant
|
||||
|
||||
```bash
|
||||
# Start a conversation with the 32B coding model
|
||||
ollama run qwen2.5-coder:32b
|
||||
|
||||
# Or use the 7B model for quick questions (faster response start)
|
||||
ollama run qwen2.5-coder:7b
|
||||
```
|
||||
|
||||
### From the Dashboard
|
||||
|
||||
```bash
|
||||
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
|
||||
bash start-dashboard.sh
|
||||
# Open http://localhost:3000 — model status + inference visible
|
||||
```
|
||||
|
||||
### Benchmark Your Actual Speed
|
||||
|
||||
```bash
# Quick benchmark: measure tokens per second
time ollama run qwen2.5-coder:7b "Write a Python function that implements binary search" --verbose 2>&1 | tail -5

# Compare models
for model in llama3.1:8b qwen2.5-coder:7b qwen2.5-coder:32b; do
  echo "=== $model ==="
  ollama run "$model" "Hello world" --verbose 2>&1 | grep "eval rate"
done
```
|
||||
|
||||
### API Access (for Scripts/Apps)
|
||||
|
||||
```bash
# Ollama exposes a REST API at localhost:11434
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "Explain CUDA memory hierarchy in 3 sentences",
  "stream": false
}' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
```
|
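The same request can be made from Python with only the standard library. The sketch below also derives tokens per second from the `eval_count` and `eval_duration` (nanoseconds) fields of the non-streaming response. The helper names `generate`, `tokens_per_second`, and `main` are illustrative, and `main` is defined but not called so the block can be imported without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(model: str, prompt: str, timeout: float = 300.0) -> dict:
    """POST a non-streaming request to the local Ollama server; return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def tokens_per_second(response: dict) -> float:
    """Decode speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


def main() -> None:
    """Needs a running Ollama server with the model already pulled."""
    result = generate("qwen2.5-coder:32b", "Explain CUDA memory hierarchy in 3 sentences")
    print(result["response"])
    print(f"{tokens_per_second(result):.1f} tok/s")
```

Call `main()` on the machine that runs Ollama to print the answer followed by the measured decode speed.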
||||
|
||||
---
|
||||
|
||||
## Benefits
|
||||
|
||||
### For Your LysnrAI / MindLyst Projects
|
||||
|
||||
- **Faster code generation:** a short (~50-token) completion from the 32B model takes about 1s at 40–60 tok/s, vs ~2–3s at 15–25 tok/s on the Mac
|
||||
- **More context in less time** — can process longer prompts without waiting
|
||||
- **Better for agentic workflows** — LangGraph agents that call LLMs multiple times per step run 2–4× faster end-to-end
|
||||
- **Batch processing** — generate embeddings, summaries, or classifications for hundreds of items quickly
|
||||
|
||||
### For Daily Coding
|
||||
|
||||
- **Near-instant small model responses:** 7B at 80–120 tok/s streams text far faster than you can read it
|
||||
- **Viable 32B coding assistant** — 40–60 tok/s is fast enough for real-time pair programming
|
||||
- **DeepSeek-R1 reasoning** — chain-of-thought at 40–60 tok/s makes complex reasoning practical
|
||||
|
||||
---
|
||||
|
||||
## Skills You'll Learn
|
||||
|
||||
| Skill | What You'll Learn | Why It Matters |
|
||||
| -------------------------- | -------------------------------------------------------------------- | -------------------------------------- |
|
||||
| **GPU memory management** | How VRAM allocation works, model offloading, quantization trade-offs | Core ML engineering skill |
|
||||
| **CUDA profiling** | Using `nvidia-smi`, `nvtop`, watching GPU utilization | Essential for optimizing AI workloads |
|
||||
| **Quantization** | Q4 vs Q8 vs FP16 — speed/quality trade-offs | Industry-standard model deployment |
|
||||
| **Inference optimization** | Batch size, context length, KV cache tuning | Key for production AI systems |
|
||||
| **Model selection** | When to use 7B vs 32B vs 70B for different tasks | Practical AI engineering judgment |
|
||||
| **REST API integration** | Building apps that call local LLM APIs | Directly applicable to LysnrAI backend |
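The quantization and model-selection rows above come down to simple arithmetic: weight size is roughly parameters × bits per weight / 8. The sketch below is an estimate under stated assumptions. The bits-per-weight values and the 2 GB overhead for KV cache and buffers are guesses for illustration, not figures published for any specific model file.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters × bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(size_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed overhead (KV cache, buffers) fit in VRAM."""
    return size_gb + overhead_gb <= vram_gb


# Assumed bits per weight: ~4.5 for a Q4 variant, ~8.5 for Q8, 16 for FP16.
for label, bits in [("Q4", 4.5), ("Q8", 8.5), ("FP16", 16.0)]:
    size = model_size_gb(32, bits)
    print(f"32B @ {label}: ~{size:.0f} GB, fits in 24 GB: {fits_in_vram(size, 24)}")
```

Under these assumptions only the Q4 variant of a 32B model fits in 24 GB, which is consistent with the ~19 GB download sizes listed earlier.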
|
||||
|
||||
---
|
||||
|
||||
## Advanced: Tuning Ollama for Performance
|
||||
|
||||
```bash
# Number of model layers offloaded to the GPU (default: as many as fit).
# A lower value leaves VRAM free, e.g. to run 2 models with partial GPU offload.
# In Ollama this is normally the `num_gpu` model option; check that your
# version honors the OLLAMA_NUM_GPU variable before relying on it.
OLLAMA_NUM_GPU=999 ollama serve

# Monitor GPU during inference
watch -n 0.5 nvidia-smi

# Or install nvtop for a richer GPU monitor
sudo apt install nvtop
nvtop
```
|
||||
|
||||
### Ollama Environment Variables
|
||||
|
||||
| Variable                   | Default                   | Description                                           |
| -------------------------- | ------------------------- | ----------------------------------------------------- |
| `OLLAMA_NUM_PARALLEL`      | 1 or 4 (varies by version) | Concurrent request slots per loaded model            |
| `OLLAMA_MAX_LOADED_MODELS` | 3 per GPU                 | Max models kept loaded at once (if they fit in VRAM)  |
| `OLLAMA_FLASH_ATTENTION`   | off (varies by version)   | Set to `1` to use flash attention (faster, less VRAM) |
| `OLLAMA_GPU_OVERHEAD`      | 0                         | Reserve VRAM (bytes) for other apps                   |
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Benchmark all 5 models on the Razer and record actual tok/s
|
||||
- [ ] Try `OLLAMA_NUM_PARALLEL=4` for concurrent requests
|
||||
- [ ] Experiment with `OLLAMA_MAX_LOADED_MODELS=2` to keep 7B + 32B hot
|
||||
- [ ] Build a simple script that compares Mac vs Razer inference times
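For the `OLLAMA_NUM_PARALLEL` experiment, a small client that fires several prompts at once is enough to see whether the extra slots help. The sketch below keeps the transport pluggable: `send` is any callable that takes a prompt and returns text. `fake_send` is a stand-in invented for this example so the code runs without a server; swap in a real HTTP call to measure actual throughput.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_concurrently(
    prompts: List[str], send: Callable[[str], str], workers: int = 4
) -> Tuple[List[str], float]:
    """Send all prompts through `send` on a thread pool; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send, prompts))  # map preserves prompt order
    return results, time.perf_counter() - start


def fake_send(prompt: str) -> str:
    """Hypothetical stand-in for a real Ollama call (sleeps instead of generating)."""
    time.sleep(0.05)
    return prompt.upper()


results, elapsed = run_concurrently(["a", "b", "c", "d"], fake_send, workers=4)
print(results, f"{elapsed:.2f}s")
```

Comparing the elapsed time for `workers=1` against `workers=4` with a real `send` shows how much the server-side parallel slots contribute.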
|
||||
@ -0,0 +1,306 @@
|
||||
# 2. Whisper Batch Transcription
|
||||
|
||||
> **RTX 5090:** 8–15× realtime transcription vs 2–4× on Mac
|
||||
> **Why it matters:** Hours of audio transcribed in minutes — unlocks bulk audio workflows
|
||||
|
||||
---
|
||||
|
||||
## What Is Whisper?
|
||||
|
||||
[Whisper](https://github.com/openai/whisper) is OpenAI's open-source speech-to-text model. We use [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — a high-performance C/C++ implementation that supports CUDA GPU acceleration.
|
||||
|
||||
The `large-v3-turbo` model (~1.5 GB) supports 99 languages. Accuracy is close to human level on clean English audio and lower for languages with less training data.
|
||||
|
||||
```
┌──────────────────────────────────────────────────────────────────────┐
│                    Whisper Transcription Pipeline                    │
│                                                                      │
│  Audio File (.wav/.mp3)                                              │
│       │                                                              │
│       ▼                                                              │
│  [FFmpeg: decode + resample to 16kHz mono]                           │
│       │                                                              │
│       ▼                                                              │
│  [Whisper: Mel spectrogram → Encoder → Decoder → Text]               │
│       │              ▲                                               │
│       │              │ GPU accelerates this (CUDA or Metal)          │
│       ▼                                                              │
│  Transcript (.txt / .srt / .vtt / .json)                             │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
---
|
||||
|
||||
## Performance: Mac vs Razer
|
||||
|
||||
| Audio Length | Mac M4 Pro (Metal) | Razer RTX 5090 (CUDA) | Speedup |
| ------------ | ------------------ | --------------------- | ------- |
| 1 minute     | ~15–30s            | ~4–8s                 | ~3–4×   |
| 10 minutes   | ~2.5–5 min         | ~40–80s               | ~3–4×   |
| 1 hour       | ~15–30 min         | ~4–8 min              | ~3–4×   |
| 10 hours     | ~2.5–5 hours       | ~40–80 min            | ~3–4×   |
| 100 hours    | ~1–2 days          | ~7–13 hours           | ~3–4×   |

### Realtime Multiplier

| Machine        | Speed           | Meaning                  |
| -------------- | --------------- | ------------------------ |
| Mac M4 Pro     | ~2–4× realtime  | 1 hour audio → 15–30 min |
| Razer RTX 5090 | ~8–15× realtime | 1 hour audio → 4–8 min   |
|
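The realtime multiplier converts directly into wall-clock time: processing time is audio length divided by the multiplier. A minimal sketch of that arithmetic, using the multipliers quoted in this guide as inputs (they are estimates, not benchmark results):

```python
def processing_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Minutes needed to transcribe `audio_minutes` of audio at `realtime_factor`× realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes / realtime_factor


# One hour of audio at the slow and fast ends of each quoted range.
for machine, (slow, fast) in {"Mac M4 Pro": (2, 4), "Razer RTX 5090": (8, 15)}.items():
    worst = processing_minutes(60, slow)
    best = processing_minutes(60, fast)
    print(f"{machine}: {best:.1f} to {worst:.1f} min per hour of audio")
```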
||||
|
||||
---
|
||||
|
||||
## How to Use It
|
||||
|
||||
### Single File Transcription
|
||||
|
||||
```bash
# Basic transcription
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav

# With timestamps (SRT format for subtitles)
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -osrt

# Full JSON output (adds per-token detail such as timestamps)
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -ojf

# Specify language (skip auto-detect for speed)
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l en
```
|
||||
|
||||
### Batch Transcription Script
|
||||
|
||||
Create this script to transcribe an entire folder of audio files:
|
||||
|
||||
```bash
#!/bin/bash
# batch-transcribe.sh: transcribe all audio files in a directory
# Usage: bash batch-transcribe.sh /path/to/audio/files

INPUT_DIR="${1:-.}"
MODEL="$HOME/whisper-models/ggml-large-v3-turbo.bin"
OUTPUT_DIR="${INPUT_DIR}/transcripts"

mkdir -p "$OUTPUT_DIR"

echo "=== Batch Whisper Transcription ==="
echo "Input:  $INPUT_DIR"
echo "Output: $OUTPUT_DIR"
echo "Model:  ggml-large-v3-turbo"
echo ""

# With nullglob, patterns that match nothing expand to nothing
# (a redirect such as 2>/dev/null is a syntax error inside "for ... in")
shopt -s nullglob
FILES=("$INPUT_DIR"/*.{wav,mp3,m4a,flac,ogg,webm})
TOTAL=${#FILES[@]}
DONE=0
START_TIME=$(date +%s)

echo "Found $TOTAL audio files"
echo ""

for f in "${FILES[@]}"; do
  [ -f "$f" ] || continue
  DONE=$((DONE + 1))

  BASENAME=$(basename "${f%.*}")
  echo "[$DONE/$TOTAL] Transcribing: $(basename "$f")"

  # A single pass writes both the plain-text transcript and the SRT subtitles
  if whisper-cli -m "$MODEL" -f "$f" -l en -otxt -osrt \
      -of "$OUTPUT_DIR/$BASENAME" 2>/dev/null; then
    echo "  -> $OUTPUT_DIR/$BASENAME.txt"
    echo "  -> $OUTPUT_DIR/$BASENAME.srt"
  else
    echo "  !! whisper-cli failed for $f (try converting it to 16kHz mono WAV)" >&2
  fi
  echo ""
done

END_TIME=$(date +%s)
ELAPSED=$((END_TIME - START_TIME))
echo "=== Done! $DONE files in ${ELAPSED}s ==="
```
|
||||
|
||||
### Convert Non-WAV Audio First
|
||||
|
||||
```bash
# Whisper works best with 16kHz mono WAV
# Convert any audio format with ffmpeg

# Single file
ffmpeg -i podcast.mp3 -ar 16000 -ac 1 podcast.wav

# Batch convert all MP3s in a folder
for f in *.mp3; do
  ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mp3}.wav"
done
```
|
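When the folder holds a mix of formats, the same conversion can be driven from Python. This is a sketch under assumptions: the extension list and the helper names `ffmpeg_command` and `convert_folder` are made up for illustration, and `-n` tells ffmpeg not to overwrite a WAV that already exists (such files then count as not converted).

```python
import subprocess
from pathlib import Path
from typing import List

AUDIO_EXTENSIONS = {".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def ffmpeg_command(src: Path) -> List[str]:
    """Build the ffmpeg call that resamples `src` to a 16 kHz mono WAV next to it."""
    dst = src.with_suffix(".wav")
    return ["ffmpeg", "-n", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def convert_folder(folder: Path) -> int:
    """Convert every supported audio file in `folder`; return how many succeeded."""
    converted = 0
    for src in sorted(folder.iterdir()):
        if src.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        if subprocess.run(ffmpeg_command(src), capture_output=True).returncode == 0:
            converted += 1
    return converted
```

Calling `convert_folder(Path("recordings"))` (a hypothetical folder name) converts everything in place and reports the number of new WAV files.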
||||
|
||||
### Python Integration
|
||||
|
||||
```python
"""Transcribe audio using whisper.cpp via subprocess."""
import json
import subprocess
from pathlib import Path


def transcribe(audio_path: str, language: str = "en") -> dict:
    """Transcribe an audio file and return the parsed whisper.cpp JSON."""
    model = Path.home() / "whisper-models" / "ggml-large-v3-turbo.bin"
    output_base = Path(audio_path).stem  # output lands in the current directory

    result = subprocess.run(
        [
            "whisper-cli",
            "-m", str(model),
            "-f", audio_path,
            "-l", language,
            "-ojf",  # JSON with full metadata
            "-of", output_base,
        ],
        capture_output=True, text=True, timeout=600,
    )

    # Read the JSON output; fail loudly instead of returning a partial dict
    json_path = Path(f"{output_base}.json")
    if result.returncode != 0 or not json_path.exists():
        raise RuntimeError(f"whisper-cli failed: {result.stderr.strip()}")

    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


# Usage: join every segment, not just the first one
result = transcribe("meeting-recording.wav")
print("".join(seg["text"] for seg in result["transcription"]).strip())
```
|
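If an application stores only the JSON, subtitles can still be produced later without re-running Whisper. The sketch below assumes each entry of `transcription` carries millisecond `offsets` (`from`, `to`) and a `text` field; that shape is an assumption about the JSON file, so check it against a real output before relying on it. The sample data is hand-written, not real model output.

```python
from typing import Dict, List


def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments with millisecond offsets as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["offsets"]["from"])
        end = srt_timestamp(seg["offsets"]["to"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


# Hand-written sample in the assumed shape.
sample = [
    {"offsets": {"from": 0, "to": 2500}, "text": " Hello there."},
    {"offsets": {"from": 2500, "to": 5000}, "text": " Goodbye."},
]
print(segments_to_srt(sample))
```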
||||
|
||||
---
|
||||
|
||||
## Real-World Use Cases
|
||||
|
||||
### 1. LysnrAI Voice Dictation Pipeline
|
||||
|
||||
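Your LysnrAI desktop app captures voice → sends to Whisper for transcription. On the Razer, this becomes near-instant, provided the model stays loaded between requests (every separate `whisper-cli` run has to load the ~1.5 GB model again):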
|
||||
|
||||
```
|
||||
Voice input (5 seconds) → Whisper (CUDA) → Text in <1 second
|
||||
```
|
||||
|
||||
### 2. Meeting Transcription
|
||||
|
||||
```bash
|
||||
# Record a 1-hour Zoom meeting → get full transcript in ~5 minutes
|
||||
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
|
||||
-f zoom-meeting.wav -l en -otxt -osrt
|
||||
```
|
||||
|
||||
### 3. Podcast / YouTube Processing
|
||||
|
||||
```bash
# Download YouTube audio
yt-dlp -x --audio-format wav "https://youtube.com/watch?v=..." -o "video.%(ext)s"

# Resample to 16kHz mono, the format Whisper works best with
ffmpeg -i video.wav -ar 16000 -ac 1 video-16k.wav

# Transcribe
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f video-16k.wav -otxt -osrt
```
|
||||
|
||||
### 4. Subtitle Generation
|
||||
|
||||
```bash
|
||||
# Generate SRT subtitles for video
|
||||
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
|
||||
-f movie.wav -l en -osrt
|
||||
|
||||
# Output: movie.srt — ready to import into video editors
|
||||
```
|
||||
|
||||
### 5. Multi-Language Transcription
|
||||
|
||||
```bash
|
||||
# Auto-detect language
|
||||
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
|
||||
|
||||
# Force specific language
|
||||
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ja # Japanese
|
||||
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ta # Tamil
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Benefits
|
||||
|
||||
| Benefit | Description |
|
||||
| --------------- | ------------------------------------------------------------ |
|
||||
| **Speed** | Process a full day of meetings in under an hour |
|
||||
| **Privacy** | All transcription runs locally — no data leaves your machine |
|
||||
| **Cost** | Zero API costs (vs $0.006/min for cloud Whisper API) |
|
||||
| **Accuracy** | large-v3-turbo is near-human accuracy for English |
|
||||
| **Offline** | Works without internet — useful on flights, trains |
|
||||
| **Batch scale** | Process hundreds of files overnight |
|
||||
|
||||
### Cost Comparison (100 hours of audio)
|
||||
|
||||
| Method | Cost | Time |
|
||||
| -------------------------- | ------ | ---------------- |
|
||||
| OpenAI Whisper API | ~$36 | ~minutes (cloud) |
|
||||
| Mac M4 Pro (local) | $0 | ~25–50 hours |
|
||||
| **Razer RTX 5090 (local)** | **$0** | **~7–13 hours** |
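The API figure in the table is plain arithmetic on the per-minute rate quoted in the Benefits table. A small sketch, with the rate as a parameter since published pricing can change:

```python
def api_cost_usd(audio_hours: float, usd_per_minute: float = 0.006) -> float:
    """Cloud transcription cost for `audio_hours` of audio at a per-minute rate."""
    return audio_hours * 60 * usd_per_minute


for hours in (1, 10, 100):
    print(f"{hours:>3} h of audio: ${api_cost_usd(hours):.2f}")
```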
|
||||
|
||||
---
|
||||
|
||||
## Skills You'll Learn
|
||||
|
||||
| Skill | What You'll Learn | Career Value |
|
||||
| ---------------------------- | ------------------------------------------------------ | -------------------------------------- |
|
||||
| **Audio processing** | Sample rates, codecs, mono/stereo, WAV vs compressed | Foundational for any audio/speech work |
|
||||
| **Speech-to-text pipelines** | Mel spectrograms, encoder-decoder models, beam search | Core ML/NLP skill |
|
||||
| **CUDA acceleration** | How GPU parallelism speeds up neural network inference | Top ML engineering skill |
|
||||
| **Batch processing** | Shell scripting for processing thousands of files | DevOps / data engineering |
|
||||
| **Subtitle formats** | SRT, VTT, JSON — standards for timed text | Media tech / accessibility |
|
||||
| **Model quantization** | Understanding why ggml models are smaller and faster | ML deployment knowledge |
|
||||
| **ffmpeg mastery** | Audio/video conversion, resampling, format detection | Essential multimedia tool |
|
||||
| **Python subprocess** | Integrating CLI tools into Python applications | Backend engineering pattern |
|
||||
|
||||
---
|
||||
|
||||
## Monitoring GPU During Transcription
|
||||
|
||||
```bash
|
||||
# Watch GPU utilization in real-time
|
||||
watch -n 0.5 nvidia-smi
|
||||
|
||||
# Or use nvtop for a richer view
|
||||
sudo apt install nvtop
|
||||
nvtop
|
||||
|
||||
# Expected during Whisper:
|
||||
# GPU Utilization: 80–95%
|
||||
# VRAM Usage: ~2–3 GB (model + buffers)
|
||||
# Power: ~150–200W
|
||||
```
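To log these numbers instead of watching them, `nvidia-smi` can emit CSV through its `--query-gpu` and `--format` options. The sketch below parses one such line. The dictionary keys and helper names are made up for this example, and it assumes every field is numeric, which may not hold on machines where `power.draw` reports `[N/A]`.

```python
import subprocess
from typing import Dict

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> Dict[str, float]:
    """Parse one CSV line: utilization (%), memory used (MiB), power draw (W)."""
    util, mem, power = (float(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_mib": mem, "power_w": power}


def sample_gpu() -> Dict[str, float]:
    """Query the first GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_line(out.splitlines()[0])


# Parsing a hand-written sample line (not captured from real hardware).
print(parse_gpu_line("87, 2450, 162.31"))
```

Calling `sample_gpu()` in a loop with a short sleep while a transcription runs gives a simple utilization log.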
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Transcribe a test audio file and verify output quality
|
||||
- [ ] Create `batch-transcribe.sh` in `__LOCAL_LLMs/scripts/`
|
||||
- [ ] Benchmark: time a 10-minute file on Razer vs Mac
|
||||
- [ ] Try multi-language transcription (English + Tamil)
|
||||
- [ ] Integrate Whisper output into LysnrAI transcription pipeline
|
||||
- [ ] Experiment with `whisper-cli --translate` for translation mode
|
||||
@ -0,0 +1,303 @@
|
||||
# 3. TTS Generation at Scale
|
||||
|
||||
> **RTX 5090:** Qwen3-TTS at 2–4× realtime, Orpheus at 2–3× realtime
|
||||
> **Why it matters:** Pre-generate audio libraries, build voice features, create content — all faster than real-time playback
|
||||
|
||||
---
|
||||
|
||||
## What Is Local TTS?
|
||||
|
||||
Text-to-Speech (TTS) converts written text into natural-sounding speech. Our stack has two engines:
|
||||
|
||||
| Engine | Architecture | Size | Voices | Quality |
|
||||
| ------------- | ------------------------------- | ----------- | ----------------------- | ------------------------------------- |
|
||||
| **Orpheus** | Ollama-served, SNAC decoder | 4 GB | 8 English voices | Extremely natural, emotional |
|
||||
| **Qwen3-TTS** | PyTorch model, direct inference | 0.6B params | 10 languages, cloneable | Multilingual, zero-shot voice cloning |
|
||||
|
||||
```
┌──────────────────────────────────────────────────────────────────────┐
│                             TTS Pipeline                             │
│                                                                      │
│  Orpheus:                                                            │
│  Text → [Ollama: generate audio tokens] → [SNAC: decode to WAV]      │
│           ▲ GPU (CUDA)                      ▲ GPU (CUDA)             │
│                                                                      │
│  Qwen3-TTS:                                                          │
│  Text → [PyTorch model: text→mel→audio] → WAV file                   │
│           ▲ GPU (CUDA / MPS)                                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
---
|
||||
|
||||
## Performance: Mac vs Razer
|
||||
|
||||
| Engine | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
|
||||
| ---------------------------- | ---------------- | --------------------- | ------- |
|
||||
| Orpheus (per sentence) | ~realtime | ~2–3× realtime | ~2.5× |
|
||||
| Qwen3-TTS (per sentence) | ~realtime | ~2–4× realtime | ~3× |
|
||||
| Orpheus (10 min narration) | ~10 min | ~3–5 min | ~2.5× |
|
||||
| Qwen3-TTS (10 min narration) | ~10 min | ~2.5–5 min | ~3× |
|
||||
| Batch: 100 sentences | ~5–8 min | ~2–3 min | ~3× |
|
||||
| Batch: 1000 sentences | ~50–80 min | ~15–25 min | ~3× |
|
||||
|
||||
**"2–4× realtime" means:** A 10-second sentence generates in 2.5–5 seconds. The audio is produced faster than you could listen to it.
|
||||
|
||||
---
|
||||
|
||||
## How to Use It
|
||||
|
||||
### Orpheus TTS (8 Natural Voices)
|
||||
|
||||
Orpheus runs through Ollama + SNAC decoder. Already set up by `setup-tts.sh`.
|
||||
|
||||
```bash
|
||||
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
|
||||
|
||||
# Generate speech with default voice (tara)
|
||||
.venv-qwen-tts/bin/python test_orpheus_tts.py
|
||||
|
||||
# Output: test_orpheus_tara.wav, test_orpheus_leah.wav, etc.
|
||||
# Play: aplay test_orpheus_tara.wav
|
||||
```
|
||||
|
||||
#### Available Orpheus Voices
|
||||
|
||||
| Voice | Character | Best For |
|
||||
| ------ | ------------------ | --------------------- |
|
||||
| `tara` | Young female, warm | Narration, assistants |
|
||||
| `leah` | Female, clear | Tutorials, guides |
|
||||
| `jess` | Female, energetic | Announcements |
|
||||
| `leo` | Male, calm | Narration, podcasts |
|
||||
| `dan` | Male, professional | Business content |
|
||||
| `mia` | Female, friendly | Conversational |
|
||||
| `zac` | Male, young | Casual content |
|
||||
| `zoe` | Female, neutral | General purpose |
|
||||
|
||||
#### Custom Text with Orpheus (Python)
|
||||
|
||||
```python
"""Generate speech from custom text using Orpheus TTS."""
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate_speech(text: str, voice: str = "tara", output_path: str = "output.wav") -> str:
    """Request Orpheus audio tokens from Ollama for `text` spoken by `voice`.

    Returns the raw audio-token text. Writing `output_path` still needs the
    SNAC decoding step, which lives in test_orpheus_tts.py.
    """
    prompt = f"<|audio|>{voice}: {text}"

    payload = json.dumps({
        "model": "sematre/orpheus:en",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.6, "top_p": 0.9},
    }).encode()

    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(req, timeout=120) as resp:
        result = json.loads(resp.read())

    # Decode audio tokens → SNAC → WAV
    # (Full implementation in test_orpheus_tts.py)
    print(f"Received audio tokens for: {output_path}")
    return result["response"]


# Example
generate_speech(
    "Welcome to LysnrAI. Your voice-first productivity assistant.",
    voice="tara",
    output_path="welcome.wav",
)
```
|
||||
|
||||
### Qwen3-TTS (Multilingual + Voice Cloning)
|
||||
|
||||
```bash
|
||||
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
|
||||
|
||||
# Generate speech with Qwen3-TTS
|
||||
.venv-qwen-tts/bin/python test_qwen_tts.py
|
||||
|
||||
# Output: test_qwen3_tts_output.wav
|
||||
# Play: aplay test_qwen3_tts_output.wav
|
||||
```
|
||||
|
||||
#### Qwen3-TTS Features
|
||||
|
||||
| Feature | Description |
|
||||
| ------------------- | ----------------------------------------------------------------------------------------- |
|
||||
| **10 languages** | English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian |
|
||||
| **Voice cloning** | Provide a reference audio clip → model clones the voice |
|
||||
| **Emotion control** | Adjust speaking style via prompt engineering |
|
||||
| **0.6B parameters** | Small enough to run fast, large enough for quality |
|
||||
|
||||
### Batch TTS Generation Script
|
||||
|
||||
```bash
#!/bin/bash
# batch-tts.sh: generate audio for a list of sentences
# Usage: bash batch-tts.sh sentences.txt output_dir/ [voice]

INPUT_FILE="${1:-sentences.txt}"
OUTPUT_DIR="${2:-tts_output}"
VOICE="${3:-tara}"

mkdir -p "$OUTPUT_DIR"

echo "=== Batch Orpheus TTS ==="
echo "Input:  $INPUT_FILE"
echo "Output: $OUTPUT_DIR"
echo "Voice:  $VOICE"
echo ""

LINE_NUM=0
# "|| [ -n "$line" ]" also handles a last line without a trailing newline
while IFS= read -r line || [ -n "$line" ]; do
  [ -z "$line" ] && continue
  LINE_NUM=$((LINE_NUM + 1))
  OUT="$OUTPUT_DIR/sentence_${LINE_NUM}_${VOICE}.wav"

  echo "[$LINE_NUM] Generating: ${line:0:60}..."

  # Pass the text as arguments so quotes in a sentence cannot break the Python code
  .venv-qwen-tts/bin/python -c '
import sys
import test_orpheus_tts as tts
tts.generate_and_save(sys.argv[1], sys.argv[2], sys.argv[3])
' "$line" "$VOICE" "$OUT" </dev/null 2>/dev/null

  echo "  -> $OUT"
done < "$INPUT_FILE"

echo ""
echo "=== Done! $LINE_NUM files generated ==="
```
|
||||
|
||||
---
|
||||
|
||||
## Real-World Use Cases
|
||||
|
||||
### 1. LysnrAI Voice Responses
|
||||
|
||||
Generate spoken responses from LLM output — the Razer can produce audio faster than the user can listen:
|
||||
|
||||
```
User asks question → LLM generates text → TTS converts to speech → User hears answer
                                                    ▲
                                        2–4× realtime on RTX 5090
                                        Feels instant for short responses
```
|
||||
|
||||
### 2. Pre-Generated Audio Libraries
|
||||
|
||||
Build a library of common phrases, greetings, and responses:
|
||||
|
||||
```bash
|
||||
# sentences.txt
|
||||
Welcome to LysnrAI.
|
||||
Your daily brief is ready.
|
||||
You have three new memories to review.
|
||||
Recording started.
|
||||
Recording saved successfully.
|
||||
Transcription complete.
|
||||
```
|
||||
|
||||
```bash
|
||||
# Generate all phrases in multiple voices
|
||||
for voice in tara leo dan; do
|
||||
bash batch-tts.sh sentences.txt audio_library/ "$voice"
|
||||
done
|
||||
```
|
||||
|
||||
### 3. Audiobook / Podcast Generation
|
||||
|
||||
```python
# Split a document into paragraphs and generate audio for each
with open("article.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

for i, para in enumerate(paragraphs):
    generate_speech(para, voice="leo", output_path=f"chapter_{i:03d}.wav")

# Concatenate with ffmpeg
# ffmpeg -f concat -i filelist.txt -c copy full_audiobook.wav
```
|
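The ffmpeg concat step needs a `filelist.txt` with one `file '...'` line per clip. A minimal sketch for generating it follows; the helper name `write_concat_list` is made up for this example, and the quote escaping follows the concat demuxer's single-quote rule.

```python
from pathlib import Path
from typing import Iterable


def write_concat_list(wav_paths: Iterable[Path], list_path: Path) -> int:
    """Write an ffmpeg concat list (one `file '...'` line per clip, sorted by name)."""
    lines = []
    for wav in sorted(wav_paths):
        escaped = str(wav).replace("'", "'\\''")  # escape single quotes for ffmpeg
        lines.append(f"file '{escaped}'")
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For the chapter files above, `write_concat_list(Path(".").glob("chapter_*.wav"), Path("filelist.txt"))` produces the list; zero-padded names such as `chapter_007.wav` keep the sorted order equal to the reading order.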
||||
|
||||
### 4. Multilingual Content (Qwen3-TTS)
|
||||
|
||||
```python
# Generate the same message in multiple languages
messages = {
    "en": "Welcome to MindLyst, your AI-powered life organizer.",
    "ja": "MindLystへようこそ。AIライフオーガナイザーです。",
    "es": "Bienvenido a MindLyst, tu organizador de vida con IA.",
}

for lang, text in messages.items():
    generate_qwen_tts(text, output_path=f"welcome_{lang}.wav")
```
|
||||
|
||||
### 5. Voice Cloning (Qwen3-TTS)
|
||||
|
||||
```python
|
||||
# Clone a voice from a reference audio sample
|
||||
# Provide a 5–15 second reference clip of the target voice
|
||||
reference_audio = "my_voice_sample.wav"
|
||||
text = "This is my cloned voice saying something new."
|
||||
|
||||
# Qwen3-TTS can reproduce the voice characteristics
|
||||
generate_qwen_tts(text, reference=reference_audio, output_path="cloned.wav")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Benefits
|
||||
|
||||
| Benefit | Description |
|
||||
| ----------------- | -------------------------------------------------------- |
|
||||
| **Speed** | Generate audio faster than playback speed |
|
||||
| **Privacy** | All voice generation runs locally — no cloud APIs |
|
||||
| **Cost** | $0 vs $0.015/1K chars for cloud TTS (Google, ElevenLabs) |
|
||||
| **Voice variety** | 8 Orpheus voices + unlimited Qwen3-TTS voice cloning |
|
||||
| **Multilingual** | Qwen3-TTS supports 10 languages natively |
|
||||
| **Offline** | Works without internet |
|
||||
| **Customizable** | Control emotion, speed, voice characteristics |
|
||||
|
||||
### Cost Comparison (Generate 1 hour of audio)
|
||||
|
||||
| Method                     | Cost                                                | Time             |
| -------------------------- | --------------------------------------------------- | ---------------- |
| ElevenLabs API             | ~$15–30                                             | ~minutes (cloud) |
| Google Cloud TTS           | ~$0.2–0.9 (≈54K characters at $4–16 per 1M)         | ~minutes (cloud) |
| Mac M4 Pro (local)         | $0                                                  | ~60 min          |
| **Razer RTX 5090 (local)** | **$0**                                              | **~15–30 min**   |
|
||||
|
||||
---
|
||||
|
||||
## Skills You'll Learn
|
||||
|
||||
| Skill | What You'll Learn | Career Value |
|
||||
| ------------------------- | ------------------------------------------------------- | ------------------------------ |
|
||||
| **Speech synthesis** | How neural TTS works (text→tokens→mel→audio) | Core speech/NLP skill |
|
||||
| **Audio codecs** | SNAC, Encodec, WAV format, sample rates | Audio engineering fundamentals |
|
||||
| **Voice cloning** | Zero-shot voice cloning techniques | Cutting-edge ML research area |
|
||||
| **Batch processing** | Automating large-scale audio generation | Production engineering |
|
||||
| **GPU memory** | Managing VRAM for concurrent TTS + LLM workloads | ML ops knowledge |
|
||||
| **Audio post-processing** | ffmpeg: concatenation, normalization, format conversion | Multimedia engineering |
|
||||
| **API design** | Building REST APIs around TTS engines | Backend engineering |
|
||||
| **Multilingual NLP** | Cross-language text processing and pronunciation | Global product development |
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Generate test audio with both Orpheus and Qwen3-TTS on Razer
|
||||
- [ ] Create `batch-tts.sh` in `__LOCAL_LLMs/scripts/`
|
||||
- [ ] Build a pre-generated audio library for LysnrAI common phrases
|
||||
- [ ] Experiment with Qwen3-TTS voice cloning using your own voice
|
||||
- [ ] Test whether Qwen3-TTS can handle Tamil (it is not among the 10 languages listed above)
|
||||
- [ ] Measure actual generation speed: words-per-second on each engine
|
||||
@ -0,0 +1,322 @@
|
||||
# 4. Fine-Tuning / Training
|
||||
|
||||
> **RTX 5090:** 24 GB VRAM enables LoRA fine-tuning of 7B–13B models locally
|
||||
> **Why it matters:** Customize models for your specific use cases — coding style, domain knowledge, voice commands
|
||||
|
||||
---
|
||||
|
||||
## What Is Fine-Tuning?
|
||||
|
||||
Fine-tuning takes a pre-trained model (like Llama 3.1 8B) and trains it further on your own data to specialize its behavior. Instead of training from scratch (which costs millions), you adjust a small fraction of the model's weights.
|
||||
|
||||
```
┌──────────────────────────────────────────────────────────────────────┐
│                       Fine-Tuning vs Prompting                       │
│                                                                      │
│  Prompting:    "You are a coding assistant for TypeScript..."        │
│                Works OK, but limited by context window               │
│                Model doesn't truly "learn" your preferences          │
│                                                                      │
│  Fine-Tuning:  Train on 1000s of your code examples                  │
│                Model internalizes your coding patterns               │
│                Better quality, no prompt overhead at inference       │
│                                                                      │
│  LoRA:         Fine-tune only ~0.1–5% of parameters                  │
│                Needs 16–24 GB VRAM (fits RTX 5090)                   │
│                Training time: hours, not days                        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
### Why Not on Mac?
|
||||
|
||||
| Aspect            | Mac M4 Pro (MPS)               | RTX 5090 (CUDA)    |
| ----------------- | ------------------------------ | ------------------ |
| Training support  | Limited MPS support            | Full CUDA + cuDNN  |
| Framework support | PyTorch MPS (some ops missing) | Full PyTorch CUDA  |
| VRAM for training | ~30 GB usable (shared)         | 24 GB dedicated    |
| Memory bandwidth  | ~273 GB/s                      | ~900 GB/s          |
| Training speed    | 5–10× slower than CUDA         | Baseline           |
| LoRA libraries    | Partial compatibility          | Full compatibility |
|
||||
|
||||
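**Training is both compute-intensive and memory-bandwidth-intensive.** The RTX 5090's roughly 0.9–1 TB/s of VRAM bandwidth, combined with mature CUDA kernels, is the basis for the 5–10× estimate over MPS for gradient computation. Treat that range as a rough figure, not a measurement.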
|
||||
|
||||
---
|
||||
|
||||
## Fine-Tuning Methods
|
||||
|
||||
### LoRA (Low-Rank Adaptation) — Recommended
|
||||
|
||||
Trains only small "adapter" matrices (~1–5% of model parameters). The base model stays frozen.
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────┐
|
||||
│ LoRA Architecture │
|
||||
│ │
|
||||
│ Base Model (frozen, ~16 GB) │
|
||||
│ ┌─────────────────────────────────────────────┐ │
|
||||
│ │ Layer 1: [Attention] [FFN] │ │
|
||||
│ │ Layer 2: [Attention] [FFN] │ ← Not modified │
|
||||
│ │ ... │ │
|
||||
│ │ Layer N: [Attention] [FFN] │ │
|
||||
│ └─────────────────────────────────────────────┘ │
|
||||
│ ↕ small adapters (rank 8–64) │
|
||||
│ ┌─────────────────────────────────────────────┐ │
|
||||
│ │ LoRA Adapter A (64 KB per layer) │ ← Trainable │
|
||||
│ │ LoRA Adapter B (64 KB per layer) │ ← Trainable │
|
||||
│ └─────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ Total trainable params: ~10–50 MB (vs 8–16 GB base) │
|
||||
│ VRAM needed: ~18–22 GB for 7B model │
|
||||
│ Training time: ~1–4 hours for 1000 examples │
|
||||
│ │
|
||||
└──────────────────────────────────────────────────────────────────────┘
|
||||
```
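As a rough worked example of how small the adapters are, the sketch below counts the parameters LoRA adds. It assumes Llama 3.1 8B's published shapes (hidden size 4096, key/value projection width 1024, 32 layers) and rank-16 adapters on the four attention projections. The helper name `lora_params` is illustrative and not part of any library.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 4096   # assumed hidden size (Llama 3.1 8B)
KV_DIM = 1024   # assumed key/value width (8 KV heads x 128 dims)
LAYERS = 32     # assumed number of transformer layers
RANK = 16

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
adapter_mb = total * 2 / 1e6  # 2 bytes per parameter at 16-bit precision

print(f"Trainable LoRA parameters: {total:,}")
print(f"Adapter size at 16-bit: {adapter_mb:.1f} MB")
```

Doubling the rank doubles both numbers, which is why rank is the main knob for trading adapter capacity against VRAM.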
|
||||
|
||||
### QLoRA (Quantized LoRA) — For Larger Models
|
||||
|
||||
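Loads the base model in 4-bit quantization and trains the LoRA adapters in 16-bit precision. The frozen weights take about a quarter of their FP16 size, which cuts total training VRAM by half or more (see the table below).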
|
||||
|
||||
| Method            | 7B Model VRAM | 13B Model VRAM | 32B Model VRAM |
| ----------------- | ------------- | -------------- | -------------- |
| Full fine-tune    | ~56 GB        | ~104 GB        | ~256 GB        |
| LoRA (FP16)       | ~18 GB        | ~32 GB         | ~72 GB         |
| **QLoRA (4-bit)** | **~8 GB**     | **~14 GB**     | **~22 GB**     |
|
||||
|
||||
**QLoRA at 32B fits in 24 GB VRAM**, which is remarkable for a laptop. At ~22 GB it leaves little headroom, so expect to use short sequences and a batch size of 1.
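To see where the savings come from, here is a minimal sketch that estimates the memory taken by the model weights alone at different precisions. It deliberately ignores activations, optimizer state, and quantization metadata, so its results are lower than the totals in the table. The function name `weight_memory_gb` is an illustrative assumption.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9


for size in (7, 13, 32):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size}B  FP16 weights: {fp16:.1f} GB   4-bit weights: {four_bit:.1f} GB")
```

The gap between these weight-only figures and the table is the training overhead, which grows with batch size and sequence length.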
|
||||
|
||||
---
|
||||
|
||||
## How to Use It
|
||||
|
||||
### Setup: Install Training Libraries
|
||||
|
||||
```bash
|
||||
# Create a dedicated venv for training (kept separate from the TTS venv)
|
||||
cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
|
||||
python3.12 -m venv .venv-training
|
||||
source .venv-training/bin/activate
|
||||
|
||||
# Install training stack
|
||||
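# RTX 50-series (Blackwell) needs the CUDA 12.8 wheels; the cu124 wheels do not include kernels for it
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128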
|
||||
pip install transformers datasets accelerate peft bitsandbytes trl
|
||||
pip install wandb # optional: experiment tracking
|
||||
```
|
||||
|
||||
### Example 1: Fine-Tune Llama 3.1 8B for Your Coding Style
|
||||
|
||||
```python
"""Fine-tune Llama 3.1 8B with QLoRA (4-bit base model + LoRA adapters) on your coding examples."""
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

# 1. Prepare your data
# Format: instruction → response pairs from YOUR code
training_data = [
    {
        "instruction": "Write a Fastify route that creates a user with Zod validation",
        "response": """import { z } from 'zod';
import type { FastifyInstance } from 'fastify';

const CreateUserSchema = z.object({
  email: z.string().email(),
  name: z.string().min(1).max(100),
  productId: z.string(),
});

export async function userRoutes(app: FastifyInstance) {
  app.post('/users', async (req, reply) => {
    const body = CreateUserSchema.parse(req.body);
    const user = await req.server.userRepository.create(body);
    return reply.status(201).send(user);
  });
}"""
    },
    # Add 100–1000 more examples from your actual codebase
]

dataset = Dataset.from_list([
    {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"}
    for d in training_data
])

# 2. Load model in 4-bit (QLoRA)
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # batching needs a pad token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training before attaching adapters
model = prepare_model_for_kbit_training(model)

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,            # Rank (higher = more capacity, more VRAM)
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~13.6M / total: ~8B (~0.17%)

# 4. Train
# Argument names follow recent trl releases. Older releases took tokenizer=,
# dataset_text_field=, and max_seq_length= on SFTTrainer directly.
training_args = SFTConfig(
    output_dir="./lora-llama-coding",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=10,
    logging_steps=10,
    save_steps=100,
    bf16=True,                  # matches the bfloat16 compute dtype above
    dataset_text_field="text",
    # Sequence cap: max_length=2048 in recent trl (max_seq_length in older releases)
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,
)

trainer.train()

# 5. Save the adapter (small file, ~50 MB)
model.save_pretrained("./lora-llama-coding")
tokenizer.save_pretrained("./lora-llama-coding")
```
|
||||
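The training arguments in Example 1 (batch size 2, gradient accumulation 4, 3 epochs) determine how many optimizer updates a run performs. The sketch below works that arithmetic for an assumed dataset of 1,000 examples. `optimizer_steps` is an illustrative helper, and the real trainer may round slightly differently.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective, steps = optimizer_steps(1000, per_device_batch=2, grad_accum=4, epochs=3)
print(f"Effective batch size: {effective}")
print(f"Total optimizer steps: {steps}")
```

With `save_steps=100`, a run of this length would write a checkpoint roughly every quarter of the way through, which is useful to know before starting a multi-hour job.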
|
||||
### Example 2: Fine-Tune for LysnrAI Voice Commands
|
||||
|
||||
```python
|
||||
# Train a model to understand your specific voice command patterns
|
||||
training_data = [
|
||||
{"instruction": "Parse: remind me to call john tomorrow at 3pm",
|
||||
"response": '{"action": "reminder", "contact": "john", "time": "tomorrow 3pm", "task": "call"}'},
|
||||
{"instruction": "Parse: add milk to my grocery list",
|
||||
"response": '{"action": "add_item", "list": "grocery", "item": "milk"}'},
|
||||
{"instruction": "Parse: summarize my meeting notes from yesterday",
|
||||
"response": '{"action": "summarize", "source": "meeting_notes", "date": "yesterday"}'},
|
||||
# ... hundreds more examples
|
||||
]
|
||||
```
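Because the model learns to reproduce the responses exactly, a malformed JSON string in the training set teaches it to emit malformed JSON. A minimal sketch of a pre-training check follows. It assumes every parsed command carries an `"action"` field, as in the examples above; `find_invalid_pairs` and the two deliberately broken samples are illustrative.

```python
import json

REQUIRED_KEY = "action"  # assumption: every parsed command has an "action" field


def find_invalid_pairs(pairs):
    """Return (index, reason) for every pair whose response is not usable."""
    problems = []
    for i, pair in enumerate(pairs):
        try:
            parsed = json.loads(pair["response"])
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(parsed, dict) or REQUIRED_KEY not in parsed:
            problems.append((i, f"missing '{REQUIRED_KEY}' key"))
    return problems


sample = [
    {"instruction": "Parse: add milk to my grocery list",
     "response": '{"action": "add_item", "list": "grocery", "item": "milk"}'},
    {"instruction": "Parse: play some jazz",
     "response": '{"action": "play_music", "genre": "jazz"'},  # missing closing brace
    {"instruction": "Parse: what time is it",
     "response": '{"intent": "time_query"}'},                  # no "action" key
]

for index, reason in find_invalid_pairs(sample):
    print(f"pair {index}: {reason}")
```

Running a check like this before each training run is cheap compared with discovering the problem after hours of fine-tuning.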
|
||||
|
||||
### Example 3: Convert LoRA to Ollama Model
|
||||
|
||||
After training, merge the LoRA adapter and convert to GGUF for Ollama:
|
||||
|
||||
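```bash
# Merge LoRA adapter back into base model
# (run from the directory that contains ./lora-llama-coding)
python -c "
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained('./lora-llama-coding', torch_dtype=torch.float16)
merged = model.merge_and_unload()
merged.save_pretrained('./merged-llama-coding')
# The GGUF converter also needs the tokenizer files
AutoTokenizer.from_pretrained('./lora-llama-coding').save_pretrained('./merged-llama-coding')
"

# Convert to GGUF (requires llama.cpp; assumes it is cloned and built with CMake at ~/llama.cpp)
# convert_hf_to_gguf.py writes f32/f16/bf16/q8_0 only; K-quants such as Q4_K_M come from llama-quantize
python ~/llama.cpp/convert_hf_to_gguf.py ./merged-llama-coding --outtype f16 --outfile merged-llama-coding-f16.gguf
~/llama.cpp/build/bin/llama-quantize merged-llama-coding-f16.gguf merged-llama-coding.gguf Q4_K_M

# Create Ollama model
cat > Modelfile <<EOF
FROM ./merged-llama-coding.gguf
SYSTEM "You are a TypeScript coding assistant specialized in Fastify, Zod, and Azure Cosmos DB."
EOF

ollama create my-coding-model -f Modelfile
ollama run my-coding-model
```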
|
||||
|
||||
---
|
||||
|
||||
## What Models Can You Fine-Tune on 24 GB VRAM?
|
||||
|
||||
| Model         | Method      | VRAM Usage | Feasible?      | Training Time (1K examples) |
| ------------- | ----------- | ---------- | -------------- | --------------------------- |
| Llama 3.1 8B  | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
| Qwen 2.5 7B   | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
| Llama 3.1 8B  | LoRA FP16   | ~18 GB     | ✅ Fits        | ~2–3 hours                  |
| Mistral 7B    | QLoRA 4-bit | ~8 GB      | ✅ Comfortable | ~1–2 hours                  |
| Llama 2 13B   | QLoRA 4-bit | ~14 GB     | ✅ Fits        | ~3–5 hours                  |
| CodeLlama 34B | QLoRA 4-bit | ~22 GB     | ⚠️ Tight       | ~6–10 hours                 |
| Llama 3.1 70B | QLoRA 4-bit | ~40 GB     | ❌ Too large   | Need multi-GPU              |
|
||||
|
||||
---
|
||||
|
||||
## Benefits
|
||||
|
||||
| Benefit | Description |
|
||||
| -------------------- | ------------------------------------------------------------------- |
|
||||
| **Personalization** | Model learns YOUR coding style, patterns, and conventions |
|
||||
| **Domain expertise** | Train on your project's specific API patterns, schemas, terminology |
|
||||
| **Smaller + faster** | A fine-tuned 7B can outperform a generic 32B on your specific tasks |
|
||||
| **Privacy** | Your training data never leaves your machine |
|
||||
| **Cost** | $0 vs $100–1000s for cloud fine-tuning (OpenAI, Together AI) |
|
||||
| **Iteration speed** | Train → test → adjust in hours, not days |
|
||||
| **Portable output** | Export to GGUF/Ollama — runs anywhere |
|
||||
|
||||
---
|
||||
|
||||
## Skills You'll Learn
|
||||
|
||||
| Skill | What You'll Learn | Career Value |
|
||||
| -------------------------- | -------------------------------------------------------- | --------------------------------------------- |
|
||||
| **LoRA / QLoRA** | Parameter-efficient fine-tuning techniques | Top ML engineering skill (2024–2026 standard) |
|
||||
| **Hugging Face ecosystem** | transformers, datasets, peft, trl, accelerate | Industry-standard ML tooling |
|
||||
| **Training loops** | Loss functions, learning rates, gradient accumulation | Core ML fundamentals |
|
||||
| **Data preparation** | Curating, cleaning, formatting training datasets | Critical for any ML project |
|
||||
| **Quantization** | 4-bit, 8-bit, FP16 — trade-offs and techniques | Essential for deployment |
|
||||
| **Model evaluation** | Perplexity, human eval, A/B testing fine-tuned vs base | ML product development |
|
||||
| **VRAM management** | Gradient checkpointing, mixed precision, batch sizing | GPU optimization |
|
||||
| **Model merging** | Merging LoRA adapters, converting to GGUF | ML deployment pipeline |
|
||||
| **Experiment tracking** | Weights & Biases, training curves, hyperparameter tuning | Professional ML workflow |
|
||||
|
||||
---
|
||||
|
||||
## Training Data Sources for Your Projects
|
||||
|
||||
| Source | Data Type | Fine-Tune Goal |
|
||||
| ---------------------- | ---------------------- | ------------------------ |
|
||||
| Your GitHub repos | TypeScript/Python code | Coding style model |
|
||||
| LysnrAI voice commands | Command → JSON pairs | Voice command parser |
|
||||
| MindLyst triage logs | Input → categorization | Content triage model |
|
||||
| Your commit messages | Diff → message pairs | Commit message generator |
|
||||
| Your code reviews | Code → feedback pairs | Code review assistant |
|
||||
| Slack/Teams messages | Conversations | Writing style model |
|
||||
|
||||
### Extracting Training Data from Your Repos
|
||||
|
||||
```bash
|
||||
# List TypeScript files that contain exports (candidate sources for training examples)
find ~/code/mygh/learning_ai_common_plat -name "*.ts" -not -path "*/node_modules/*" -exec grep -l "export" {} \; | head -20

# Extract commit messages paired with diff summaries
cd ~/code/mygh/learning_ai_common_plat
git log -100 --format="%H %s" | while read -r hash msg; do
  echo "=== $msg ==="
  git show --stat --format= "$hash"
  echo ""
done
|
||||
```
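Once raw material is extracted, it has to be stored as instruction/response pairs that the training script can load. A minimal sketch of that step is below, writing one JSON object per line (JSONL). The helper names and the sample commit text are hypothetical and chosen only for illustration.

```python
import json


def make_example(instruction: str, response: str) -> dict:
    """Build one training pair in the instruction/response shape used above."""
    return {"instruction": instruction.strip(), "response": response.strip()}


def to_jsonl(examples) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


def from_jsonl(text: str) -> list:
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical pair built from a diff summary and its commit message
examples = [
    make_example(
        "Write a commit message for this change:\n 2 files changed, 14 insertions(+)",
        "docs(local-llms): add fine-tuning guide",
    ),
]

jsonl = to_jsonl(examples)
print(jsonl)
```

The resulting list of dicts has the same shape as `training_data` in Example 1, so it can be passed through the same formatting step.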
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Install training libraries in a dedicated venv
|
||||
- [ ] Collect 100+ instruction/response pairs from your codebase
|
||||
- [ ] Run a QLoRA fine-tune of Llama 3.1 8B on your coding examples
|
||||
- [ ] Evaluate: compare fine-tuned model vs base model on 20 test prompts
|
||||
- [ ] Convert to GGUF and serve through Ollama
|
||||
- [ ] Fine-tune a voice command parser for LysnrAI
|
||||
- [ ] Experiment with different LoRA ranks (8, 16, 32, 64) and measure quality
|
||||
@ -0,0 +1,382 @@
|
||||
# 5. CUDA / TensorRT / ML Research
|
||||
|
||||
> **RTX 5090:** Full NVIDIA toolchain — CUDA 13.x, cuDNN, TensorRT, Triton
|
||||
> **Why it matters:** Most ML papers, frameworks, and production systems are CUDA-first. This is the industry-standard GPU compute platform.
|
||||
|
||||
---
|
||||
|
||||
## What Is the NVIDIA ML Toolchain?
|
||||
|
||||
NVIDIA provides a layered stack of tools that turn your GPU into a general-purpose compute engine:
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────┐
|
||||
│ NVIDIA ML Toolchain (RTX 5090) │
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Your Code (Python / C++ / TypeScript) │ │
|
||||
│ └────────────────────────┬────────────────────────────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||||
│ │ ML Frameworks │ │
|
||||
│ │ PyTorch · TensorFlow · JAX · ONNX Runtime │ │
|
||||
│ └────────────────────────┬────────────────────────────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||||
│ │ NVIDIA Libraries │ │
|
||||
│ │ TensorRT · Triton · cuDNN · cuBLAS · NCCL │ │
|
||||
│ └────────────────────────┬────────────────────────────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||||
│ │ CUDA Runtime + Driver │ │
|
||||
│ │ CUDA 13.x · GPU scheduling · memory management │ │
|
||||
│ └────────────────────────┬────────────────────────────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Hardware │ │
|
||||
│ │ RTX 5090: ~10K CUDA cores · 24 GB GDDR7 · Tensor cores │ │
|
||||
│ └─────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└──────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Component Breakdown
|
||||
|
||||
| Component | What It Does | Why You Need It |
|
||||
| ------------------- | ----------------------------------------------------------- | -------------------------------- |
|
||||
| **CUDA** | General-purpose GPU programming | Foundation for everything |
|
||||
| **cuDNN** | Optimized neural network primitives (conv, attention, etc.) | 2–5× faster training/inference |
|
||||
| **TensorRT** | Model optimization + inference engine | 2–4× faster deployment inference |
|
||||
| **Triton** (NVIDIA) | Inference server for serving models at scale | Production model serving |
|
||||
| **Triton** (OpenAI) | GPU kernel compiler (write custom GPU kernels in Python) | Research + custom ops |
|
||||
| **cuBLAS** | Optimized matrix multiplication | Core of all neural network math |
|
||||
| **NCCL** | Multi-GPU communication | Distributed training (future) |
|
||||
|
||||
---
|
||||
|
||||
## How to Set Up
|
||||
|
||||
### 1. CUDA Toolkit (Inside WSL2)
|
||||
|
||||
```bash
|
||||
# Check if CUDA is already available (from Windows driver passthrough)
|
||||
nvidia-smi
|
||||
nvcc --version # CUDA compiler
|
||||
|
||||
# If nvcc is not found, install CUDA toolkit
|
||||
# (nvidia-smi works from driver passthrough, but nvcc needs the toolkit)
|
||||
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
|
||||
sudo dpkg -i cuda-keyring_1.1-1_all.deb
|
||||
sudo apt update
|
||||
sudo apt install -y cuda-toolkit-12-8  # RTX 50-series (Blackwell) needs CUDA 12.8 or newer

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify
nvcc --version
# Should show: CUDA 12.8+
|
||||
```
|
||||
|
||||
### 2. cuDNN
|
||||
|
||||
```bash
|
||||
# cuDNN is usually bundled with PyTorch, but for custom builds:
|
||||
sudo apt install -y cudnn9-cuda-12  # cuDNN 9 for CUDA 12.x (cuDNN 8 predates RTX 50-series support)
|
||||
|
||||
# Verify in Python
|
||||
python3 -c "import torch; print(f'cuDNN: {torch.backends.cudnn.version()}')"
|
||||
```
|
||||
|
||||
### 3. TensorRT
|
||||
|
||||
```bash
|
||||
# Install TensorRT
|
||||
pip install tensorrt
|
||||
|
||||
# Or via apt for system-wide
|
||||
sudo apt install -y tensorrt
|
||||
|
||||
# Verify
|
||||
python3 -c "import tensorrt; print(f'TensorRT: {tensorrt.__version__}')"
|
||||
```
|
||||
|
||||
### 4. PyTorch with Full CUDA Support
|
||||
|
||||
```bash
|
||||
# Create a research environment
|
||||
python3.12 -m venv ~/.venv-ml-research
|
||||
source ~/.venv-ml-research/bin/activate
|
||||
|
||||
# Install PyTorch with CUDA 12.8 (RTX 50-series GPUs are not supported by the cu124 wheels)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
|
||||
|
||||
# Verify
|
||||
python3 -c "
|
||||
import torch
|
||||
print(f'PyTorch: {torch.__version__}')
|
||||
print(f'CUDA available: {torch.cuda.is_available()}')
|
||||
print(f'GPU: {torch.cuda.get_device_name(0)}')
|
||||
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
|
||||
print(f'cuDNN: {torch.backends.cudnn.version()}')
|
||||
"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## How to Use It
|
||||
|
||||
### CUDA Programming Basics (Python)
|
||||
|
||||
```python
|
||||
"""Your first CUDA program — matrix multiplication on the GPU."""
|
||||
import torch
|
||||
import time
|
||||
|
||||
device = torch.device("cuda")
|
||||
|
||||
# Create two large matrices on the GPU
|
||||
A = torch.randn(4096, 4096, device=device)
|
||||
B = torch.randn(4096, 4096, device=device)
|
||||
|
||||
# Warm up
|
||||
_ = torch.mm(A, B)
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Benchmark
|
||||
start = time.perf_counter()
|
||||
for _ in range(100):
    C = torch.mm(A, B)
torch.cuda.synchronize()
|
||||
elapsed = time.perf_counter() - start
|
||||
|
||||
tflops = (2 * 4096**3 * 100) / elapsed / 1e12
|
||||
print(f"Matrix multiply: {elapsed:.3f}s for 100 iterations")
|
||||
print(f"Throughput: {tflops:.1f} TFLOPS")
|
||||
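# Results depend on clocks, power limits, and precision. FP32 matmul typically lands in
# the tens of TFLOPS; FP16/BF16 inputs use the tensor cores and run several times faster.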
|
||||
```
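The throughput line in the benchmark relies on the fact that multiplying two n×n matrices takes about 2·n³ floating-point operations (n³ multiplies and roughly n³ additions). A small sketch of that formula in isolation follows. `matmul_tflops` is an illustrative helper, and the timing passed to it is a made-up value, not a measurement.

```python
def matmul_tflops(n: int, iterations: int, seconds: float) -> float:
    """Throughput for `iterations` square n x n matmuls finished in `seconds`."""
    flops_per_matmul = 2 * n**3  # about n^3 multiplies plus n^3 additions
    return flops_per_matmul * iterations / seconds / 1e12


# Hypothetical timing, for illustration only: 100 iterations in 0.5 s
print(f"{matmul_tflops(4096, 100, 0.5):.1f} TFLOPS")
```

Keeping the formula in a function makes it easy to compare runs at different matrix sizes or precisions using the same units.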
|
||||
|
||||
### TensorRT: Optimize a Model for Faster Inference
|
||||
|
||||
```python
"""Convert a PyTorch model to TensorRT for 2–4× faster inference."""
import time

import torch
import torch_tensorrt

# Load a model in FP16 so its weights match the FP16 inputs below
model = torch.hub.load('pytorch/vision', 'resnet50', weights='IMAGENET1K_V2').eval().half().cuda()

# Compile with TensorRT
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float16)],
    enabled_precisions={torch.float16},
)

# Benchmark
input_tensor = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)


def benchmark(fn, iterations=1000, warmup=20):
    """Time `iterations` forward passes after a short warm-up."""
    with torch.no_grad():
        for _ in range(warmup):
            fn(input_tensor)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iterations):
            fn(input_tensor)
        torch.cuda.synchronize()
        return time.perf_counter() - start


pytorch_time = benchmark(model)   # PyTorch baseline (same FP16 precision)
trt_time = benchmark(trt_model)   # TensorRT optimized

print(f"PyTorch: {pytorch_time:.3f}s")
print(f"TensorRT: {trt_time:.3f}s")
print(f"Speedup: {pytorch_time/trt_time:.1f}×")
```
|
||||
|
||||
### Reproducing ML Research Papers
|
||||
|
||||
Most ML papers provide CUDA-only code. The RTX 5090 lets you run them directly:
|
||||
|
||||
```bash
|
||||
# Example: Run a recent paper's code
|
||||
git clone https://github.com/some-researcher/cool-new-model.git
|
||||
cd cool-new-model
|
||||
|
||||
# Typical requirements
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Run training/evaluation
|
||||
python train.py --device cuda --epochs 10 --batch-size 16
|
||||
|
||||
# This would NOT work on Mac (CUDA-only dependencies)
|
||||
```
|
||||
|
||||
### Custom CUDA Kernels with Triton (OpenAI)
|
||||
|
||||
```python
|
||||
"""Write a custom GPU kernel in Python using Triton."""
|
||||
import triton
|
||||
import triton.language as tl
|
||||
import torch
|
||||
|
||||
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    """Simple vector addition kernel running on GPU."""
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
|
||||
|
||||
# Run the kernel
|
||||
n = 1_000_000
|
||||
x = torch.randn(n, device="cuda")
|
||||
y = torch.randn(n, device="cuda")
|
||||
output = torch.empty_like(x)
|
||||
|
||||
grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
|
||||
add_kernel[grid](x, y, output, n, BLOCK_SIZE=1024)
|
||||
|
||||
print(f"Result correct: {torch.allclose(output, x + y)}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Real-World Use Cases for Your Projects
|
||||
|
||||
### 1. LysnrAI Inference Optimization
|
||||
|
||||
Convert your Whisper or TTS models to TensorRT for even faster inference:
|
||||
|
||||
```python
|
||||
# Optimize Whisper encoder with TensorRT
|
||||
# This can give another 2× speedup on top of whisper.cpp CUDA
|
||||
```
|
||||
|
||||
### 2. Custom Embedding Models
|
||||
|
||||
Train domain-specific embedding models for LysnrAI's semantic search:
|
||||
|
||||
```python
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
|
||||
|
||||
# Encode your documents
|
||||
documents = ["meeting notes about Q1 budget", "grocery list for weekend"]  # add your own documents
|
||||
embeddings = model.encode(documents, batch_size=64, show_progress_bar=True)
|
||||
# ~1000 documents/second on RTX 5090
|
||||
```
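Semantic search then comes down to comparing a query embedding with each document embedding, usually by cosine similarity. The sketch below shows that ranking step in plain Python. The 3-dimensional vectors are toy stand-ins for real embeddings, and `rank_documents` is an illustrative name.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs) -> list:
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (illustrative values only)
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
]
query_vec = [0.8, 0.2, 0.1]

print(rank_documents(query_vec, doc_vecs))
```

For large collections a vectorized or indexed implementation would replace the Python loops, but the ranking logic stays the same.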
|
||||
|
||||
### 3. Reproduce and Experiment with New Models
|
||||
|
||||
When a new paper drops (GPT-5 alternatives, new TTS models, new architectures):
|
||||
|
||||
```bash
|
||||
# Clone → install → run — no CUDA compatibility issues
|
||||
git clone https://github.com/new-cool-model
|
||||
pip install -r requirements.txt
|
||||
python evaluate.py --device cuda
|
||||
```
|
||||
|
||||
### 4. Benchmarking and Profiling
|
||||
|
||||
```python
|
||||
# Profile GPU usage during inference
|
||||
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    model(input_tensor)
|
||||
|
||||
print(prof.key_averages().table(sort_by="cuda_time_total"))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Benefits
|
||||
|
||||
| Benefit | Description |
|
||||
| ------------------------- | --------------------------------------------------------- |
|
||||
| **Industry standard**     | Most ML research and production code targets CUDA first   |
|
||||
| **Framework support** | PyTorch, TensorFlow, JAX — all CUDA-first |
|
||||
| **Paper reproduction** | Run any ML paper's code without compatibility issues |
|
||||
| **TensorRT optimization** | 2–4× faster inference on optimized models |
|
||||
| **Custom kernels** | Write GPU code in Python (Triton) or C++ (CUDA) |
|
||||
| **Profiling tools** | nvidia-smi, Nsight, PyTorch profiler — rich debugging |
|
||||
| **Production parity** | Code runs identically on cloud GPU instances (A100, H100) |
|
||||
|
||||
---
|
||||
|
||||
## Skills You'll Learn
|
||||
|
||||
| Skill | What You'll Learn | Career Value |
|
||||
| ------------------------- | ----------------------------------------------------- | ---------------------------------- |
|
||||
| **CUDA fundamentals** | GPU memory model, kernel launches, synchronization | Core ML infrastructure skill |
|
||||
| **TensorRT** | Model optimization, quantization, graph fusion | Production ML deployment |
|
||||
| **Triton kernels** | Custom GPU programming in Python | Research + performance engineering |
|
||||
| **PyTorch profiling** | Identifying bottlenecks, optimizing training loops | Essential ML engineering |
|
||||
| **cuDNN** | Optimized neural network operations | Framework-level understanding |
|
||||
| **Mixed precision** | FP16/BF16 training, loss scaling, numerical stability | Modern training standard |
|
||||
| **GPU memory management** | Memory pools, caching allocator, OOM debugging | Practical ML engineering |
|
||||
| **Model optimization** | Graph optimization, operator fusion, quantization | ML deployment pipeline |
|
||||
| **Benchmark design** | Fair comparison methodology, statistical significance | Research methodology |
|
||||
|
||||
### Career Impact
|
||||
|
||||
CUDA proficiency is one of the most valuable ML engineering skills. Here's where it maps:
|
||||
|
||||
| Role | How CUDA Skills Apply |
|
||||
| --------------------- | ----------------------------------------------------- |
|
||||
| **ML Engineer** | Optimize training pipelines, reduce inference latency |
|
||||
| **AI Infrastructure** | Build and maintain GPU clusters, optimize throughput |
|
||||
| **Research Engineer** | Implement custom operations for novel architectures |
|
||||
| **MLOps** | TensorRT deployment, GPU monitoring, autoscaling |
|
||||
| **Full-Stack AI** | End-to-end: train → optimize → serve → monitor |
|
||||
|
||||
---
|
||||
|
||||
## Monitoring and Debugging
|
||||
|
||||
```bash
|
||||
# Real-time GPU monitoring
|
||||
watch -n 0.5 nvidia-smi
|
||||
|
||||
# Detailed GPU info
|
||||
nvidia-smi -q
|
||||
|
||||
# GPU process list
|
||||
nvidia-smi pmon
|
||||
|
||||
# Install nvtop (interactive GPU monitor)
|
||||
sudo apt install nvtop
|
||||
nvtop
|
||||
|
||||
# PyTorch memory debugging
|
||||
python3 -c "
|
||||
import torch
|
||||
print(torch.cuda.memory_summary(device=0, abbreviated=True))
|
||||
"
|
||||
```
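For scripted monitoring, `nvidia-smi` can print machine-readable values with `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits`. The sketch below parses text in that shape. The sample string is made up for illustration, and `parse_gpu_memory` is a hypothetical helper name.

```python
def parse_gpu_memory(csv_text: str) -> list:
    """Parse `memory.used, memory.total` rows (in MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "used_mib": used,
            "total_mib": total,
            "percent": round(100 * used / total, 1),
        })
    return gpus


# Sample text in the shape produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
sample_output = "6144, 24576\n"

print(parse_gpu_memory(sample_output))
```

In a real script the sample string would be replaced by the captured output of the command, for example via `subprocess.run`.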
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Install CUDA toolkit + cuDNN in WSL2
|
||||
- [ ] Verify PyTorch CUDA with a matrix multiply benchmark
|
||||
- [ ] Run a TensorRT optimization on a simple model
|
||||
- [ ] Write a Triton kernel (vector add → custom attention)
|
||||
- [ ] Profile an Ollama inference request with nvidia-smi
|
||||
- [ ] Try reproducing a recent ML paper from GitHub
|
||||
- [ ] Benchmark: PyTorch vs TensorRT inference speed for Whisper
|
||||
@ -0,0 +1,325 @@
|
||||
# 6. Stable Diffusion / Image Generation
|
||||
|
||||
> **RTX 5090:** 5–8 seconds per image (SDXL) vs ~30 seconds on Mac
|
||||
> **Why it matters:** Generate UI mockups, app icons, marketing assets, concept art — all locally and free
|
||||
|
||||
---
|
||||
|
||||
## What Is Stable Diffusion?
|
||||
|
||||
Stable Diffusion is an open-source text-to-image AI model. You describe what you want in plain English, and it generates a high-quality image in seconds. It runs entirely on your GPU.
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────┐
|
||||
│ Stable Diffusion Pipeline │
|
||||
│ │
|
||||
│ "A modern app dashboard with dark theme and blue accents" │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ [CLIP Text Encoder] → text embeddings │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ [U-Net: iterative denoising × 20–50 steps] ← GPU-intensive │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ [VAE Decoder] → pixel image │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 1024×1024 PNG image │
|
||||
│ │
|
||||
│ RTX 5090: ~5–8s per image (SDXL) │
|
||||
│ Mac M4 Pro: ~25–35s per image (SDXL via MPS) │
|
||||
│ │
|
||||
└──────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance: Mac vs Razer
|
||||
|
||||
| Model | Resolution | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
|
||||
| ---------------- | ---------- | ---------------- | --------------------- | ------- |
|
||||
| SD 1.5 | 512×512 | ~8–12s | ~1–2s | ~5× |
|
||||
| SDXL | 1024×1024 | ~25–35s | ~5–8s | ~4× |
|
||||
| SDXL Turbo | 1024×1024 | ~8–12s | ~1–3s | ~4× |
|
||||
| FLUX.1 [dev] | 1024×1024 | ~60–90s | ~10–20s | ~5× |
|
||||
| FLUX.1 [schnell] | 1024×1024 | ~15–25s | ~3–6s | ~4× |
|
||||
|
||||
### VRAM Requirements
|
||||
|
||||
| Model | VRAM Needed | Fits in 24 GB? |
|
||||
| ----------------- | ----------- | -------------- |
|
||||
| SD 1.5 | ~4 GB | ✅ Easily |
|
||||
| SDXL | ~7 GB | ✅ Easily |
|
||||
| SDXL + ControlNet | ~10 GB | ✅ Yes |
|
||||
| FLUX.1 [dev] | ~12 GB | ✅ Yes |
|
||||
| FLUX.1 + LoRA | ~14 GB | ✅ Yes |
|
||||
| SD3 Medium | ~12 GB | ✅ Yes |
|
||||
|
||||
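**With 24 GB of VRAM, every model in the table above fits.** The figures for the larger models assume reduced-precision weights (for example FP8 for FLUX.1); full-precision variants can need more.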
|
||||
|
||||
---
|
||||
|
||||
## How to Set Up
|
||||
|
||||
### Option A: ComfyUI (Node-Based — Recommended)
|
||||
|
||||
ComfyUI is a powerful node-based UI for Stable Diffusion. It's flexible, efficient, and well-suited for automation.
|
||||
|
||||
```bash
|
||||
# Clone ComfyUI
|
||||
cd ~
|
||||
git clone https://github.com/comfyanonymous/ComfyUI.git
|
||||
cd ComfyUI
|
||||
|
||||
# Create venv and install
|
||||
python3.12 -m venv venv
|
||||
source venv/bin/activate
|
||||
# RTX 50-series (Blackwell) needs the CUDA 12.8 wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Download SDXL model (~6.5 GB)
|
||||
mkdir -p models/checkpoints
|
||||
cd models/checkpoints
|
||||
wget "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
|
||||
|
||||
# Start ComfyUI
|
||||
cd ~/ComfyUI
|
||||
python main.py
|
||||
|
||||
# Open in browser: http://localhost:8188
|
||||
# Accessible from Windows browser via WSL2 port forwarding
|
||||
```
|
||||
|
||||
### Option B: Automatic1111 (Classic Web UI)
|
||||
|
||||
```bash
|
||||
# Clone A1111
|
||||
cd ~
|
||||
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
|
||||
cd stable-diffusion-webui
|
||||
|
||||
# Launch (auto-installs deps on first run)
|
||||
bash webui.sh --listen --xformers
|
||||
|
||||
# Open: http://localhost:7860
|
||||
```
|
||||
|
||||
### Option C: Python Script (No UI)
|
||||
|
||||
```python
|
||||
"""Generate images with Stable Diffusion from Python."""
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
|
||||
# Load SDXL
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16",
|
||||
).to("cuda")
|
||||
|
||||
# Memory optimizations: PyTorch 2's built-in attention is used by default.
# Optional, only if the xformers package is installed:
# pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# Generate
|
||||
image = pipe(
|
||||
prompt="A sleek dark-themed mobile app dashboard showing AI brain categories, "
|
||||
"neon blue and teal accents, glassmorphism cards, modern UI design",
|
||||
negative_prompt="blurry, low quality, text, watermark",
|
||||
num_inference_steps=30,
|
||||
guidance_scale=7.5,
|
||||
width=1024,
|
||||
height=1024,
|
||||
).images[0]
|
||||
|
||||
image.save("dashboard_concept.png")
|
||||
print("Generated: dashboard_concept.png")
|
||||
```
|
||||
|
||||
### Batch Generation Script
|
||||
|
||||
```python
|
||||
"""Generate multiple variations of an image concept."""
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
import time
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16",
|
||||
).to("cuda")
|
||||
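# Optional, only if the xformers package is installed
# (PyTorch 2's built-in attention is used by default):
# pipe.enable_xformers_memory_efficient_attention()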
|
||||
|
||||
prompts = [
|
||||
"App icon for LysnrAI, microphone with sound waves, dark background, modern gradient",
|
||||
"App icon for MindLyst, brain with neural connections, dark background, blue teal gradient",
|
||||
"Feature graphic for LysnrAI, voice waveform visualization, dark theme, 1024x500",
|
||||
"Splash screen, abstract sound waves, dark navy background, teal highlights",
|
||||
"Dashboard mockup, dark theme, cards with charts, sidebar navigation, modern UI",
|
||||
]
|
||||
|
||||
for i, prompt in enumerate(prompts):
|
||||
start = time.time()
|
||||
image = pipe(
|
||||
prompt=prompt,
|
||||
negative_prompt="blurry, low quality, text, watermark, ugly",
|
||||
num_inference_steps=30,
|
||||
guidance_scale=7.5,
|
||||
width=1024,
|
||||
height=1024,
|
||||
generator=torch.Generator("cuda").manual_seed(42 + i),
|
||||
).images[0]
|
||||
|
||||
elapsed = time.time() - start
|
||||
filename = f"generated_{i:02d}.png"
|
||||
image.save(filename)
|
||||
print(f"[{i+1}/{len(prompts)}] {filename} ({elapsed:.1f}s)")
|
||||
```
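
The batch loop derives each seed as `42 + i` and each filename from the prompt index, so a run can be planned, and resumed after an interruption, without touching the GPU. The sketch below shows only that bookkeeping with the standard library; `plan_jobs` and its `out_dir` parameter are illustrative names, not part of diffusers.

```python
"""Plan a batch run: one (prompt, seed, path) job per prompt."""
from pathlib import Path


def plan_jobs(prompts, base_seed=42, out_dir="."):
    """Return jobs for prompts whose output file does not exist yet."""
    jobs = []
    for i, prompt in enumerate(prompts):
        path = Path(out_dir) / f"generated_{i:02d}.png"
        if path.exists():
            continue  # already rendered in an earlier run
        jobs.append({"prompt": prompt, "seed": base_seed + i, "path": str(path)})
    return jobs


if __name__ == "__main__":
    demo = ["App icon, microphone", "Splash screen, sound waves"]
    for job in plan_jobs(demo):
        print(job["seed"], job["path"], job["prompt"])
```

Each job's `seed` can be passed to `torch.Generator("cuda").manual_seed(...)` exactly as in the loop, so a skipped or re-run image keeps the seed it would have had in the original batch.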

---

## Real-World Use Cases for Your Projects

### 1. App Store Assets

Generate icon concepts, feature graphics, and screenshot backgrounds:

```python
# LysnrAI app icon variations
prompts = [
    "Minimal app icon, single microphone, dark navy background, glowing teal outline, flat design",
    "App icon, sound wave forming a brain shape, dark background, blue to teal gradient",
    "App icon, headphones with AI sparkles, dark background, modern glassmorphism",
]
```
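
Writing every icon prompt by hand gets tedious once you want to cross several subjects with several styles. One possible approach, sketched here with the standard library only, is to combine the parts programmatically; the `SUBJECTS`, `STYLES`, and `SHARED` values and the `build_prompts` helper are illustrative assumptions.

```python
"""Build prompt variations by combining subjects with shared style tags."""
from itertools import product

SUBJECTS = [
    "single microphone",
    "sound wave forming a brain shape",
    "headphones with AI sparkles",
]
STYLES = ["flat design", "modern glassmorphism"]
SHARED = "app icon, dark navy background, teal accents"


def build_prompts(subjects, styles, shared):
    """Return one prompt per (subject, style) pair."""
    return [f"{subject}, {shared}, {style}" for subject, style in product(subjects, styles)]


if __name__ == "__main__":
    for prompt in build_prompts(SUBJECTS, STYLES, SHARED):
        print(prompt)
```

The resulting list can be used as the `prompts` list in the batch generation script.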

### 2. UI/UX Mockup Exploration

Rapidly prototype visual ideas before coding:

```python
# Generate dashboard layout concepts
prompt = """
Web dashboard design, dark theme (#06070A background), sidebar navigation,
main content area with 3 cards showing brain categories,
teal and blue accent colors, modern glassmorphism,
high fidelity UI design, clean typography
"""
```

### 3. Marketing and Social Media

```python
# Blog post hero images
prompt = "Abstract AI visualization, neural network nodes connected by light beams, dark background, blue and teal colors, cinematic lighting"

# Social media posts
prompt = "Infographic style, voice AI assistant concept, microphone with sound waves transforming into text, dark modern design"
```

### 4. Concept Art for Features

```python
# Visualize MindLyst "Brains" concept
prompts = [
    "Digital brain labeled 'Work', organized files and charts floating around it, dark theme, blue glow",
    "Digital brain labeled 'Health', fitness and medical icons floating around it, dark theme, green glow",
    "Digital brain labeled 'Finance', money and graphs floating around it, dark theme, gold glow",
]
```

### 5. ControlNet (Image-Guided Generation)

Use an existing image as a structural guide:

```python
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

# Load ControlNet for edge-guided generation
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
)

# Attach the ControlNet to the SDXL base pipeline
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Use your existing dashboard screenshot as a structural guide:
# pass a Canny edge map of it as `image=` when calling `pipe(...)`
# to generate a redesigned version with a new visual style
```

---

## Benefits

| Benefit | Description |
| --------------------------- | ---------------------------------------------------------------- |
| **Speed** | 5–8 s per SDXL image (vs. ~30 s on the Mac M4 Pro, or waiting on cloud APIs) |
| **Cost** | $0 per image beyond electricity (vs. ~$0.04–0.08 per image for DALL-E 3, or a $10–30/month Midjourney subscription) |
| **Privacy** | Generated locally — no images sent to cloud |
| **Control** | Full parameter control: steps, guidance, seed, resolution |
| **Reproducibility** | Same seed, settings, and software stack reproduce the same image |
| **Customization** | LoRA fine-tuning for brand-specific styles |
| **Batch capability** | Generate hundreds of variations overnight |
| **No content restrictions** | No cloud content policies limiting your output |

### Cost Comparison (100 images)

| Method | Cost | Time |
| -------------------------------- | ------------ | -------------- |
| DALL-E 3 API | $4–8 | ~5 min (cloud) |
| Midjourney | $10–30/month | ~5 min (cloud) |
| Mac M4 Pro (SDXL, local) | $0 | ~40–55 min |
| **Razer RTX 5090 (SDXL, local)** | **$0** | **~8–13 min** |
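
The local rows in the cost comparison follow directly from the per-image time: 100 images at 5–8 s each is roughly 8–13 minutes. A small sketch of that arithmetic, so you can plug in your own batch size; the per-image API price range used here (~$0.04–0.08) is an assumption taken from the DALL-E 3 row, and the function names are illustrative.

```python
"""Estimate wall-clock time and API cost for a batch of images."""


def batch_minutes(n_images, seconds_per_image):
    """Total generation time in minutes, ignoring model load time."""
    return n_images * seconds_per_image / 60


def api_cost(n_images, price_per_image):
    """Total fee in dollars for a pay-per-image API."""
    return n_images * price_per_image


if __name__ == "__main__":
    n = 100
    fast, slow = batch_minutes(n, 5), batch_minutes(n, 8)
    print(f"Local RTX 5090: {fast:.0f}-{slow:.0f} min, no per-image fee")
    print(f"Cloud API: ${api_cost(n, 0.04):.0f}-${api_cost(n, 0.08):.0f}")
```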

---

## Skills You'll Learn

| Skill | What You'll Learn | Career Value |
| ------------------------- | --------------------------------------------------- | ---------------------------- |
| **Diffusion models** | How iterative denoising generates images from noise | Core generative AI knowledge |
| **Prompt engineering** | Crafting effective text prompts for visual output | Universal AI skill |
| **ControlNet** | Structural guidance for image generation | Advanced image AI |
| **LoRA training** | Fine-tuning image models for specific styles | Generative AI customization |
| **ComfyUI workflows** | Node-based visual programming for AI pipelines | Production image generation |
| **Image post-processing** | Upscaling, inpainting, outpainting techniques | Digital content creation |
| **VRAM optimization** | Model offloading, attention optimization, tiling | GPU memory management |
| **Batch automation** | Scripting large-scale image generation | Production engineering |
| **Model selection** | SD 1.5 vs SDXL vs FLUX — trade-offs | Practical AI judgment |

---

## Advanced: FLUX.1 (Latest Generation)

FLUX.1 is a newer open-weight image model family from Black Forest Labs (founded by former Stability AI researchers). It generally delivers better prompt adherence and image quality than SDXL.

```bash
# FLUX.1 [schnell] — fast, 4 steps
# FLUX.1 [dev] — high quality, 50 steps (non-commercial license)

# The 12B transformer plus the T5 text encoder exceed 24 GB in 16-bit,
# so offload idle components to CPU RAM (or use a quantized variant)
python3 -c "
from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    'black-forest-labs/FLUX.1-schnell',
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe(
    'A futuristic AI assistant interface, holographic UI, dark theme',
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save('flux_test.png')
"
```

---

## Next Steps

- [ ] Install ComfyUI in WSL2 and verify CUDA acceleration
- [ ] Download SDXL base model and generate first test image
- [ ] Generate app icon concepts for LysnrAI and MindLyst
- [ ] Try ControlNet with an existing dashboard screenshot
- [ ] Experiment with FLUX.1 [schnell] for higher quality
- [ ] Create a batch script for generating marketing assets
- [ ] Build a style LoRA trained on your brand colors/aesthetic
@ -0,0 +1,373 @@
# 7. Multi-GPU Workloads (Future Path)

> **RTX 5090:** Your first serious CUDA GPU — a stepping stone to multi-GPU and cloud GPU workflows
> **Why it matters:** The skills, code, and workflows you build on one GPU translate directly to multi-GPU and cloud infrastructure

---

## Why Think About Multi-GPU Now?

You don't need multiple GPUs today. But learning on a single RTX 5090 builds skills that scale directly:

```
┌──────────────────────────────────────────────────────────────────────┐
│                           GPU Scaling Path                           │
│                                                                      │
│  TODAY                                                               │
│  ┌─────────────────────────────────┐                                 │
│  │ RTX 5090 Laptop (24 GB VRAM)    │  ← You are here                 │
│  │ Single GPU, local inference     │                                 │
│  │ Fine-tuning up to 13B models    │                                 │
│  └────────────────┬────────────────┘                                 │
│                   │                                                  │
│  NEAR FUTURE      ▼                                                  │
│  ┌─────────────────────────────────┐                                 │
│  │ Desktop + eGPU or 2× GPU tower  │                                 │
│  │ 48–96 GB total VRAM             │                                 │
│  │ Fine-tune 70B models            │                                 │
│  │ Run 2–3 models simultaneously   │                                 │
│  └────────────────┬────────────────┘                                 │
│                   │                                                  │
│  CLOUD / PROD     ▼                                                  │
│  ┌─────────────────────────────────┐                                 │
│  │ Cloud GPU instances             │                                 │
│  │ A100/H100 × 2–8 (80–640 GB)     │                                 │
│  │ Train large models              │                                 │
│  │ Serve at scale                  │                                 │
│  └─────────────────────────────────┘                                 │
│                                                                      │
│  SAME CUDA code works at every level ↑                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

---

## What Multi-GPU Enables

| Capability | Single GPU (24 GB) | 2× GPU (48 GB) | 4× GPU (96 GB) | Cloud (640 GB) |
| --------------------- | ------------------ | --------------- | --------------- | -------------- |
| 7B inference | ✅ Very fast | ✅ + concurrent | ✅ + concurrent | ✅ at scale |
| 32B inference | ✅ Fits in VRAM | ✅ Very fast | ✅ Very fast | ✅ at scale |
| 70B inference | ⚠️ Partial GPU | ✅ Full GPU | ✅ Very fast | ✅ at scale |
| 7B fine-tune (QLoRA) | ✅ Comfortable | ✅ Faster | ✅ Faster | ✅ Fastest |
| 13B fine-tune (QLoRA) | ✅ Fits | ✅ Comfortable | ✅ Fast | ✅ Fastest |
| 70B fine-tune (QLoRA) | ❌ OOM | ⚠️ Tight | ✅ Fits | ✅ Comfortable |
| 7B full fine-tune | ❌ OOM | ⚠️ Tight | ✅ Fits | ✅ Comfortable |
| Serve 3+ models | ❌ VRAM limit | ✅ Yes | ✅ Yes | ✅ Yes |
| SDXL + LLM concurrent | ⚠️ Tight | ✅ Yes | ✅ Yes | ✅ Yes |
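
Most of the inference rows in this table come down to one question: do the model's weights fit in the available VRAM? A rough back-of-the-envelope sketch follows. The 20% overhead factor for KV cache and activations is an assumption, not a measured value, and the function names are illustrative.

```python
"""Rough VRAM check: do a model's weights fit on the available GPUs?"""


def weights_gb(params_billions, bits_per_weight):
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, vram_gb, overhead=1.2):
    """True if weights plus an assumed 20% overhead fit in vram_gb."""
    return weights_gb(params_billions, bits_per_weight) * overhead <= vram_gb


if __name__ == "__main__":
    for name, params in [("7B", 7), ("32B", 32), ("70B", 70)]:
        size = weights_gb(params, 4)
        print(f"{name} at 4-bit: ~{size:.0f} GB weights, "
              f"fits 24 GB: {fits(params, 4, 24)}, fits 48 GB: {fits(params, 4, 48)}")
```

Under these assumptions a 4-bit 70B model (~35 GB of weights) does not fit in 24 GB but does fit in 48 GB, which matches the "Partial GPU" and "Full GPU" entries in the 70B inference row.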

---

## Expansion Options

### Option 1: eGPU (Thunderbolt/USB4)

Connect an external GPU to your Razer Blade via Thunderbolt 4:

```
┌──────────────────────────────────┐     Thunderbolt 4     ┌──────────────────────┐
│ Razer Blade 18                   │◄═══(~40 Gbps)════════►│ eGPU Enclosure       │
│ RTX 5090 (24 GB) — internal      │                       │ RTX 4090 (24 GB)     │
│                                  │                       │ or RTX 5090 (32 GB)  │
└──────────────────────────────────┘                       └──────────────────────┘

Total VRAM: 48 GB (24 + 24) with an RTX 4090, 56 GB (24 + 32) with a desktop RTX 5090
Limitation: Thunderbolt 4 (~40 Gbps) is far slower than a PCIe 5.0 x16 slot (~512 Gbps)
Best for: Model serving (latency-tolerant), not training (bandwidth-sensitive)
```
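
To make the bandwidth limitation concrete, the sketch below estimates how long it takes to push model weights across each link. It uses nominal link rates, taking PCIe 5.0 x16 as roughly 512 Gbps (32 GT/s per lane across 16 lanes) and Thunderbolt 4 at its 40 Gbps headline figure; real throughput is lower on both, so treat the output as a best case. The names are illustrative.

```python
"""Compare how long it takes to move model weights over different links."""

# Nominal link rates in gigabits per second (best case, not measured)
LINKS_GBPS = {
    "Thunderbolt 4": 40,
    "PCIe 5.0 x16": 512,
}


def transfer_seconds(size_gb, link_gbps):
    """Seconds to move size_gb gigabytes over a link rated in gigabits/s."""
    return size_gb * 8 / link_gbps


if __name__ == "__main__":
    size_gb = 19  # e.g. a 32B model at 4-bit
    for name, rate in LINKS_GBPS.items():
        print(f"{name}: {transfer_seconds(size_gb, rate):.1f} s for {size_gb} GB")
```

A one-time model load over Thunderbolt is tolerable, which is why serving works; training exchanges data between GPUs on every step, where the slower link becomes the bottleneck.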

**Recommended eGPU enclosures:**

| Enclosure | Price | GPU Support |
|-----------|-------|-------------|
| Razer Core X | ~$300 | Full-length desktop GPUs |
| Sonnet Breakaway | ~$250 | Most desktop GPUs |
| Akitio Node | ~$200 | Compact form factor |

### Option 2: Desktop Build (Maximum Performance)

Build a dedicated GPU workstation:

```
┌──────────────────────────────────────────────────────────────────────┐
│                       Desktop GPU Workstation                        │
│                                                                      │
│  Motherboard: ASUS/MSI with 2–4× PCIe 5.0 x16 slots                  │
│  CPU: Intel i9 or AMD Ryzen 9 (enough PCIe lanes)                    │
│  RAM: 128 GB DDR5                                                    │
│  GPU 1: RTX 5090 (32 GB GDDR7)                                       │
│  GPU 2: RTX 5090 (32 GB GDDR7)                                       │
│  Total VRAM: 64 GB                                                   │
│  PSU: 1600W+ (two 5090s draw ~1,150W under load)                     │
│                                                                      │
│  Cost: ~$5,000–7,000                                                 │
│  Performance: Near-datacenter for most workloads                     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Option 3: Cloud GPU (On-Demand)

Rent GPU time when you need it:

| Provider | GPU | VRAM | Price/Hour | Best For |
| ----------- | --------- | ------ | ---------- | -------------------- |
| Lambda Labs | A100 80GB | 80 GB | ~$1.10 | Training |
| RunPod | A100 80GB | 80 GB | ~$1.64 | Training + inference |
| Vast.ai | RTX 4090 | 24 GB | ~$0.30 | Budget inference |
| AWS (p4d) | A100 ×8 | 640 GB | ~$32 | Large-scale training |
| Together AI | H100 | 80 GB | ~$2.50 | Fine-tuning API |

**Your RTX 5090 code carries over to cloud GPUs largely unchanged**: same PyTorch, same CUDA APIs.

---

## How to Prepare Now (Single GPU)

### 1. Write GPU-Agnostic Code

Structure your code so it works with any number of GPUs:

```python
"""GPU-agnostic model loading — works with 1 or N GPUs."""
import torch


def get_device():
    """Select the best available device."""
    if torch.cuda.is_available():
        if torch.cuda.device_count() > 1:
            print(f"Found {torch.cuda.device_count()} GPUs")
        # "cuda" means the current CUDA device (GPU 0 by default);
        # spreading work across GPUs requires the wrapper below
        return "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


device = get_device()
model = torch.nn.Linear(512, 512).to(device)  # placeholder; substitute your own model

# Wrap for multi-GPU (skipped on a single GPU). DataParallel is the simplest
# option; PyTorch recommends DistributedDataParallel for real training runs
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
```

### 2. Learn Model Parallelism Concepts

```python
"""Layer-wise model parallelism — split model layers across GPUs."""
# Example: split a large model across 2 GPUs

# GPU 0: layers 0–15
# GPU 1: layers 16–31

# With Hugging Face Accelerate:
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# "./model-weights" is a placeholder for a local checkpoint directory
config = AutoConfig.from_pretrained("./model-weights")

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./model-weights",
    device_map="auto",  # Automatically distributes across available GPUs
)
```
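
The "GPU 0: layers 0–15, GPU 1: layers 16–31" split can be worked out by hand, which helps build intuition for what `device_map="auto"` decides for you. The sketch below is a simplified illustration that assigns consecutive layers in proportion to each GPU's VRAM; it is not how Accelerate computes its device map (which also accounts for per-layer sizes), and `split_layers` is a hypothetical helper.

```python
"""Assign consecutive layers to GPUs in proportion to their VRAM."""


def split_layers(n_layers, vram_gb):
    """Return one range of layer indices per GPU, in GPU order."""
    total = sum(vram_gb)
    ranges, start = [], 0
    for i, vram in enumerate(vram_gb):
        if i == len(vram_gb) - 1:
            end = n_layers  # last GPU takes whatever remains
        else:
            end = min(n_layers, start + round(n_layers * vram / total))
        ranges.append(range(start, end))
        start = end
    return ranges


if __name__ == "__main__":
    for gpu, layers in enumerate(split_layers(32, [24, 24])):
        print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

With two equal 24 GB GPUs and 32 layers this reproduces the 0–15 / 16–31 split from the comments in the example.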

### 3. Ollama Multi-GPU (Already Supported)

Ollama can split a single large model across multiple GPUs:

```bash
# When you have 2 GPUs, Ollama auto-detects and splits
# A 70B model at 4-bit (~40 GB) is spread across both GPUs;
# any layers that do not fit in VRAM spill to system RAM

# Or manually control GPU assignment
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Check GPU allocation
nvidia-smi # Shows VRAM usage per GPU
```
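
For scripted checks, `nvidia-smi` can emit machine-readable CSV instead of its default table. A minimal sketch that parses it into per-GPU usage follows; the sample line in the docstring is made up for illustration, and `parse_gpu_csv` is a hypothetical helper name.

```python
"""Report per-GPU VRAM usage by parsing nvidia-smi's CSV output."""
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines like '0, NVIDIA GeForce RTX 5090, 19012, 24564' (values in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for gpu in parse_gpu_csv(out):
            pct = 100 * gpu["used_mib"] / gpu["total_mib"]
            print(f"GPU {gpu['index']} ({gpu['name']}): "
                  f"{gpu['used_mib']} / {gpu['total_mib']} MiB ({pct:.0f}%)")
```

Running this while a model is loaded shows how Ollama distributed it: with two GPUs you should see memory in use on both indices.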

### 4. vLLM (High-Throughput Inference Server)

```bash
# vLLM supports tensor parallelism across GPUs
pip install vllm

# Serve a model across 2 GPUs
# Note: this 70B checkpoint needs ~140 GB in 16-bit; on 2× 24 GB GPUs,
# substitute a quantized (AWQ/GPTQ) or smaller model
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9

# API compatible with OpenAI format
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "prompt": "Hello",
    "max_tokens": 100
  }'
```
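
The same request can be made from Python without extra dependencies. This is a minimal sketch using only the standard library, assuming the server from the example is listening on `localhost:8000`; `build_request` and `complete` are illustrative helper names, and the response is read according to the OpenAI completions format (`choices[0].text`).

```python
"""Call an OpenAI-compatible /v1/completions endpoint with the standard library."""
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: the port used in the curl example


def build_request(model, prompt, max_tokens=100):
    """Build the POST request without sending it."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, max_tokens=100):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("meta-llama/Meta-Llama-3.1-70B-Instruct", "Hello"))` is the equivalent of the curl call. Because the client only depends on the URL, the same code works whether the model is served from one GPU, two GPUs, or a cloud instance.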

---

## Scaling Scenarios for Your Projects

### Scenario 1: LysnrAI at Scale

```
TODAY (1× RTX 5090):
  - 1 user, local inference
  - Whisper + Ollama + TTS, one at a time

FUTURE (2× GPU desktop):
  - GPU 0: Ollama (always-on coding model)
  - GPU 1: Whisper + TTS (dedicated)
  - Run all 3 workloads concurrently

PRODUCTION (Cloud):
  - vLLM serving on A100
  - Whisper on dedicated GPU
  - TTS on dedicated GPU
  - Handles 100+ concurrent users
```

### Scenario 2: Fine-Tuning Larger Models

```
TODAY (24 GB VRAM):
  - QLoRA 7B–13B models
  - Training time: 1–4 hours

FUTURE (48 GB VRAM):
  - QLoRA 70B models
  - LoRA FP16 32B models
  - Training time: 4–12 hours

CLOUD (80+ GB VRAM):
  - Full fine-tune 7B–13B models
  - QLoRA 70B+ models
  - Training time: hours with H100
```

### Scenario 3: Image + Text Generation Pipeline

```
TODAY (1× GPU):
  - SDXL OR LLM, not both at once (VRAM constraint)

FUTURE (2× GPU):
  - GPU 0: LLM (Ollama, 32B model, ~19 GB)
  - GPU 1: SDXL + ControlNet (~10 GB)
  - Generate images guided by LLM descriptions

PRODUCTION:
  - Automated content pipeline:
    LLM writes description → SDXL generates image → TTS adds voiceover
```

---

## Benefits of Starting Single-GPU

| Benefit | Description |
| ------------------------ | ------------------------------------------------------------------- |
| **Code portability** | Single-GPU PyTorch/CUDA code is the base for multi-GPU code; scaling adds a distribution layer (DDP, FSDP, tensor parallelism) rather than a rewrite |
| **Skill foundation** | Memory management, profiling, optimization skills transfer directly |
| **Cost efficiency** | Learn on local hardware ($0) before renting cloud ($$$) |
| **Workflow development** | Build training pipelines, inference servers, batch scripts now |
| **Hardware literacy** | Understand VRAM limits, bandwidth, PCIe bottlenecks |

---

## Skills You'll Build Toward

| Skill | Single GPU (Now) | Multi-GPU (Future) | Career Impact |
| ------------------------ | ------------------ | ---------------------------- | ----------------------- |
| **CUDA programming** | Kernels, memory | NCCL, all-reduce | ML Infrastructure |
| **Model parallelism** | Understand concept | Implement tensor/pipeline | Senior ML Engineer |
| **Distributed training** | Data loading | Multi-node coordination | ML Platform Engineer |
| **Inference serving** | Ollama, local API | vLLM, Triton, load balancing | MLOps / Production |
| **GPU monitoring** | nvidia-smi, nvtop | Cluster monitoring | DevOps / SRE |
| **Cost optimization** | VRAM budget | Spot instances, right-sizing | FinOps / ML Engineering |
| **Batch scheduling** | Cron jobs | Kubernetes, Slurm | ML Platform |

### Learning Path

```
┌──────────────────────────────────────────────────────────────────────┐
│                          Skills Progression                          │
│                                                                      │
│  Level 1 (Now — RTX 5090 Single GPU)                                 │
│  ├── PyTorch + CUDA basics                                           │
│  ├── Ollama model serving                                            │
│  ├── QLoRA fine-tuning 7B models                                     │
│  ├── nvidia-smi monitoring                                           │
│  └── TensorRT basic optimization                                     │
│                                                                      │
│  Level 2 (6 months — Same GPU, deeper skills)                        │
│  ├── Custom Triton kernels                                           │
│  ├── vLLM inference server                                           │
│  ├── Advanced quantization (AWQ, GPTQ)                               │
│  ├── Profiling and optimization                                      │
│  └── Model merging and adapter stacking                              │
│                                                                      │
│  Level 3 (12 months — Multi-GPU or Cloud)                            │
│  ├── Multi-GPU inference (tensor parallelism)                        │
│  ├── Distributed training (DDP, FSDP)                                │
│  ├── Cloud GPU workflow (Lambda, RunPod)                             │
│  ├── Production serving with autoscaling                             │
│  └── NCCL and multi-node communication                               │
│                                                                      │
│  Level 4 (Future — Production ML)                                    │
│  ├── Kubernetes + GPU scheduling                                     │
│  ├── Model serving at scale (thousands of requests/sec)              │
│  ├── Training pipelines with experiment tracking                     │
│  ├── Custom model architectures                                      │
│  └── Open-source ML contributions                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

---

## Cost Planning

### Single GPU (Current)

| Item | Cost | Status |
| ---------------------------------- | ----------- | --------- |
| Razer Blade 18 RTX 5090 | $5,200 | Purchased |
| Electricity (~200W avg, 8 hrs/day) | ~$15/month | Ongoing |
| **Total first year** | **~$5,380** | |

### Desktop Upgrade (Future)

| Item | Estimated Cost |
| -------------------------- | -------------- |
| Desktop tower + PSU + RAM | ~$1,500 |
| RTX 5090 desktop GPU | ~$2,000 |
| Second RTX 5090 (optional) | ~$2,000 |
| **Total (2× GPU desktop)** | **~$5,500** |

### Cloud Alternative (Per-Use)

| Usage Pattern | Monthly Cost |
| ----------------------- | ------------ |
| 10 hours/month on A100 | ~$11 |
| 40 hours/month on A100 | ~$44 |
| 160 hours/month on A100 | ~$176 |
| Always-on A100 | ~$792 |

**Break-even vs desktop:** at ~$1.10/hour, the ~$5,500 desktop equals about 5,000 A100-hours: roughly 7 months of always-on use, ~31 months at 160 hours/month, and over 10 years at 40 hours/month (ignoring electricity and resale value).
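
A break-even estimate is just the hardware price divided by the monthly cloud bill. The sketch below uses the ~$5,500 desktop total and the ~$1.10/hour A100 rate from the tables above; it deliberately ignores electricity, resale value, and the performance difference between the two setups, so treat it as a first approximation. `break_even_months` is an illustrative name.

```python
"""Months until buying hardware costs less than renting cloud GPU hours."""


def break_even_months(hardware_cost, hourly_rate, hours_per_month):
    """Hardware cost divided by the monthly cloud spend."""
    return hardware_cost / (hourly_rate * hours_per_month)


if __name__ == "__main__":
    for hours in (40, 160, 720):
        months = break_even_months(5500, 1.10, hours)
        print(f"{hours:>3} h/month: ~{months:.0f} months to break even")
```

Under these assumptions, occasional use strongly favors renting; owning hardware only pays off when the GPU is busy most of the month.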

---

## Next Steps

- [ ] Write all training and inference scripts to be GPU-count agnostic
- [ ] Install and test vLLM on single GPU with Llama 3.1 8B
- [ ] Practice monitoring GPU memory and compute utilization
- [ ] Try model offloading: run a 70B model with partial CPU/GPU split
- [ ] Explore Lambda Labs or RunPod for a cloud GPU test ($5–10 experiment)
- [ ] Benchmark single GPU throughput to establish a baseline for comparison
52
__LOCAL_LLMs/windows_specific/capabilities/README.md
Normal file
@ -0,0 +1,52 @@
# RTX 5090 Capabilities — Deep Dive Guides

> What you can do with the Razer Blade 18's RTX 5090 (24 GB GDDR7) that you can't (or can't do well) on the Mac.

Each guide covers: **what it is → how to use it → real-world use cases → benefits → skills you'll learn.**

---

## Guides

| # | Capability | Key Benefit | Skill Level |
| --------------------------------------- | --------------------------------- | ------------------------------------ | ------------ |
| [01](01-gpu-inference-speed.md) | **GPU Inference Speed** | 2–4× faster LLM responses | Beginner |
| [02](02-whisper-batch-transcription.md) | **Whisper Batch Transcription** | Hours of audio in minutes | Beginner |
| [03](03-tts-generation-at-scale.md) | **TTS Generation at Scale** | Faster-than-realtime voice synthesis | Beginner |
| [04](04-fine-tuning-training.md) | **Fine-Tuning / Training** | Customize models on your own data | Intermediate |
| [05](05-cuda-tensorrt-ml-research.md) | **CUDA / TensorRT / ML Research** | Full NVIDIA ML toolchain | Intermediate |
| [06](06-stable-diffusion-image-gen.md) | **Stable Diffusion / Image Gen** | 5–8s per image, unlimited free | Beginner |
| [07](07-multi-gpu-workloads.md) | **Multi-GPU Workloads (Future)** | Scaling path to production | Advanced |

---

## Suggested Learning Order

```
Week 1:  01 (Inference) → 02 (Whisper) → 03 (TTS)
         Get familiar with the GPU, benchmark your models

Week 2:  06 (Stable Diffusion)
         Set up ComfyUI, generate app assets

Week 3:  04 (Fine-Tuning)
         QLoRA your first 7B model on your own code

Week 4:  05 (CUDA / TensorRT)
         Deeper GPU programming, profiling, optimization

Ongoing: 07 (Multi-GPU)
         Reference as you plan scaling
```

---

## Prerequisites

All guides assume you've completed the [Windows setup](../setup-guide.md):

- NVIDIA drivers installed (Windows)
- Ollama installed and running (Windows)
- WSL2 Ubuntu 24.04 set up
- Repo cloned, `setup-tts.sh` completed
- Dashboard running at `http://localhost:3000`