From 6d18344fe0a076eb55fa47232a4c15273484db85 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Sat, 21 Feb 2026 20:36:21 -0800
Subject: [PATCH] docs(local-llms): add 7 RTX 5090 capability deep-dive guides

New capabilities/ subfolder with detailed guides:
- 01: GPU inference speed (benchmarks, Ollama tuning, API usage)
- 02: Whisper batch transcription (scripts, Python integration, use cases)
- 03: TTS generation at scale (Orpheus + Qwen3, batch scripts, voice cloning)
- 04: Fine-tuning / training (LoRA, QLoRA, data prep, Ollama export)
- 05: CUDA / TensorRT / ML research (toolchain setup, Triton kernels, profiling)
- 06: Stable Diffusion / image gen (ComfyUI, SDXL, FLUX, batch generation)
- 07: Multi-GPU workloads (scaling path, eGPU, cloud, cost planning)
- README: index with learning order and prerequisites

Each guide covers: what it is, how to use it, benefits, skills to learn
---
 .../capabilities/01-gpu-inference-speed.md  | 174 ++++++++
 .../02-whisper-batch-transcription.md       | 306 ++++++++++++++
 .../03-tts-generation-at-scale.md           | 303 ++++++++++++++
 .../capabilities/04-fine-tuning-training.md | 322 +++++++++++++++
 .../05-cuda-tensorrt-ml-research.md         | 382 ++++++++++++++++++
 .../06-stable-diffusion-image-gen.md        | 325 +++++++++++++++
 .../capabilities/07-multi-gpu-workloads.md  | 373 +++++++++++++++++
 .../windows_specific/capabilities/README.md |  52 +++
 8 files changed, 2237 insertions(+)
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/05-cuda-tensorrt-ml-research.md
 create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md
 create mode 100644
__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md create mode 100644 __LOCAL_LLMs/windows_specific/capabilities/README.md diff --git a/__LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md b/__LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md new file mode 100644 index 00000000..c759f8c7 --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/01-gpu-inference-speed.md @@ -0,0 +1,174 @@ +# 1. Raw GPU Inference Speed + +> **RTX 5090:** 2–4× faster inference on all models ≤32B compared to Mac M4 Pro +> **Why it matters:** Faster coding suggestions, faster conversations, faster iteration + +--- + +## What Is GPU Inference? + +When you run a model like `qwen2.5-coder:32b` through Ollama, the GPU does the heavy lifting — multiplying billions of numbers (matrix operations) to generate each token of the response. The speed of this process depends on: + +1. **VRAM bandwidth** — how fast data moves within the GPU +2. **Compute cores** — how many operations run in parallel +3. **VRAM capacity** — whether the full model fits without spilling to CPU RAM + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Token Generation Pipeline │ +│ │ +│ Prompt → [Tokenize] → [GPU: Matrix Multiply] → [Sample] → Token │ +│ ▲ │ +│ │ │ +│ This is the bottleneck. │ +│ RTX 5090 does this 2–4× faster. 
│
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Performance: Mac vs Razer
+
+| Model                     | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup |
+| ------------------------- | ---------------- | --------------------- | ------- |
+| llama3.1:8b (4.9 GB)      | ~50–70 tok/s     | ~100–150 tok/s        | ~2×     |
+| qwen2.5-coder:7b (4.7 GB) | ~40–60 tok/s     | ~80–120 tok/s         | ~2×     |
+| qwen2.5-coder:32b (19 GB) | ~15–25 tok/s     | ~40–60 tok/s          | ~2.5×   |
+| deepseek-r1:32b (19 GB)   | ~15–25 tok/s     | ~40–60 tok/s          | ~2.5×   |
+| sematre/orpheus:en (4 GB) | ~realtime        | ~2–3× realtime        | ~2.5×   |
+
+### Why the RTX 5090 Is Faster
+
+```
+┌─────────────────────────┬──────────────────────┬──────────────────────────┐
+│ Metric                  │ Mac M4 Pro           │ RTX 5090 (laptop)        │
+├─────────────────────────┼──────────────────────┼──────────────────────────┤
+│ GPU memory bandwidth    │ ~273 GB/s (shared)   │ ~896 GB/s (GDDR7)        │
+│ Compute cores           │ 20 Metal cores       │ ~10,500 CUDA cores       │
+│ Tensor cores            │ None (Neural Engine) │ 5th-gen Tensor cores     │
+│ FP16 throughput         │ ~25 TFLOPS           │ ~200+ TFLOPS             │
+│ Model in memory?        │ Yes (unified 48 GB)  │ Yes (24 GB VRAM)         │
+└─────────────────────────┴──────────────────────┴──────────────────────────┘
+```
+
+The RTX 5090's GDDR7 bandwidth is ~3× higher, and it has ~8× the raw FP16 compute throughput. The actual speedup is 2–4× (not 8×) because token generation is mostly **memory-bandwidth bound**, not compute-bound: for each generated token the GPU has to read the model weights from VRAM, so it spends most of its time moving data rather than computing.
+
+---
+
+## How to Use It
+
+### Basic: Ollama (Already Set Up)
+
+Ollama runs natively on Windows and uses CUDA automatically. No extra config needed.
+ +```bash +# From WSL2 or Windows terminal +ollama run qwen2.5-coder:32b "Write a Fastify route that validates input with Zod" +``` + +### Interactive Coding Assistant + +```bash +# Start a conversation with the 32B coding model +ollama run qwen2.5-coder:32b + +# Or use the 7B model for quick questions (faster response start) +ollama run qwen2.5-coder:7b +``` + +### From the Dashboard + +```bash +cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs +bash start-dashboard.sh +# Open http://localhost:3000 — model status + inference visible +``` + +### Benchmark Your Actual Speed + +```bash +# Quick benchmark — measure tokens per second +time ollama run qwen2.5-coder:7b "Write a Python function that implements binary search" --verbose 2>&1 | tail -5 + +# Compare models +for model in llama3.1:8b qwen2.5-coder:7b qwen2.5-coder:32b; do + echo "=== $model ===" + ollama run "$model" "Hello world" --verbose 2>&1 | grep "eval rate" +done +``` + +### API Access (for Scripts/Apps) + +```bash +# Ollama exposes a REST API at localhost:11434 +curl -s http://localhost:11434/api/generate -d '{ + "model": "qwen2.5-coder:32b", + "prompt": "Explain CUDA memory hierarchy in 3 sentences", + "stream": false +}' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])" +``` + +--- + +## Benefits + +### For Your LysnrAI / MindLyst Projects + +- **Faster code generation** — 32B model responses in ~0.5–1.5s vs ~2–4s on Mac +- **More context in less time** — can process longer prompts without waiting +- **Better for agentic workflows** — LangGraph agents that call LLMs multiple times per step run 2–4× faster end-to-end +- **Batch processing** — generate embeddings, summaries, or classifications for hundreds of items quickly + +### For Daily Coding + +- **Near-instant small model responses** — 7B at 80–120 tok/s feels like reading speed +- **Viable 32B coding assistant** — 40–60 tok/s is fast enough for real-time pair programming +- **DeepSeek-R1 reasoning** — chain-of-thought at 
40–60 tok/s makes complex reasoning practical + +--- + +## Skills You'll Learn + +| Skill | What You'll Learn | Why It Matters | +| -------------------------- | -------------------------------------------------------------------- | -------------------------------------- | +| **GPU memory management** | How VRAM allocation works, model offloading, quantization trade-offs | Core ML engineering skill | +| **CUDA profiling** | Using `nvidia-smi`, `nvtop`, watching GPU utilization | Essential for optimizing AI workloads | +| **Quantization** | Q4 vs Q8 vs FP16 — speed/quality trade-offs | Industry-standard model deployment | +| **Inference optimization** | Batch size, context length, KV cache tuning | Key for production AI systems | +| **Model selection** | When to use 7B vs 32B vs 70B for different tasks | Practical AI engineering judgment | +| **REST API integration** | Building apps that call local LLM APIs | Directly applicable to LysnrAI backend | + +--- + +## Advanced: Tuning Ollama for Performance + +```bash +# Set number of GPU layers (default: all) +# Useful if you want to run 2 models with partial GPU offload +OLLAMA_NUM_GPU=999 ollama serve + +# Monitor GPU during inference +watch -n 0.5 nvidia-smi + +# Or install nvtop for a richer GPU monitor +sudo apt install nvtop +nvtop +``` + +### Ollama Environment Variables + +| Variable | Default | Description | +| -------------------------- | ------- | --------------------------------------- | +| `OLLAMA_NUM_PARALLEL` | 1 | Concurrent request slots | +| `OLLAMA_MAX_LOADED_MODELS` | 1 | Models kept in VRAM simultaneously | +| `OLLAMA_FLASH_ATTENTION` | true | Use flash attention (faster, less VRAM) | +| `OLLAMA_GPU_OVERHEAD` | 0 | Reserve VRAM (bytes) for other apps | + +--- + +## Next Steps + +- [ ] Benchmark all 5 models on the Razer and record actual tok/s +- [ ] Try `OLLAMA_NUM_PARALLEL=4` for concurrent requests +- [ ] Experiment with `OLLAMA_MAX_LOADED_MODELS=2` to keep 7B + 32B hot +- [ ] Build a simple script 
that compares Mac vs Razer inference times diff --git a/__LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md b/__LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md new file mode 100644 index 00000000..9aed7d86 --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/02-whisper-batch-transcription.md @@ -0,0 +1,306 @@ +# 2. Whisper Batch Transcription + +> **RTX 5090:** 8–15× realtime transcription vs 2–4× on Mac +> **Why it matters:** Hours of audio transcribed in minutes — unlocks bulk audio workflows + +--- + +## What Is Whisper? + +[Whisper](https://github.com/openai/whisper) is OpenAI's open-source speech-to-text model. We use [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — a high-performance C/C++ implementation that supports CUDA GPU acceleration. + +The `large-v3-turbo` model (~1.5 GB) delivers near-human accuracy across 99 languages. + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Whisper Transcription Pipeline │ +│ │ +│ Audio File (.wav/.mp3) │ +│ │ │ +│ ▼ │ +│ [FFmpeg: decode + resample to 16kHz mono] │ +│ │ │ +│ ▼ │ +│ [Whisper: Mel spectrogram → Encoder → Decoder → Text] │ +│ │ ▲ │ +│ │ │ GPU accelerates this (CUDA or Metal) │ +│ ▼ │ +│ Transcript (.txt / .srt / .vtt / .json) │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Performance: Mac vs Razer + +| Audio Length | Mac M4 Pro (Metal) | Razer RTX 5090 (CUDA) | Speedup | +| ------------ | ------------------ | --------------------- | ------- | +| 1 minute | ~15–30s | ~4–8s | ~3× | +| 10 minutes | ~2.5–5 min | ~40–80s | ~3× | +| 1 hour | ~15–30 min | ~4–8 min | ~3× | +| 10 hours | ~2.5–5 hours | ~40–80 min | ~3–4× | +| 100 hours | ~1–2 days | ~7–13 hours | ~3–4× | + +### Realtime Multiplier + +| Machine | Speed | Meaning | +| -------------- | --------------- | ------------------------ | +| Mac M4 Pro | ~2–4× realtime | 1 hour audio → 15–30 min | +| Razer 
RTX 5090 | ~8–15× realtime | 1 hour audio → 4–8 min |
+
+---
+
+## How to Use It
+
+### Single File Transcription
+
+```bash
+# Basic transcription
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav
+
+# With timestamps (SRT format for subtitles)
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -osrt
+
+# JSON output with word-level timestamps
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -ojf
+
+# Specify language (skip auto-detect for speed)
+whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l en
+```
+
+### Batch Transcription Script
+
+Create this script to transcribe an entire folder of audio files:
+
+```bash
+#!/bin/bash
+# batch-transcribe.sh: Transcribe all audio files in a directory
+# Usage: bash batch-transcribe.sh /path/to/audio/files
+
+INPUT_DIR="${1:-.}"
+MODEL="$HOME/whisper-models/ggml-large-v3-turbo.bin"
+OUTPUT_DIR="${INPUT_DIR}/transcripts"
+
+mkdir -p "$OUTPUT_DIR"
+
+echo "=== Batch Whisper Transcription ==="
+echo "Input: $INPUT_DIR"
+echo "Output: $OUTPUT_DIR"
+echo "Model: ggml-large-v3-turbo"
+echo ""
+
+DONE=0
+START_TIME=$(date +%s)
+
+# Collect files; nullglob drops patterns that match nothing
+# (a redirect such as 2>/dev/null is not valid inside a for-loop word list)
+shopt -s nullglob
+FILES=("$INPUT_DIR"/*.{wav,mp3,m4a,flac,ogg,webm})
+TOTAL=${#FILES[@]}
+
+echo "Found $TOTAL audio files"
+echo ""
+
+for f in "${FILES[@]}"; do
+  [ -f "$f" ] || continue
+  DONE=$((DONE + 1))
+
+  BASENAME=$(basename "$f" | sed 's/\.[^.]*$//')
+  echo "[$DONE/$TOTAL] Transcribing: $(basename "$f")"
+
+  # One pass writes both the plain-text transcript and the SRT subtitles
+  whisper-cli \
+    -m "$MODEL" \
+    -f "$f" \
+    -l en \
+    -otxt \
+    -osrt \
+    -of "$OUTPUT_DIR/$BASENAME" \
+    2>/dev/null
+
+  echo "  -> $OUTPUT_DIR/$BASENAME.txt"
+  echo "  -> $OUTPUT_DIR/$BASENAME.srt"
+  echo ""
+done
+
+END_TIME=$(date +%s)
+ELAPSED=$((END_TIME - START_TIME))
+echo "=== Done!
$DONE files in ${ELAPSED}s ===" +``` + +### Convert Non-WAV Audio First + +```bash +# Whisper works best with 16kHz mono WAV +# Convert any audio format with ffmpeg + +# Single file +ffmpeg -i podcast.mp3 -ar 16000 -ac 1 podcast.wav + +# Batch convert all MP3s in a folder +for f in *.mp3; do + ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mp3}.wav" +done +``` + +### Python Integration + +```python +"""Transcribe audio using whisper.cpp via subprocess.""" +import subprocess +import json +from pathlib import Path + +def transcribe(audio_path: str, language: str = "en") -> dict: + """Transcribe an audio file and return structured result.""" + model = Path.home() / "whisper-models" / "ggml-large-v3-turbo.bin" + output_base = Path(audio_path).stem + + result = subprocess.run( + [ + "whisper-cli", + "-m", str(model), + "-f", audio_path, + "-l", language, + "-ojf", # JSON with full metadata + "-of", output_base, + ], + capture_output=True, text=True, timeout=600, + ) + + # Read the JSON output + json_path = Path(f"{output_base}.json") + if json_path.exists(): + with open(json_path) as f: + return json.load(f) + + return {"error": result.stderr, "text": result.stdout} + +# Usage +result = transcribe("meeting-recording.wav") +print(result["transcription"][0]["text"]) +``` + +--- + +## Real-World Use Cases + +### 1. LysnrAI Voice Dictation Pipeline + +Your LysnrAI desktop app captures voice → sends to Whisper for transcription. On the Razer, this becomes near-instant: + +``` +Voice input (5 seconds) → Whisper (CUDA) → Text in <1 second +``` + +### 2. Meeting Transcription + +```bash +# Record a 1-hour Zoom meeting → get full transcript in ~5 minutes +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \ + -f zoom-meeting.wav -l en -otxt -osrt +``` + +### 3. Podcast / YouTube Processing + +```bash +# Download YouTube audio +yt-dlp -x --audio-format wav "https://youtube.com/watch?v=..." 
-o "video.wav" + +# Transcribe +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f video.wav -otxt -osrt +``` + +### 4. Subtitle Generation + +```bash +# Generate SRT subtitles for video +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \ + -f movie.wav -l en -osrt + +# Output: movie.srt — ready to import into video editors +``` + +### 5. Multi-Language Transcription + +```bash +# Auto-detect language +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav + +# Force specific language +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ja # Japanese +whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f audio.wav -l ta # Tamil +``` + +--- + +## Benefits + +| Benefit | Description | +| --------------- | ------------------------------------------------------------ | +| **Speed** | Process a full day of meetings in under an hour | +| **Privacy** | All transcription runs locally — no data leaves your machine | +| **Cost** | Zero API costs (vs $0.006/min for cloud Whisper API) | +| **Accuracy** | large-v3-turbo is near-human accuracy for English | +| **Offline** | Works without internet — useful on flights, trains | +| **Batch scale** | Process hundreds of files overnight | + +### Cost Comparison (100 hours of audio) + +| Method | Cost | Time | +| -------------------------- | ------ | ---------------- | +| OpenAI Whisper API | ~$36 | ~minutes (cloud) | +| Mac M4 Pro (local) | $0 | ~25–50 hours | +| **Razer RTX 5090 (local)** | **$0** | **~7–13 hours** | + +--- + +## Skills You'll Learn + +| Skill | What You'll Learn | Career Value | +| ---------------------------- | ------------------------------------------------------ | -------------------------------------- | +| **Audio processing** | Sample rates, codecs, mono/stereo, WAV vs compressed | Foundational for any audio/speech work | +| **Speech-to-text pipelines** | Mel spectrograms, encoder-decoder models, beam search | Core ML/NLP skill | +| **CUDA acceleration** | 
How GPU parallelism speeds up neural network inference | Top ML engineering skill | +| **Batch processing** | Shell scripting for processing thousands of files | DevOps / data engineering | +| **Subtitle formats** | SRT, VTT, JSON — standards for timed text | Media tech / accessibility | +| **Model quantization** | Understanding why ggml models are smaller and faster | ML deployment knowledge | +| **ffmpeg mastery** | Audio/video conversion, resampling, format detection | Essential multimedia tool | +| **Python subprocess** | Integrating CLI tools into Python applications | Backend engineering pattern | + +--- + +## Monitoring GPU During Transcription + +```bash +# Watch GPU utilization in real-time +watch -n 0.5 nvidia-smi + +# Or use nvtop for a richer view +sudo apt install nvtop +nvtop + +# Expected during Whisper: +# GPU Utilization: 80–95% +# VRAM Usage: ~2–3 GB (model + buffers) +# Power: ~150–200W +``` + +--- + +## Next Steps + +- [ ] Transcribe a test audio file and verify output quality +- [ ] Create `batch-transcribe.sh` in `__LOCAL_LLMs/scripts/` +- [ ] Benchmark: time a 10-minute file on Razer vs Mac +- [ ] Try multi-language transcription (English + Tamil) +- [ ] Integrate Whisper output into LysnrAI transcription pipeline +- [ ] Experiment with `whisper-cli --translate` for translation mode diff --git a/__LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md b/__LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md new file mode 100644 index 00000000..15f7bcd1 --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/03-tts-generation-at-scale.md @@ -0,0 +1,303 @@ +# 3. TTS Generation at Scale + +> **RTX 5090:** Qwen3-TTS at 2–4× realtime, Orpheus at 2–3× realtime +> **Why it matters:** Pre-generate audio libraries, build voice features, create content — all faster than real-time playback + +--- + +## What Is Local TTS? + +Text-to-Speech (TTS) converts written text into natural-sounding speech. 
Our stack has two engines: + +| Engine | Architecture | Size | Voices | Quality | +| ------------- | ------------------------------- | ----------- | ----------------------- | ------------------------------------- | +| **Orpheus** | Ollama-served, SNAC decoder | 4 GB | 8 English voices | Extremely natural, emotional | +| **Qwen3-TTS** | PyTorch model, direct inference | 0.6B params | 10 languages, cloneable | Multilingual, zero-shot voice cloning | + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ TTS Pipeline │ +│ │ +│ Orpheus: │ +│ Text → [Ollama: generate audio tokens] → [SNAC: decode to WAV] │ +│ ▲ GPU (CUDA) ▲ GPU (CUDA) │ +│ │ +│ Qwen3-TTS: │ +│ Text → [PyTorch model: text→mel→audio] → WAV file │ +│ ▲ GPU (CUDA / MPS) │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Performance: Mac vs Razer + +| Engine | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup | +| ---------------------------- | ---------------- | --------------------- | ------- | +| Orpheus (per sentence) | ~realtime | ~2–3× realtime | ~2.5× | +| Qwen3-TTS (per sentence) | ~realtime | ~2–4× realtime | ~3× | +| Orpheus (10 min narration) | ~10 min | ~3–5 min | ~2.5× | +| Qwen3-TTS (10 min narration) | ~10 min | ~2.5–5 min | ~3× | +| Batch: 100 sentences | ~5–8 min | ~2–3 min | ~3× | +| Batch: 1000 sentences | ~50–80 min | ~15–25 min | ~3× | + +**"2–4× realtime" means:** A 10-second sentence generates in 2.5–5 seconds. The audio is produced faster than you could listen to it. + +--- + +## How to Use It + +### Orpheus TTS (8 Natural Voices) + +Orpheus runs through Ollama + SNAC decoder. Already set up by `setup-tts.sh`. + +```bash +cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs + +# Generate speech with default voice (tara) +.venv-qwen-tts/bin/python test_orpheus_tts.py + +# Output: test_orpheus_tara.wav, test_orpheus_leah.wav, etc. 
+# Play: aplay test_orpheus_tara.wav +``` + +#### Available Orpheus Voices + +| Voice | Character | Best For | +| ------ | ------------------ | --------------------- | +| `tara` | Young female, warm | Narration, assistants | +| `leah` | Female, clear | Tutorials, guides | +| `jess` | Female, energetic | Announcements | +| `leo` | Male, calm | Narration, podcasts | +| `dan` | Male, professional | Business content | +| `mia` | Female, friendly | Conversational | +| `zac` | Male, young | Casual content | +| `zoe` | Female, neutral | General purpose | + +#### Custom Text with Orpheus (Python) + +```python +"""Generate speech from custom text using Orpheus TTS.""" +import json +import struct +import wave +import urllib.request + +OLLAMA_URL = "http://localhost:11434/api/generate" + +def generate_speech(text: str, voice: str = "tara", output_path: str = "output.wav"): + """Generate a WAV file from text using Orpheus via Ollama.""" + prompt = f"<|audio|>{voice}: {text}" + + payload = json.dumps({ + "model": "sematre/orpheus:en", + "prompt": prompt, + "stream": False, + "options": {"temperature": 0.6, "top_p": 0.9} + }).encode() + + req = urllib.request.Request( + OLLAMA_URL, + data=payload, + headers={"Content-Type": "application/json"} + ) + + with urllib.request.urlopen(req, timeout=120) as resp: + result = json.loads(resp.read()) + + # Decode audio tokens → SNAC → WAV + # (Full implementation in test_orpheus_tts.py) + print(f"Generated: {output_path}") + +# Example +generate_speech( + "Welcome to LysnrAI. 
Your voice-first productivity assistant.",
+    voice="tara",
+    output_path="welcome.wav"
+)
+```
+
+### Qwen3-TTS (Multilingual + Voice Cloning)
+
+```bash
+cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs
+
+# Generate speech with Qwen3-TTS
+.venv-qwen-tts/bin/python test_qwen_tts.py
+
+# Output: test_qwen3_tts_output.wav
+# Play: aplay test_qwen3_tts_output.wav
+```
+
+#### Qwen3-TTS Features
+
+| Feature             | Description                                                                               |
+| ------------------- | ----------------------------------------------------------------------------------------- |
+| **10 languages**    | English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian |
+| **Voice cloning**   | Provide a reference audio clip → model clones the voice                                   |
+| **Emotion control** | Adjust speaking style via prompt engineering                                              |
+| **0.6B parameters** | Small enough to run fast, large enough for quality                                        |
+
+### Batch TTS Generation Script
+
+```bash
+#!/bin/bash
+# batch-tts.sh: Generate audio for a list of sentences (one per line)
+# Usage: bash batch-tts.sh sentences.txt output_dir/ [voice]
+
+INPUT_FILE="${1:-sentences.txt}"
+OUTPUT_DIR="${2:-tts_output}"
+VOICE="${3:-tara}"
+
+mkdir -p "$OUTPUT_DIR"
+
+echo "=== Batch Orpheus TTS ==="
+echo "Input: $INPUT_FILE"
+echo "Output: $OUTPUT_DIR"
+echo "Voice: $VOICE"
+echo ""
+
+LINE_NUM=0
+# The "|| [ -n ... ]" clause keeps a final line that has no trailing newline
+while IFS= read -r line || [ -n "$line" ]; do
+  [ -z "$line" ] && continue
+  LINE_NUM=$((LINE_NUM + 1))
+  OUT_FILE="$OUTPUT_DIR/sentence_${LINE_NUM}_${VOICE}.wav"
+
+  echo "[$LINE_NUM] Generating: ${line:0:60}..."
+
+  # Pass text, voice, and path as arguments so that quotes or apostrophes
+  # in the sentence cannot break (or inject into) the Python snippet
+  .venv-qwen-tts/bin/python -c '
+import sys
+import test_orpheus_tts as tts
+tts.generate_and_save(sys.argv[1], sys.argv[2], sys.argv[3])
+' "$line" "$VOICE" "$OUT_FILE" 2>/dev/null
+
+  echo "  -> $OUT_FILE"
+done < "$INPUT_FILE"
+
+echo ""
+echo "=== Done! $LINE_NUM files generated ==="
+```
+
+---
+
+## Real-World Use Cases
+
+### 1.
LysnrAI Voice Responses + +Generate spoken responses from LLM output — the Razer can produce audio faster than the user can listen: + +``` +User asks question → LLM generates text → TTS converts to speech → User hears answer + ▲ + 2–4× realtime on RTX 5090 + Feels instant for short responses +``` + +### 2. Pre-Generated Audio Libraries + +Build a library of common phrases, greetings, and responses: + +```bash +# sentences.txt +Welcome to LysnrAI. +Your daily brief is ready. +You have three new memories to review. +Recording started. +Recording saved successfully. +Transcription complete. +``` + +```bash +# Generate all phrases in multiple voices +for voice in tara leo dan; do + bash batch-tts.sh sentences.txt audio_library/ "$voice" +done +``` + +### 3. Audiobook / Podcast Generation + +```python +# Split a document into paragraphs and generate audio for each +paragraphs = open("article.txt").read().split("\n\n") + +for i, para in enumerate(paragraphs): + generate_speech(para, voice="leo", output_path=f"chapter_{i:03d}.wav") + +# Concatenate with ffmpeg +# ffmpeg -f concat -i filelist.txt -c copy full_audiobook.wav +``` + +### 4. Multilingual Content (Qwen3-TTS) + +```python +# Generate the same message in multiple languages +messages = { + "en": "Welcome to MindLyst, your AI-powered life organizer.", + "ja": "MindLystへようこそ。AIライフオーガナイザーです。", + "es": "Bienvenido a MindLyst, tu organizador de vida con IA.", +} + +for lang, text in messages.items(): + generate_qwen_tts(text, output_path=f"welcome_{lang}.wav") +``` + +### 5. Voice Cloning (Qwen3-TTS) + +```python +# Clone a voice from a reference audio sample +# Provide a 5–15 second reference clip of the target voice +reference_audio = "my_voice_sample.wav" +text = "This is my cloned voice saying something new." 
+ +# Qwen3-TTS can reproduce the voice characteristics +generate_qwen_tts(text, reference=reference_audio, output_path="cloned.wav") +``` + +--- + +## Benefits + +| Benefit | Description | +| ----------------- | -------------------------------------------------------- | +| **Speed** | Generate audio faster than playback speed | +| **Privacy** | All voice generation runs locally — no cloud APIs | +| **Cost** | $0 vs $0.015/1K chars for cloud TTS (Google, ElevenLabs) | +| **Voice variety** | 8 Orpheus voices + unlimited Qwen3-TTS voice cloning | +| **Multilingual** | Qwen3-TTS supports 10 languages natively | +| **Offline** | Works without internet | +| **Customizable** | Control emotion, speed, voice characteristics | + +### Cost Comparison (Generate 1 hour of audio) + +| Method | Cost | Time | +| -------------------------- | ------- | ---------------- | +| ElevenLabs API | ~$15–30 | ~minutes (cloud) | +| Google Cloud TTS | ~$4–16 | ~minutes (cloud) | +| Mac M4 Pro (local) | $0 | ~60 min | +| **Razer RTX 5090 (local)** | **$0** | **~15–30 min** | + +--- + +## Skills You'll Learn + +| Skill | What You'll Learn | Career Value | +| ------------------------- | ------------------------------------------------------- | ------------------------------ | +| **Speech synthesis** | How neural TTS works (text→tokens→mel→audio) | Core speech/NLP skill | +| **Audio codecs** | SNAC, Encodec, WAV format, sample rates | Audio engineering fundamentals | +| **Voice cloning** | Zero-shot voice cloning techniques | Cutting-edge ML research area | +| **Batch processing** | Automating large-scale audio generation | Production engineering | +| **GPU memory** | Managing VRAM for concurrent TTS + LLM workloads | ML ops knowledge | +| **Audio post-processing** | ffmpeg: concatenation, normalization, format conversion | Multimedia engineering | +| **API design** | Building REST APIs around TTS engines | Backend engineering | +| **Multilingual NLP** | Cross-language text processing and 
pronunciation | Global product development | + +--- + +## Next Steps + +- [ ] Generate test audio with both Orpheus and Qwen3-TTS on Razer +- [ ] Create `batch-tts.sh` in `__LOCAL_LLMs/scripts/` +- [ ] Build a pre-generated audio library for LysnrAI common phrases +- [ ] Experiment with Qwen3-TTS voice cloning using your own voice +- [ ] Try generating audio in Tamil (Qwen3-TTS multilingual) +- [ ] Measure actual generation speed: words-per-second on each engine diff --git a/__LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md b/__LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md new file mode 100644 index 00000000..e6e8ae49 --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/04-fine-tuning-training.md @@ -0,0 +1,322 @@ +# 4. Fine-Tuning / Training + +> **RTX 5090:** 24 GB VRAM enables LoRA fine-tuning of 7B–13B models locally +> **Why it matters:** Customize models for your specific use cases — coding style, domain knowledge, voice commands + +--- + +## What Is Fine-Tuning? + +Fine-tuning takes a pre-trained model (like Llama 3.1 8B) and trains it further on your own data to specialize its behavior. Instead of training from scratch (which costs millions), you adjust a small fraction of the model's weights. + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Fine-Tuning vs Prompting │ +│ │ +│ Prompting: "You are a coding assistant for TypeScript..." │ +│ Works OK, but limited by context window │ +│ Model doesn't truly "learn" your preferences │ +│ │ +│ Fine-Tuning: Train on 1000s of your code examples │ +│ Model internalizes your coding patterns │ +│ Better quality, no prompt overhead, faster inference │ +│ │ +│ LoRA: Fine-tune only ~1–5% of parameters │ +│ Needs 16–24 GB VRAM (fits RTX 5090) │ +│ Training time: hours, not days │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +### Why Not on Mac? 
+ +| Aspect | Mac M4 Pro (MPS) | RTX 5090 (CUDA) | +| ----------------- | ------------------------------ | ------------------ | +| Training support | Limited MPS support | Full CUDA + cuDNN | +| Framework support | PyTorch MPS (some ops missing) | Full PyTorch CUDA | +| VRAM for training | ~30 GB usable (shared) | 24 GB dedicated | +| Memory bandwidth | ~273 GB/s | ~1,000+ GB/s | +| Training speed | 5–10× slower than CUDA | Baseline | +| LoRA libraries | Partial compatibility | Full compatibility | + +**Training is compute AND memory bandwidth intensive** — the RTX 5090's ~1 TB/s VRAM bandwidth makes it 5–10× faster than MPS for gradient computation. + +--- + +## Fine-Tuning Methods + +### LoRA (Low-Rank Adaptation) — Recommended + +Trains only small "adapter" matrices (~1–5% of model parameters). The base model stays frozen. + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ LoRA Architecture │ +│ │ +│ Base Model (frozen, ~16 GB) │ +│ ┌─────────────────────────────────────────────┐ │ +│ │ Layer 1: [Attention] [FFN] │ │ +│ │ Layer 2: [Attention] [FFN] │ ← Not modified │ +│ │ ... │ │ +│ │ Layer N: [Attention] [FFN] │ │ +│ └─────────────────────────────────────────────┘ │ +│ ↕ small adapters (rank 8–64) │ +│ ┌─────────────────────────────────────────────┐ │ +│ │ LoRA Adapter A (64 KB per layer) │ ← Trainable │ +│ │ LoRA Adapter B (64 KB per layer) │ ← Trainable │ +│ └─────────────────────────────────────────────┘ │ +│ │ +│ Total trainable params: ~10–50 MB (vs 8–16 GB base) │ +│ VRAM needed: ~18–22 GB for 7B model │ +│ Training time: ~1–4 hours for 1000 examples │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +### QLoRA (Quantized LoRA) — For Larger Models + +Loads the base model in 4-bit quantization, trains LoRA adapters in FP16. Halves memory requirements. 
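A rough weights-only estimate helps explain the VRAM figures for LoRA and QLoRA. The sketch below assumes memory ≈ parameters × bits per weight ÷ 8 and deliberately ignores activations, adapter gradients, optimizer state, and CUDA overhead, so it gives a lower bound rather than a measurement. The function name `weight_memory_gb` is illustrative and does not come from any library.

```python
"""Back-of-envelope estimate of base-model weight memory (weights only)."""


def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Return the memory needed for the frozen base weights, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare FP16 (plain LoRA) with 4-bit (QLoRA) for the three model sizes discussed
    for size in (7, 13, 32):
        fp16 = weight_memory_gb(size, 16)
        int4 = weight_memory_gb(size, 4)
        print(f"{size}B: FP16 weights ~{fp16:.1f} GB, 4-bit weights ~{int4:.1f} GB")
```

Under these assumptions, 4-bit weights take a quarter of the FP16 footprint (about 16 GB for a 32B model). Total training VRAM is higher than the weights alone because of the overheads listed above, which is presumably why the end-to-end saving is closer to a half than to three quarters.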
+ +| Method | 7B Model VRAM | 13B Model VRAM | 32B Model VRAM | +| ----------------- | ------------- | -------------- | -------------- | +| Full fine-tune | ~56 GB | ~104 GB | ~256 GB | +| LoRA (FP16) | ~18 GB | ~32 GB | ~72 GB | +| **QLoRA (4-bit)** | **~8 GB** | **~14 GB** | **~22 GB** | + +**QLoRA at 32B fits in 24 GB VRAM** — this is remarkable for a laptop. + +--- + +## How to Use It + +### Setup: Install Training Libraries + +```bash +# Activate the existing TTS venv (or create a new one) +cd ~/code/mygh/learning_ai_common_plat/__LOCAL_LLMs +python3.12 -m venv .venv-training +source .venv-training/bin/activate + +# Install training stack +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 +pip install transformers datasets accelerate peft bitsandbytes trl +pip install wandb # optional: experiment tracking +``` + +### Example 1: Fine-Tune Llama 3.1 8B for Your Coding Style + +```python +"""Fine-tune Llama 3.1 8B with LoRA on your coding examples.""" +from datasets import Dataset +from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments +from peft import LoraConfig, get_peft_model +from trl import SFTTrainer + +# 1. 
Prepare your data +# Format: instruction → response pairs from YOUR code +training_data = [ + { + "instruction": "Write a Fastify route that creates a user with Zod validation", + "response": """import { z } from 'zod'; +import type { FastifyInstance } from 'fastify'; + +const CreateUserSchema = z.object({ + email: z.string().email(), + name: z.string().min(1).max(100), + productId: z.string(), +}); + +export async function userRoutes(app: FastifyInstance) { + app.post('/users', async (req, reply) => { + const body = CreateUserSchema.parse(req.body); + const user = await req.server.userRepository.create(body); + return reply.status(201).send(user); + }); +}""" + }, + # Add 100–1000 more examples from your actual codebase +] + +dataset = Dataset.from_list([ + {"text": f"### Instruction:\n{d['instruction']}\n\n### Response:\n{d['response']}"} + for d in training_data +]) + +# 2. Load model in 4-bit (QLoRA) +import torch +from transformers import BitsAndBytesConfig +from peft import prepare_model_for_kbit_training + +model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct" +tokenizer = AutoTokenizer.from_pretrained(model_name) +tokenizer.pad_token = tokenizer.eos_token # Llama ships without a pad token +model = AutoModelForCausalLM.from_pretrained( + model_name, + quantization_config=BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.float16, + ), + device_map="auto", +) +# Freeze the quantized base weights and prepare the model for stable adapter training +model = prepare_model_for_kbit_training(model) + +# 3. Configure LoRA +lora_config = LoraConfig( + r=16, # Rank (higher = more capacity, more VRAM) + lora_alpha=32, # Scaling factor + target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], + lora_dropout=0.05, + bias="none", + task_type="CAUSAL_LM", +) + +model = get_peft_model(model, lora_config) +model.print_trainable_parameters() +# trainable params: ~13.6M / total: ~8B (~0.17%) + +# 4. Train +training_args = TrainingArguments( + output_dir="./lora-llama-coding", + num_train_epochs=3, + per_device_train_batch_size=2, + gradient_accumulation_steps=4, + learning_rate=2e-4, + warmup_steps=10, + logging_steps=10, + save_steps=100, + fp16=True, +) + +# Note: newer TRL releases move some of these arguments to SFTConfig and rename +# tokenizer to processing_class; check the docs for your installed version. +trainer = SFTTrainer( + model=model, + train_dataset=dataset, + args=training_args, + tokenizer=tokenizer, + dataset_text_field="text", + max_seq_length=2048, +) + +trainer.train() + +# 5.
Save the adapter (small file, ~50 MB) +model.save_pretrained("./lora-llama-coding") +tokenizer.save_pretrained("./lora-llama-coding") +``` + +### Example 2: Fine-Tune for LysnrAI Voice Commands + +```python +# Train a model to understand your specific voice command patterns +training_data = [ + {"instruction": "Parse: remind me to call john tomorrow at 3pm", + "response": '{"action": "reminder", "contact": "john", "time": "tomorrow 3pm", "task": "call"}'}, + {"instruction": "Parse: add milk to my grocery list", + "response": '{"action": "add_item", "list": "grocery", "item": "milk"}'}, + {"instruction": "Parse: summarize my meeting notes from yesterday", + "response": '{"action": "summarize", "source": "meeting_notes", "date": "yesterday"}'}, + # ... hundreds more examples +] +``` + +### Example 3: Convert LoRA to Ollama Model + +After training, merge the LoRA adapter and convert to GGUF for Ollama: + +```bash +# Merge LoRA adapter back into base model +python -c " +import torch +from peft import AutoPeftModelForCausalLM +from transformers import AutoTokenizer + +model = AutoPeftModelForCausalLM.from_pretrained('./lora-llama-coding', torch_dtype=torch.float16) +merged = model.merge_and_unload() +merged.save_pretrained('./merged-llama-coding') +# The GGUF converter also needs the tokenizer files next to the weights +AutoTokenizer.from_pretrained('./lora-llama-coding').save_pretrained('./merged-llama-coding') +" + +# Convert to GGUF (requires llama.cpp) +# convert_hf_to_gguf.py does not emit K-quants, so export f16 first and quantize in a second step +cd ~/llama.cpp +python convert_hf_to_gguf.py ../merged-llama-coding --outtype f16 --outfile merged-llama-coding-f16.gguf +# Assumes llama.cpp was built with CMake, which places binaries under build/bin +./build/bin/llama-quantize merged-llama-coding-f16.gguf merged-llama-coding-q4_k_m.gguf Q4_K_M + +# Create Ollama model +cat > Modelfile < **RTX 5090:** Full NVIDIA toolchain — CUDA 13.x, cuDNN, TensorRT, Triton +> **Why it matters:** Most ML papers, frameworks, and production systems are CUDA-first. This is the industry-standard GPU compute platform. + +--- + +## What Is the NVIDIA ML Toolchain?
+ +NVIDIA provides a layered stack of tools that turn your GPU into a general-purpose compute engine: + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ NVIDIA ML Toolchain (RTX 5090) │ +│ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Your Code (Python / C++ / TypeScript) │ │ +│ └────────────────────────┬────────────────────────────────────┘ │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ ML Frameworks │ │ +│ │ PyTorch · TensorFlow · JAX · ONNX Runtime │ │ +│ └────────────────────────┬────────────────────────────────────┘ │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ NVIDIA Libraries │ │ +│ │ TensorRT · Triton · cuDNN · cuBLAS · NCCL │ │ +│ └────────────────────────┬────────────────────────────────────┘ │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ CUDA Runtime + Driver │ │ +│ │ CUDA 13.x · GPU scheduling · memory management │ │ +│ └────────────────────────┬────────────────────────────────────┘ │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Hardware │ │ +│ │ RTX 5090: ~10K CUDA cores · 24 GB GDDR7 · Tensor cores │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +### Component Breakdown + +| Component | What It Does | Why You Need It | +| ------------------- | ----------------------------------------------------------- | -------------------------------- | +| **CUDA** | General-purpose GPU programming | Foundation for everything | +| **cuDNN** | Optimized neural network primitives (conv, attention, etc.)
| 2–5× faster training/inference | +| **TensorRT** | Model optimization + inference engine | 2–4× faster deployment inference | +| **Triton** (NVIDIA) | Inference server for serving models at scale | Production model serving | +| **Triton** (OpenAI) | GPU kernel compiler (write custom GPU kernels in Python) | Research + custom ops | +| **cuBLAS** | Optimized matrix multiplication | Core of all neural network math | +| **NCCL** | Multi-GPU communication | Distributed training (future) | + +--- + +## How to Set Up + +### 1. CUDA Toolkit (Inside WSL2) + +```bash +# Check if CUDA is already available (from Windows driver passthrough) +nvidia-smi +nvcc --version # CUDA compiler + +# If nvcc is not found, install CUDA toolkit +# (nvidia-smi works from driver passthrough, but nvcc needs the toolkit) +# Blackwell GPUs such as the RTX 5090 need CUDA 12.8 or newer to compile for their architecture +wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb +sudo dpkg -i cuda-keyring_1.1-1_all.deb +sudo apt update +sudo apt install -y cuda-toolkit-12-8 + +# Add to PATH +echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc +echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc +source ~/.bashrc + +# Verify +nvcc --version +# Should show: CUDA 12.8+ +``` + +### 2. cuDNN + +```bash +# cuDNN is usually bundled with PyTorch, but for custom builds: +sudo apt install -y cudnn9-cuda-12 + +# Verify in Python +python3 -c "import torch; print(f'cuDNN: {torch.backends.cudnn.version()}')" +``` + +### 3. TensorRT + +```bash +# Install TensorRT +pip install tensorrt + +# Or via apt for system-wide +sudo apt install -y tensorrt + +# Verify +python3 -c "import tensorrt; print(f'TensorRT: {tensorrt.__version__}')" +``` + +### 4.
PyTorch with Full CUDA Support + +```bash +# Create a research environment +python3.12 -m venv ~/.venv-ml-research +source ~/.venv-ml-research/bin/activate + +# Install PyTorch with CUDA 12.8 (needed for Blackwell / RTX 50-series GPUs) +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 + +# Verify +python3 -c " +import torch +print(f'PyTorch: {torch.__version__}') +print(f'CUDA available: {torch.cuda.is_available()}') +print(f'GPU: {torch.cuda.get_device_name(0)}') +print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB') +print(f'cuDNN: {torch.backends.cudnn.version()}') +" +``` + +--- + +## How to Use It + +### CUDA Programming Basics (Python) + +```python +"""Your first CUDA program — matrix multiplication on the GPU.""" +import torch +import time + +device = torch.device("cuda") + +# Create two large matrices on the GPU +A = torch.randn(4096, 4096, device=device) +B = torch.randn(4096, 4096, device=device) + +# Warm up +_ = torch.mm(A, B) +torch.cuda.synchronize() + +# Benchmark +start = time.perf_counter() +for _ in range(100): + C = torch.mm(A, B) +torch.cuda.synchronize() +elapsed = time.perf_counter() - start + +tflops = (2 * 4096**3 * 100) / elapsed / 1e12 +print(f"Matrix multiply: {elapsed:.3f}s for 100 iterations") +print(f"Throughput: {tflops:.1f} TFLOPS") +# Peak FP32 on a laptop RTX 5090 is in the tens of TFLOPS, so expect results well below 100 here; +# FP16/BF16 tensor-core matmuls run several times faster +``` + +### TensorRT: Optimize a Model for Faster Inference + +```python +"""Convert a PyTorch model to TensorRT for 2–4× faster inference.""" +import time + +import torch +import torch_tensorrt + +# Load a model +model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True).eval().cuda() + +# Compile with TensorRT +trt_model = torch_tensorrt.compile( + model, + inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float16)], + enabled_precisions={torch.float16}, +) + +# Benchmark +input_tensor = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16) + +# PyTorch baseline +with torch.no_grad(): +
start = time.perf_counter() + for _ in range(1000): + _ = model(input_tensor.float()) + torch.cuda.synchronize() + pytorch_time = time.perf_counter() - start + +# TensorRT optimized +with torch.no_grad(): + start = time.perf_counter() + for _ in range(1000): + _ = trt_model(input_tensor) + torch.cuda.synchronize() + trt_time = time.perf_counter() - start + +print(f"PyTorch: {pytorch_time:.3f}s") +print(f"TensorRT: {trt_time:.3f}s") +print(f"Speedup: {pytorch_time/trt_time:.1f}×") +``` + +### Reproducing ML Research Papers + +Most ML papers provide CUDA-only code. The RTX 5090 lets you run them directly: + +```bash +# Example: Run a recent paper's code +git clone https://github.com/some-researcher/cool-new-model.git +cd cool-new-model + +# Typical requirements +pip install -r requirements.txt + +# Run training/evaluation +python train.py --device cuda --epochs 10 --batch-size 16 + +# This would NOT work on Mac (CUDA-only dependencies) +``` + +### Custom CUDA Kernels with Triton (OpenAI) + +```python +"""Write a custom GPU kernel in Python using Triton.""" +import triton +import triton.language as tl +import torch + +@triton.jit +def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr): + """Simple vector addition kernel running on GPU.""" + pid = tl.program_id(axis=0) + block_start = pid * BLOCK_SIZE + offsets = block_start + tl.arange(0, BLOCK_SIZE) + mask = offsets < n_elements + + x = tl.load(x_ptr + offsets, mask=mask) + y = tl.load(y_ptr + offsets, mask=mask) + output = x + y + tl.store(output_ptr + offsets, output, mask=mask) + +# Run the kernel +n = 1_000_000 +x = torch.randn(n, device="cuda") +y = torch.randn(n, device="cuda") +output = torch.empty_like(x) + +grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),) +add_kernel[grid](x, y, output, n, BLOCK_SIZE=1024) + +print(f"Result correct: {torch.allclose(output, x + y)}") +``` + +--- + +## Real-World Use Cases for Your Projects + +### 1. 
LysnrAI Inference Optimization + +Convert your Whisper or TTS models to TensorRT for even faster inference: + +```python +# Optimize Whisper encoder with TensorRT +# This can give another 2× speedup on top of whisper.cpp CUDA +``` + +### 2. Custom Embedding Models + +Train domain-specific embedding models for LysnrAI's semantic search: + +```python +from sentence_transformers import SentenceTransformer + +model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda") + +# Encode your documents +documents = ["meeting notes about Q1 budget", "grocery list for weekend", ...] +embeddings = model.encode(documents, batch_size=64, show_progress_bar=True) +# ~1000 documents/second on RTX 5090 +``` + +### 3. Reproduce and Experiment with New Models + +When a new paper drops (GPT-5 alternatives, new TTS models, new architectures): + +```bash +# Clone → install → run — no CUDA compatibility issues +git clone https://github.com/new-cool-model +pip install -r requirements.txt +python evaluate.py --device cuda +``` + +### 4. 
Benchmarking and Profiling + +```python +# Profile GPU usage during inference +with torch.profiler.profile( + activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], + with_stack=True, +) as prof: + model(input_tensor) + +print(prof.key_averages().table(sort_by="cuda_time_total")) +``` + +--- + +## Benefits + +| Benefit | Description | +| ------------------------- | --------------------------------------------------------- | +| **Industry standard** | 95%+ of ML research and production uses CUDA | +| **Framework support** | PyTorch, TensorFlow, JAX — all CUDA-first | +| **Paper reproduction** | Run any ML paper's code without compatibility issues | +| **TensorRT optimization** | 2–4× faster inference on optimized models | +| **Custom kernels** | Write GPU code in Python (Triton) or C++ (CUDA) | +| **Profiling tools** | nvidia-smi, Nsight, PyTorch profiler — rich debugging | +| **Production parity** | Code runs identically on cloud GPU instances (A100, H100) | + +--- + +## Skills You'll Learn + +| Skill | What You'll Learn | Career Value | +| ------------------------- | ----------------------------------------------------- | ---------------------------------- | +| **CUDA fundamentals** | GPU memory model, kernel launches, synchronization | Core ML infrastructure skill | +| **TensorRT** | Model optimization, quantization, graph fusion | Production ML deployment | +| **Triton kernels** | Custom GPU programming in Python | Research + performance engineering | +| **PyTorch profiling** | Identifying bottlenecks, optimizing training loops | Essential ML engineering | +| **cuDNN** | Optimized neural network operations | Framework-level understanding | +| **Mixed precision** | FP16/BF16 training, loss scaling, numerical stability | Modern training standard | +| **GPU memory management** | Memory pools, caching allocator, OOM debugging | Practical ML engineering | +| **Model optimization** | Graph optimization, operator fusion, quantization | 
ML deployment pipeline | +| **Benchmark design** | Fair comparison methodology, statistical significance | Research methodology | + +### Career Impact + +CUDA proficiency is one of the most valuable ML engineering skills. Here's where it maps: + +| Role | How CUDA Skills Apply | +| --------------------- | ----------------------------------------------------- | +| **ML Engineer** | Optimize training pipelines, reduce inference latency | +| **AI Infrastructure** | Build and maintain GPU clusters, optimize throughput | +| **Research Engineer** | Implement custom operations for novel architectures | +| **MLOps** | TensorRT deployment, GPU monitoring, autoscaling | +| **Full-Stack AI** | End-to-end: train → optimize → serve → monitor | + +--- + +## Monitoring and Debugging + +```bash +# Real-time GPU monitoring +watch -n 0.5 nvidia-smi + +# Detailed GPU info +nvidia-smi -q + +# GPU process list +nvidia-smi pmon + +# Install nvtop (interactive GPU monitor) +sudo apt install nvtop +nvtop + +# PyTorch memory debugging (memory_summary returns a string, so print it) +python3 -c " +import torch +print(torch.cuda.memory_summary(device=0, abbreviated=True)) +" +``` + +--- + +## Next Steps + +- [ ] Install CUDA toolkit + cuDNN in WSL2 +- [ ] Verify PyTorch CUDA with a matrix multiply benchmark +- [ ] Run a TensorRT optimization on a simple model +- [ ] Write a Triton kernel (vector add → custom attention) +- [ ] Profile an Ollama inference request with nvidia-smi +- [ ] Try reproducing a recent ML paper from GitHub +- [ ] Benchmark: PyTorch vs TensorRT inference speed for Whisper diff --git a/__LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md b/__LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md new file mode 100644 index 00000000..99d33cce --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/06-stable-diffusion-image-gen.md @@ -0,0 +1,325 @@ +# 6.
Stable Diffusion / Image Generation + +> **RTX 5090:** 5–8 seconds per image (SDXL) vs ~30 seconds on Mac +> **Why it matters:** Generate UI mockups, app icons, marketing assets, concept art — all locally and free + +--- + +## What Is Stable Diffusion? + +Stable Diffusion is an open-source text-to-image AI model. You describe what you want in plain English, and it generates a high-quality image in seconds. It runs entirely on your GPU. + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Stable Diffusion Pipeline │ +│ │ +│ "A modern app dashboard with dark theme and blue accents" │ +│ │ │ +│ ▼ │ +│ [CLIP Text Encoder] → text embeddings │ +│ │ │ +│ ▼ │ +│ [U-Net: iterative denoising × 20–50 steps] ← GPU-intensive │ +│ │ │ +│ ▼ │ +│ [VAE Decoder] → pixel image │ +│ │ │ +│ ▼ │ +│ 1024×1024 PNG image │ +│ │ +│ RTX 5090: ~5–8s per image (SDXL) │ +│ Mac M4 Pro: ~25–35s per image (SDXL via MPS) │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Performance: Mac vs Razer + +| Model | Resolution | Mac M4 Pro (MPS) | Razer RTX 5090 (CUDA) | Speedup | +| ---------------- | ---------- | ---------------- | --------------------- | ------- | +| SD 1.5 | 512×512 | ~8–12s | ~1–2s | ~5× | +| SDXL | 1024×1024 | ~25–35s | ~5–8s | ~4× | +| SDXL Turbo | 1024×1024 | ~8–12s | ~1–3s | ~4× | +| FLUX.1 [dev] | 1024×1024 | ~60–90s | ~10–20s | ~5× | +| FLUX.1 [schnell] | 1024×1024 | ~15–25s | ~3–6s | ~4× | + +### VRAM Requirements + +| Model | VRAM Needed | Fits in 24 GB? 
| +| ----------------- | ----------- | -------------- | +| SD 1.5 | ~4 GB | ✅ Easily | +| SDXL | ~7 GB | ✅ Easily | +| SDXL + ControlNet | ~10 GB | ✅ Yes | +| FLUX.1 [dev] (FP8) | ~12 GB | ✅ Yes | +| FLUX.1 (FP8) + LoRA | ~14 GB | ✅ Yes | +| SD3 Medium | ~12 GB | ✅ Yes | + +**24 GB VRAM fits every model listed here. FLUX.1 at full FP16/BF16 precision (~24 GB of weights for its 12B parameters) additionally needs CPU offloading or quantization.** + +--- + +## How to Set Up + +### Option A: ComfyUI (Node-Based — Recommended) + +ComfyUI is a powerful node-based UI for Stable Diffusion. It's flexible, efficient, and well-suited for automation. + +```bash +# Clone ComfyUI +cd ~ +git clone https://github.com/comfyanonymous/ComfyUI.git +cd ComfyUI + +# Create venv and install (RTX 50-series GPUs need the CUDA 12.8 PyTorch wheels) +python3.12 -m venv venv +source venv/bin/activate +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 +pip install -r requirements.txt + +# Download SDXL model (~6.5 GB) +mkdir -p models/checkpoints +cd models/checkpoints +wget "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors" + +# Start ComfyUI +cd ~/ComfyUI +python main.py + +# Open in browser: http://localhost:8188 +# Accessible from Windows browser via WSL2 port forwarding +``` + +### Option B: Automatic1111 (Classic Web UI) + +```bash +# Clone A1111 +cd ~ +git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git +cd stable-diffusion-webui + +# Launch (auto-installs deps on first run) +bash webui.sh --listen --xformers + +# Open: http://localhost:7860 +``` + +### Option C: Python Script (No UI) + +```python +"""Generate images with Stable Diffusion from Python.""" +import torch +from diffusers import StableDiffusionXLPipeline + +# Load SDXL +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") + +# Enable memory optimizations (requires `pip install xformers`; +# without it, PyTorch 2.x already uses its built-in efficient attention) +pipe.enable_xformers_memory_efficient_attention() + +# Generate +image = pipe( + prompt="A sleek dark-themed mobile app dashboard
showing AI brain categories, " + "neon blue and teal accents, glassmorphism cards, modern UI design", + negative_prompt="blurry, low quality, text, watermark", + num_inference_steps=30, + guidance_scale=7.5, + width=1024, + height=1024, +).images[0] + +image.save("dashboard_concept.png") +print("Generated: dashboard_concept.png") +``` + +### Batch Generation Script + +```python +"""Generate multiple variations of an image concept.""" +import torch +from diffusers import StableDiffusionXLPipeline +import time + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") +pipe.enable_xformers_memory_efficient_attention() + +prompts = [ + "App icon for LysnrAI, microphone with sound waves, dark background, modern gradient", + "App icon for MindLyst, brain with neural connections, dark background, blue teal gradient", + "Feature graphic for LysnrAI, voice waveform visualization, dark theme, 1024x500", + "Splash screen, abstract sound waves, dark navy background, teal highlights", + "Dashboard mockup, dark theme, cards with charts, sidebar navigation, modern UI", +] + +for i, prompt in enumerate(prompts): + start = time.time() + image = pipe( + prompt=prompt, + negative_prompt="blurry, low quality, text, watermark, ugly", + num_inference_steps=30, + guidance_scale=7.5, + width=1024, + height=1024, + generator=torch.Generator("cuda").manual_seed(42 + i), + ).images[0] + + elapsed = time.time() - start + filename = f"generated_{i:02d}.png" + image.save(filename) + print(f"[{i+1}/{len(prompts)}] {filename} ({elapsed:.1f}s)") +``` + +--- + +## Real-World Use Cases for Your Projects + +### 1. 
App Store Assets + +Generate icon concepts, feature graphics, and screenshot backgrounds: + +```python +# LysnrAI app icon variations +prompts = [ + "Minimal app icon, single microphone, dark navy background, glowing teal outline, flat design", + "App icon, sound wave forming a brain shape, dark background, blue to teal gradient", + "App icon, headphones with AI sparkles, dark background, modern glassmorphism", +] +``` + +### 2. UI/UX Mockup Exploration + +Rapidly prototype visual ideas before coding: + +```python +# Generate dashboard layout concepts +prompt = """ +Web dashboard design, dark theme (#06070A background), sidebar navigation, +main content area with 3 cards showing brain categories, +teal and blue accent colors, modern glassmorphism, +high fidelity UI design, clean typography +""" +``` + +### 3. Marketing and Social Media + +```python +# Blog post hero images +prompt = "Abstract AI visualization, neural network nodes connected by light beams, dark background, blue and teal colors, cinematic lighting" + +# Social media posts +prompt = "Infographic style, voice AI assistant concept, microphone with sound waves transforming into text, dark modern design" +``` + +### 4. Concept Art for Features + +```python +# Visualize MindLyst "Brains" concept +prompts = [ + "Digital brain labeled 'Work', organized files and charts floating around it, dark theme, blue glow", + "Digital brain labeled 'Health', fitness and medical icons floating around it, dark theme, green glow", + "Digital brain labeled 'Finance', money and graphs floating around it, dark theme, gold glow", +] +``` + +### 5. 
ControlNet (Image-Guided Generation) + +Use an existing image as a structural guide: + +```python +from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel + +# Load ControlNet for edge-guided generation +controlnet = ControlNetModel.from_pretrained( + "diffusers/controlnet-canny-sdxl-1.0", + torch_dtype=torch.float16, +).to("cuda") + +# Use your existing dashboard screenshot as a structural guide +# Generate a redesigned version with new visual style +``` + +--- + +## Benefits + +| Benefit | Description | +| --------------------------- | ---------------------------------------------------------------- | +| **Speed** | 5–8s per image (vs 30s on Mac or waiting for cloud APIs) | +| **Cost** | $0 per image (vs $0.02–0.08 per image for DALL-E 3 / Midjourney) | +| **Privacy** | Generated locally — no images sent to cloud | +| **Control** | Full parameter control: steps, guidance, seed, resolution | +| **Reproducibility** | Same seed = same image every time | +| **Customization** | LoRA fine-tuning for brand-specific styles | +| **Batch capability** | Generate hundreds of variations overnight | +| **No content restrictions** | No cloud content policies limiting your output | + +### Cost Comparison (100 images) + +| Method | Cost | Time | +| -------------------------------- | ------------ | -------------- | +| DALL-E 3 API | $4–8 | ~5 min (cloud) | +| Midjourney | $10–30/month | ~5 min (cloud) | +| Mac M4 Pro (SDXL, local) | $0 | ~40–55 min | +| **Razer RTX 5090 (SDXL, local)** | **$0** | **~8–13 min** | + +--- + +## Skills You'll Learn + +| Skill | What You'll Learn | Career Value | +| ------------------------- | --------------------------------------------------- | ---------------------------- | +| **Diffusion models** | How iterative denoising generates images from noise | Core generative AI knowledge | +| **Prompt engineering** | Crafting effective text prompts for visual output | Universal AI skill | +| **ControlNet** | Structural guidance for image 
generation | Advanced image AI | +| **LoRA training** | Fine-tuning image models for specific styles | Generative AI customization | +| **ComfyUI workflows** | Node-based visual programming for AI pipelines | Production image generation | +| **Image post-processing** | Upscaling, inpainting, outpainting techniques | Digital content creation | +| **VRAM optimization** | Model offloading, attention optimization, tiling | GPU memory management | +| **Batch automation** | Scripting large-scale image generation | Production engineering | +| **Model selection** | SD 1.5 vs SDXL vs FLUX — trade-offs | Practical AI judgment | + +--- + +## Advanced: FLUX.1 (Latest Generation) + +FLUX.1 is a newer open-weight image model family from Black Forest Labs (founded by former Stability AI researchers). It generally follows prompts more closely and produces higher-quality images than SDXL. + +```bash +# FLUX.1 [schnell] — fast, 4 steps +# FLUX.1 [dev] — high quality, 50 steps + +# The 12B transformer alone is ~24 GB in BF16, so offload idle components to CPU RAM +python3 -c " +from diffusers import FluxPipeline +import torch + +pipe = FluxPipeline.from_pretrained( + 'black-forest-labs/FLUX.1-schnell', + torch_dtype=torch.bfloat16, +) +pipe.enable_model_cpu_offload() + +# schnell is distilled for 4 steps and does not use classifier-free guidance +image = pipe( + 'A futuristic AI assistant interface, holographic UI, dark theme', + num_inference_steps=4, + guidance_scale=0.0, +).images[0] +image.save('flux_test.png') +" +``` + +--- + +## Next Steps + +- [ ] Install ComfyUI in WSL2 and verify CUDA acceleration +- [ ] Download SDXL base model and generate first test image +- [ ] Generate app icon concepts for LysnrAI and MindLyst +- [ ] Try ControlNet with an existing dashboard screenshot +- [ ] Experiment with FLUX.1 [schnell] for higher quality +- [ ] Create a batch script for generating marketing assets +- [ ] Build a style LoRA trained on your brand colors/aesthetic diff --git a/__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md b/__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md new file mode 100644 index 00000000..3648d0cf --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/07-multi-gpu-workloads.md @@ -0,0 +1,373 @@ +# 7.
Multi-GPU Workloads (Future Path) + +> **RTX 5090:** Your first serious CUDA GPU — a stepping stone to multi-GPU and cloud GPU workflows +> **Why it matters:** The skills, code, and workflows you build on one GPU translate directly to multi-GPU and cloud infrastructure + +--- + +## Why Think About Multi-GPU Now? + +You don't need multiple GPUs today. But learning on a single RTX 5090 builds skills that scale directly: + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ GPU Scaling Path │ +│ │ +│ TODAY │ +│ ┌─────────────────────────────────┐ │ +│ │ RTX 5090 Laptop (24 GB VRAM) │ ← You are here │ +│ │ Single GPU, local inference │ │ +│ │ Fine-tuning up to 13B models │ │ +│ └────────────────┬────────────────┘ │ +│ │ │ +│ NEAR FUTURE ▼ │ +│ ┌─────────────────────────────────┐ │ +│ │ Desktop + eGPU or 2× GPU tower │ │ +│ │ 48–96 GB total VRAM │ │ +│ │ Fine-tune 70B models │ │ +│ │ Run 2–3 models simultaneously │ │ +│ └────────────────┬────────────────┘ │ +│ │ │ +│ CLOUD / PROD ▼ │ +│ ┌─────────────────────────────────┐ │ +│ │ Cloud GPU instances │ │ +│ │ A100/H100 × 2–8 (80–640 GB) │ │ +│ │ Train large models │ │ +│ │ Serve at scale │ │ +│ └─────────────────────────────────┘ │ +│ │ +│ SAME CUDA code works at every level ↑ │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## What Multi-GPU Enables + +| Capability | Single GPU (24 GB) | 2× GPU (48 GB) | 4× GPU (96 GB) | Cloud (640 GB) | +| --------------------- | ------------------ | --------------- | --------------- | -------------- | +| 7B inference | ✅ Very fast | ✅ + concurrent | ✅ + concurrent | ✅ at scale | +| 32B inference | ✅ Fits in VRAM | ✅ Very fast | ✅ Very fast | ✅ at scale | +| 70B inference | ⚠️ Partial GPU | ✅ Full GPU | ✅ Very fast | ✅ at scale | +| 7B fine-tune (QLoRA) | ✅ Comfortable | ✅ Faster | ✅ Faster | ✅ Fastest | +| 13B fine-tune (QLoRA) | ✅ Fits | ✅ Comfortable | ✅ Fast | ✅ Fastest | +| 70B fine-tune (QLoRA) | ❌ OOM | 
⚠️ Tight | ✅ Fits | ✅ Comfortable | +| 7B full fine-tune | ❌ OOM | ⚠️ Tight | ✅ Fits | ✅ Comfortable | +| Serve 3+ models | ❌ VRAM limit | ✅ Yes | ✅ Yes | ✅ Yes | +| SDXL + LLM concurrent | ⚠️ Tight | ✅ Yes | ✅ Yes | ✅ Yes | + +--- + +## Expansion Options + +### Option 1: eGPU (Thunderbolt/USB4) + +Connect an external GPU to your Razer Blade via Thunderbolt 4: + +``` +┌──────────────────────────────────┐ Thunderbolt 4 ┌──────────────────────┐ +│ Razer Blade 18 │◄═══(~40 Gbps)════════►│ eGPU Enclosure │ +│ RTX 5090 (24 GB) — internal │ │ RTX 4090 (24 GB) │ +│ │ │ or RTX 5090 (32 GB) │ +└──────────────────────────────────┘ └──────────────────────┘ + +Total VRAM: 48 GB (24 + 24) with an RTX 4090, or 56 GB (24 + 32) with a desktop RTX 5090 +Limitation: Thunderbolt bandwidth (~40 Gbps) is far slower than a PCIe 5.0 x16 slot (~500 Gbps) +Best for: Model serving (latency-tolerant), not training (bandwidth-sensitive) +``` + +**Recommended eGPU enclosures:** + +| Enclosure | Price | GPU Support | +|-----------|-------|-------------| +| Razer Core X | ~$300 | Full-length desktop GPUs | +| Sonnet Breakaway | ~$250 | Most desktop GPUs | +| Akitio Node | ~$200 | Compact form factor | + +### Option 2: Desktop Build (Maximum Performance) + +Build a dedicated GPU workstation: + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Desktop GPU Workstation │ +│ │ +│ Motherboard: ASUS/MSI with 2–4× PCIe 5.0 x16 slots │ +│ CPU: Intel i9 or AMD Ryzen 9 (enough PCIe lanes) │ +│ RAM: 128 GB DDR5 │ +│ GPU 1: RTX 5090 (32 GB GDDR7) │ +│ GPU 2: RTX 5090 (32 GB GDDR7) │ +│ Total VRAM: 64 GB │ +│ PSU: 1600W+ (two desktop 5090s can draw ~1,150W under load) │ +│ │ +│ Cost: ~$5,000–7,000 │ +│ Performance: Near-datacenter for most workloads │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +### Option 3: Cloud GPU (On-Demand) + +Rent GPU time when you need it: + +| Provider | GPU | VRAM | Price/Hour | Best For | +| ----------- | --------- | ------ | ---------- | -------------------- | +| Lambda Labs | A100 80GB | 80 GB |
~$1.10 | Training | +| RunPod | A100 80GB | 80 GB | ~$1.64 | Training + inference | +| Vast.ai | RTX 4090 | 24 GB | ~$0.30 | Budget inference | +| AWS (p4d) | A100 ×8 | 640 GB | ~$32 | Large-scale training | +| Together AI | H100 | 80 GB | ~$2.50 | Fine-tuning API | + +**Your RTX 5090 code runs identically on cloud GPUs** — same PyTorch, same CUDA. + +--- + +## How to Prepare Now (Single GPU) + +### 1. Write GPU-Agnostic Code + +Structure your code so it works with any number of GPUs: + +```python +"""GPU-agnostic model loading — works with 1 or N GPUs.""" +import torch + +def get_device(): + """Select the best available device.""" + if torch.cuda.is_available(): + # Multi-GPU: use DataParallel or DistributedDataParallel + if torch.cuda.device_count() > 1: + print(f"Using {torch.cuda.device_count()} GPUs") + return "cuda" # PyTorch handles multi-GPU distribution + return "cuda:0" + elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): + return "mps" + return "cpu" + +device = get_device() +model = MyModel().to(device) + +# Wrap for multi-GPU (no-op on single GPU) +if torch.cuda.device_count() > 1: + model = torch.nn.DataParallel(model) +``` + +### 2. Learn Model Parallelism Concepts + +```python +"""Pipeline parallelism — split model layers across GPUs.""" +# Example: split a large model across 2 GPUs + +# GPU 0: layers 0–15 +# GPU 1: layers 16–31 + +# With Hugging Face Accelerate: +from accelerate import init_empty_weights, load_checkpoint_and_dispatch + +with init_empty_weights(): + model = AutoModelForCausalLM.from_config(config) + +model = load_checkpoint_and_dispatch( + model, + checkpoint="./model-weights", + device_map="auto", # Automatically distributes across available GPUs +) +``` + +### 3. 
Ollama Multi-GPU (Already Supported) + +Ollama can split a single large model across multiple GPUs: + +```bash +# With 2 GPUs, Ollama detects both and splits the model's layers across them +# e.g. a 70B Q4 model (~40 GB) spans both 24 GB GPUs; layers that do not fit fall back to CPU RAM + +# Or restrict Ollama to specific GPUs +CUDA_VISIBLE_DEVICES=0,1 ollama serve + +# Check GPU allocation +nvidia-smi # Shows VRAM usage per GPU +``` + +### 4. vLLM (High-Throughput Inference Server) + +```bash +# vLLM supports tensor parallelism across GPUs +pip install vllm + +# Serve a model across 2 GPUs +# Note: the unquantized 70B weights are ~140 GB in 16-bit; on 2× 24 GB GPUs use a quantized checkpoint or a smaller model +python -m vllm.entrypoints.openai.api_server \ + --model meta-llama/Meta-Llama-3.1-70B-Instruct \ + --tensor-parallel-size 2 \ + --gpu-memory-utilization 0.9 + +# API compatible with OpenAI format (the JSON content type header is required) +curl http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", + "prompt": "Hello", + "max_tokens": 100 +}' +``` + +--- + +## Scaling Scenarios for Your Projects + +### Scenario 1: LysnrAI at Scale + +``` +TODAY (1× RTX 5090): + - 1 user, local inference + - Whisper + Ollama + TTS, one at a time + +FUTURE (2× GPU desktop): + - GPU 0: Ollama (always-on coding model) + - GPU 1: Whisper + TTS (dedicated) + - Run all 3 workloads concurrently + +PRODUCTION (Cloud): + - vLLM serving on A100 + - Whisper on dedicated GPU + - TTS on dedicated GPU + - Handles 100+ concurrent users +``` + +### Scenario 2: Fine-Tuning Larger Models + +``` +TODAY (24 GB VRAM): + - QLoRA 7B–13B models + - Training time: 1–4 hours + +FUTURE (48+ GB VRAM): + - QLoRA 70B models + - LoRA FP16 32B models + - Training time: 4–12 hours + +CLOUD (80+ GB VRAM): + - Full fine-tune 7B–13B models + - QLoRA 70B+ models + - Training time: hours with H100 +``` + +### Scenario 3: Image + Text Generation Pipeline + +``` +TODAY (1× GPU): + - SDXL OR LLM, not both at once (VRAM constraint) + +FUTURE (2× GPU): + - GPU 0: LLM (Ollama, 32B model, ~19 GB) + - GPU 1: SDXL + ControlNet (~10 GB) + - Generate images guided by LLM descriptions + +PRODUCTION: + - 
Automated content pipeline: + LLM writes description → SDXL generates image → TTS adds voiceover +``` + +--- + +## Benefits of Starting Single-GPU + +| Benefit | Description | +| ------------------------ | ------------------------------------------------------------------- | +| **Code portability** | Single-GPU PyTorch/CUDA code is the starting point for multi-GPU versions | +| **Skill foundation** | Memory management, profiling, optimization skills transfer directly | +| **Cost efficiency** | Learn on hardware you already own before renting cloud ($$$) | +| **Workflow development** | Build training pipelines, inference servers, batch scripts now | +| **Hardware literacy** | Understand VRAM limits, bandwidth, PCIe bottlenecks | + +--- + +## Skills You'll Build Toward + +| Skill | Single GPU (Now) | Multi-GPU (Future) | Career Impact | +| ------------------------ | ------------------ | ---------------------------- | ----------------------- | +| **CUDA programming** | Kernels, memory | NCCL, all-reduce | ML Infrastructure | +| **Model parallelism** | Understand concept | Implement tensor/pipeline | Senior ML Engineer | +| **Distributed training** | Data loading | Multi-node coordination | ML Platform Engineer | +| **Inference serving** | Ollama, local API | vLLM, Triton, load balancing | MLOps / Production | +| **GPU monitoring** | nvidia-smi, nvtop | Cluster monitoring | DevOps / SRE | +| **Cost optimization** | VRAM budget | Spot instances, right-sizing | FinOps / ML Engineering | +| **Batch scheduling** | Cron jobs | Kubernetes, Slurm | ML Platform | + +### Learning Path + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ Skills Progression │ +│ │ +│ Level 1 (Now: RTX 5090 Single GPU) │ +│ ├── PyTorch + CUDA basics │ +│ ├── Ollama model serving │ +│ ├── QLoRA fine-tuning 7B models │ +│ ├── nvidia-smi monitoring │ +│ └── TensorRT basic optimization │ +│ │ +│ Level 2 (6 months: Same GPU, deeper skills) │ +│ ├── Custom Triton kernels │ +│ ├── vLLM inference 
server │ +│ ├── Advanced quantization (AWQ, GPTQ) │ +│ ├── Profiling and optimization │ +│ └── Model merging and adapter stacking │ +│ │ +│ Level 3 (12 months: Multi-GPU or Cloud) │ +│ ├── Multi-GPU inference (tensor parallelism) │ +│ ├── Distributed training (DDP, FSDP) │ +│ ├── Cloud GPU workflow (Lambda, RunPod) │ +│ ├── Production serving with autoscaling │ +│ └── NCCL and multi-node communication │ +│ │ +│ Level 4 (Future: Production ML) │ +│ ├── Kubernetes + GPU scheduling │ +│ ├── Model serving at scale (thousands of requests/sec) │ +│ ├── Training pipelines with experiment tracking │ +│ ├── Custom model architectures │ +│ └── Open-source ML contributions │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Cost Planning + +### Single GPU (Current) + +| Item | Cost | Status | +| ---------------------------------- | ----------- | --------- | +| Razer Blade 18 RTX 5090 | $5,200 | Purchased | +| Electricity (~200W avg, 8 hrs/day) | ~$15/month | Ongoing | +| **Total first year** | **~$5,380** | | + +### Desktop Upgrade (Future) + +| Item | Estimated Cost | +| -------------------------- | -------------- | +| Desktop tower + PSU + RAM | ~$1,500 | +| RTX 5090 desktop GPU | ~$2,000 | +| Second RTX 5090 (optional) | ~$2,000 | +| **Total (2× GPU desktop)** | **~$5,500** | + +### Cloud Alternative (Per-Use) + +| Usage Pattern | Monthly Cost | +| ----------------------- | ------------ | +| 10 hours/month on A100 | ~$11 | +| 40 hours/month on A100 | ~$44 | +| 160 hours/month on A100 | ~$176 | +| Always-on A100 | ~$792 | + +**Break-even vs desktop:** ~$5,500 buys about 5,000 A100-hours at ~$1.10/hour, so break-even is ~31 months at 160 hours/month and ~7 months always-on. At 40 hours/month (~$44), cloud stays cheaper for roughly a decade (electricity not included). 
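
A break-even estimate can be checked directly against the two cost tables above. The following is a minimal sketch, assuming the ~$5,500 desktop total and the ~$1.10/hour A100 rate implied by the cloud table (10 hours for ~$11); the function name `break_even_months` is illustrative, and electricity, resale value, and price changes are ignored.

```python
"""Cloud vs. desktop break-even, using the estimates from the cost tables."""


def break_even_months(hardware_cost: float,
                      cloud_rate_per_hour: float,
                      hours_per_month: float) -> float:
    """Months of cloud rental whose total cost equals the one-time hardware cost."""
    if cloud_rate_per_hour <= 0 or hours_per_month <= 0:
        raise ValueError("rate and hours must be positive")
    return hardware_cost / (cloud_rate_per_hour * hours_per_month)


# Assumed inputs, taken from the tables (all are rough estimates).
DESKTOP_COST = 5_500   # USD, 2x GPU desktop total
A100_RATE = 1.10       # USD per hour

# Usage patterns from the cloud table; 720 h is an always-on 30-day month.
for hours in (10, 40, 160, 720):
    months = break_even_months(DESKTOP_COST, A100_RATE, hours)
    print(f"{hours:>4} h/month -> {months:6.1f} months to break even")
```

Swapping in a different provider's hourly rate or your own hardware quote shows how sensitive the result is to monthly usage.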
+ +--- + +## Next Steps + +- [ ] Write all training and inference scripts to be GPU-count agnostic +- [ ] Install and test vLLM on single GPU with Llama 3.1 8B +- [ ] Practice monitoring GPU memory and compute utilization +- [ ] Try model offloading: run a 70B model with partial CPU/GPU split +- [ ] Explore Lambda Labs or RunPod for a cloud GPU test ($5–10 experiment) +- [ ] Benchmark single GPU throughput to establish a baseline for comparison diff --git a/__LOCAL_LLMs/windows_specific/capabilities/README.md b/__LOCAL_LLMs/windows_specific/capabilities/README.md new file mode 100644 index 00000000..72dd962d --- /dev/null +++ b/__LOCAL_LLMs/windows_specific/capabilities/README.md @@ -0,0 +1,52 @@ +# RTX 5090 Capabilities — Deep Dive Guides + +> What you can do with the Razer Blade 18's RTX 5090 (24 GB GDDR7) that you can't (or can't do well) on the Mac. + +Each guide covers: **what it is → how to use it → real-world use cases → benefits → skills you'll learn.** + +--- + +## Guides + +| # | Capability | Key Benefit | Skill Level | +| --------------------------------------- | --------------------------------- | ------------------------------------ | ------------ | +| [01](01-gpu-inference-speed.md) | **GPU Inference Speed** | 2–4× faster LLM responses | Beginner | +| [02](02-whisper-batch-transcription.md) | **Whisper Batch Transcription** | Hours of audio in minutes | Beginner | +| [03](03-tts-generation-at-scale.md) | **TTS Generation at Scale** | Faster-than-realtime voice synthesis | Beginner | +| [04](04-fine-tuning-training.md) | **Fine-Tuning / Training** | Customize models on your own data | Intermediate | +| [05](05-cuda-tensorrt-ml-research.md) | **CUDA / TensorRT / ML Research** | Full NVIDIA ML toolchain | Intermediate | +| [06](06-stable-diffusion-image-gen.md) | **Stable Diffusion / Image Gen** | 5–8s per image, unlimited free | Beginner | +| [07](07-multi-gpu-workloads.md) | **Multi-GPU Workloads (Future)** | Scaling path to production | Advanced | + 
+--- + +## Suggested Learning Order + +``` +Week 1: 01 (Inference) → 02 (Whisper) → 03 (TTS) + Get familiar with the GPU, benchmark your models + +Week 2: 06 (Stable Diffusion) + Set up ComfyUI, generate app assets + +Week 3: 04 (Fine-Tuning) + QLoRA your first 7B model on your own code + +Week 4: 05 (CUDA / TensorRT) + Deeper GPU programming, profiling, optimization + +Ongoing: 07 (Multi-GPU) + Reference as you plan scaling +``` + +--- + +## Prerequisites + +All guides assume you've completed the [Windows setup](../setup-guide.md): + +- NVIDIA drivers installed (Windows) +- Ollama installed and running (Windows) +- WSL2 Ubuntu 24.04 set up +- Repo cloned, `setup-tts.sh` completed +- Dashboard running at `http://localhost:3000`
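
The prerequisites above can be spot-checked before starting a guide. This is a minimal sketch, assuming `nvidia-smi` and `ollama` are on the `PATH` of the shell you run it from and that the dashboard listens on port 3000 as listed; the helper names `tool_on_path` and `port_open` are illustrative and not part of the repo.

```python
"""Quick prerequisite check (illustrative; adjust names and ports to your setup)."""
import shutil
import socket


def tool_on_path(name: str) -> bool:
    """True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    checks = {
        "nvidia-smi on PATH": tool_on_path("nvidia-smi"),
        "ollama on PATH": tool_on_path("ollama"),
        "dashboard on localhost:3000": port_open("localhost", 3000),
    }
    for label, ok in checks.items():
        print(f"[{'ok' if ok else 'missing'}] {label}")
```

A "missing" result only means the check failed from the current shell; tools installed on Windows may not be visible inside WSL2 and vice versa.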