
# 🎬 YouTube Transcript to Numbered Files

Scripts that download a YouTube video's transcript and save each caption segment to its own numbered file (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).

## ✅ What Was Fixed

The original script wrote all captions to a single `captions.txt` file. Now it:
- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing
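
The core of the change can be sketched as below. This is a minimal illustration, not the script's actual code: the helper name `write_segments` is hypothetical, and it assumes each segment is a dict with a `"text"` key (some versions of `youtube-transcript-api` return snippet objects instead, in which case `segment.text` would be needed).

```python
from pathlib import Path


def write_segments(segments, output_dir="captions"):
    """Write each segment's text to cc1.txt, cc2.txt, ... inside output_dir.

    Assumption: every segment is a dict with a "text" key.
    Returns the list of paths written, in order.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # dedicated output directory

    written = []
    for index, segment in enumerate(segments, start=1):  # numbering starts at 1
        path = out / f"cc{index}.txt"
        path.write_text(segment["text"], encoding="utf-8")
        written.append(path)
    return written
```

Starting `enumerate` at 1 keeps the file names aligned with the `cc1.txt`, `cc2.txt` convention rather than Python's zero-based indexing.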

## 📁 Available Scripts

### 1. `transcribe_yt_video.py` (Fixed Original)
The minimal fixed version of the original script.

```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ"  # Replace with your video ID
```

### 2. `enhanced_yt_transcript.py` (Recommended)
Full-featured script with command-line interface and error handling.
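
A command-line interface of this kind could be built with `argparse` roughly as follows. This is a sketch under assumptions: the positional name `video`, the function name `build_parser`, and the `["en"]` default for `--languages` are illustrative and not taken from the script. The `captions` default matches the output directory shown elsewhere in this README.

```python
import argparse


def build_parser():
    """Build a parser with a positional video argument plus --output and --languages."""
    parser = argparse.ArgumentParser(
        description="Save YouTube captions to numbered files."
    )
    # Assumption: a single positional argument holds either an ID or a URL.
    parser.add_argument("video", help="YouTube video ID or full URL")
    parser.add_argument(
        "--output", default="captions", help="directory for the cc#.txt files"
    )
    # Assumption: the default language list is illustrative only.
    parser.add_argument(
        "--languages", nargs="+", default=["en"],
        help="preferred language codes, most preferred first",
    )
    return parser
```

For example, `build_parser().parse_args(["dQw4w9WgXcQ", "--languages", "en", "es"])` yields a namespace whose `languages` attribute is the list `["en", "es"]`.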

## 🚀 Usage

### Quick Start (Fixed Original Script)
```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```

### Advanced Usage (Enhanced Script)
```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
```
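
Accepting either a bare ID or a full URL means the input has to be normalized first. One way to do that is sketched here; `extract_video_id` is a hypothetical helper, and the 11-character ID pattern and the list of recognized hosts are assumptions based on how YouTube links commonly look, not on the script's source.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters drawn from letters, digits, "-" and "_".
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")
# Assumption: only these hosts are treated as YouTube watch pages.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(value):
    """Return the video ID from a bare ID or a common YouTube URL, else None."""
    value = value.strip()
    if VIDEO_ID_PATTERN.fullmatch(value):
        return value  # already a bare ID

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    candidate = ""
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS and parsed.path == "/watch":
        # Standard links: https://www.youtube.com/watch?v=<id>
        candidate = parse_qs(parsed.query).get("v", [""])[0]

    return candidate if VIDEO_ID_PATTERN.fullmatch(candidate) else None
```

Returning `None` for unrecognized input lets the caller print a helpful message instead of passing a malformed ID to the transcript API.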

## 📊 Output Structure

After running, you'll get:
```
captions/
├── cc1.txt          # First caption segment
├── cc2.txt          # Second caption segment
├── cc3.txt          # Third caption segment
├── ...
├── cc150.txt        # Last segment (example)
└── summary.txt      # Summary information
```

Each `cc#.txt` file contains just the text from that caption segment.
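
When reading the files back, note that a plain alphabetical sort puts `cc10.txt` before `cc2.txt`. The sketch below sorts by the number in the file name instead; `read_segments_in_order` is a hypothetical helper, not part of either script.

```python
import re
from pathlib import Path


def read_segments_in_order(directory="captions"):
    """Return the text of cc1.txt, cc2.txt, ... sorted by their numeric suffix."""
    pattern = re.compile(r"cc(\d+)\.txt")
    numbered = []
    for path in Path(directory).glob("cc*.txt"):
        match = pattern.fullmatch(path.name)
        if match:  # skips anything that is not cc<number>.txt
            numbered.append((int(match.group(1)), path))
    # Sorting the (number, path) pairs orders the files numerically.
    return [path.read_text(encoding="utf-8") for _, path in sorted(numbered)]
```

Joining the returned list with spaces or newlines reconstructs the full transcript in its original order.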

## 🔧 Features

### Fixed Original Script
- ✅ **Separate files** for each caption segment
- ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
- ✅ **UTF-8 encoding** for international characters
- ✅ **Progress feedback** showing what's being written

### Enhanced Script
- ✅ **Command-line interface** - no need to edit code
- ✅ **URL parsing** - accepts YouTube URLs or video IDs
- ✅ **Language selection** - prefer specific languages
- ✅ **Error handling** - graceful failures with helpful messages
- ✅ **Progress tracking** - shows processing status
- ✅ **Summary file** - metadata about the download
- ✅ **Directory cleanup** - removes old files before new download
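
The directory cleanup step might look like the sketch below. The README does not say exactly which files the enhanced script removes, so this version makes a conservative assumption: it deletes only `cc<number>.txt` and `summary.txt` and leaves everything else in place. The name `clean_output_dir` is hypothetical.

```python
import re
from pathlib import Path


def clean_output_dir(output_dir="captions"):
    """Delete cc<number>.txt and summary.txt from output_dir; return how many were removed.

    Assumption: only files this tool generates are removed, so unrelated
    files in the same directory are left untouched.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    removed = 0
    for path in out.iterdir():
        is_caption = re.fullmatch(r"cc\d+\.txt", path.name) is not None
        if path.is_file() and (is_caption or path.name == "summary.txt"):
            path.unlink()
            removed += 1
    return removed
```

Cleaning before writing avoids stale files: if a previous video had 200 segments and the new one has 150, `cc151.txt` onward would otherwise survive from the old run.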

## 📋 Requirements

Install the required package:
```bash
pip install youtube-transcript-api
```

## 💡 Usage Examples

### Example 1: Educational Video
```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```

### Example 2: Multi-language Content
```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```
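
The order of `--languages` is meant as a preference order. How the enhanced script resolves it is not shown here, but the idea can be sketched as picking the first preferred code that the video actually offers. `pick_language` is a hypothetical helper operating on plain lists of language codes.

```python
def pick_language(available, preferred):
    """Return the first code in preferred that also appears in available, else None."""
    available_set = set(available)
    for code in preferred:  # earlier entries win
        if code in available_set:
            return code
    return None
```

With `--languages en es`, a video that only has Spanish captions would resolve to `es`, while a video with neither language would resolve to `None` and should be reported as having no matching transcript.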

### Example 3: Quick Processing
```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.
```

## 🔍 Output Preview

When the script runs, you'll see:
```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
  📄 cc10.txt: Never gonna give you up, never gonna let you...
  📄 cc20.txt: We've known each other for so long...
  📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Files saved in: /full/path/to/captions/
📋 Summary saved to: captions/summary.txt
```

## 🛠️ Troubleshooting

### Common Issues:

1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions

2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video

3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```
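
To catch the missing-module case before the script fails partway through, a small check like the following could run at startup. This is a sketch with hypothetical helper names (`is_installed`, `require`); the import name `youtube_transcript_api` is the module installed by the `youtube-transcript-api` package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name, pip_name):
    """Exit with an install hint when the module is missing."""
    if not is_installed(module_name):
        raise SystemExit(
            f"Missing dependency '{module_name}'. Install it with: pip install {pip_name}"
        )


# Typical call at the top of a script:
# require("youtube_transcript_api", "youtube-transcript-api")
```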

### Testing:
Use the test script to verify everything works:
```bash
python test_transcript.py
```
(Remember to replace `TEST_VIDEO_ID` with a real video ID)

## 📝 File Contents Example

**cc1.txt:**
```
Welcome to this tutorial
```

**cc2.txt:**
```
Today we'll be learning about
```

**cc3.txt:**
```
the basics of programming
```

**summary.txt:**
```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: cc1.txt to cc156.txt
Generated: enhanced_yt_transcript.py
```
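
A summary in this format could be produced with a few lines like the sketch below. The field labels are copied from the example above; the function name `write_summary` and its parameters are hypothetical.

```python
from pathlib import Path


def write_summary(output_dir, video_id, total_segments,
                  generator="enhanced_yt_transcript.py"):
    """Write summary.txt in the four-line format shown above and return its path."""
    lines = [
        f"YouTube Video ID: {video_id}",
        f"Total segments: {total_segments}",
        f"Files: cc1.txt to cc{total_segments}.txt",
        f"Generated: {generator}",
    ]
    path = Path(output_dir) / "summary.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```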

## 🎯 Perfect For:

- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually

---

**Each caption segment is now written to its own `cc#.txt` file.** 🎉