# 🎬 YouTube Transcript to Numbered Files
Scripts that download a YouTube video's transcript and save each caption segment to its own numbered file (`cc1.txt`, `cc2.txt`, `cc3.txt`, and so on).
## ✅ What Was Fixed
The original script wrote all captions to a single `captions.txt` file. Now it:
- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing
## 📁 Available Scripts
### 1. `transcribe_yt_video.py` (Fixed Original)
A minimally changed version of the original script. Set `video_id` in the source before running it.
```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ" # Replace with your video ID
```
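The core of the fix, one file per segment, can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `write_segments` is hypothetical, and it assumes each segment is a dict with a `"text"` key. Older releases of `youtube-transcript-api` return that shape; newer releases expose segment objects with a `.text` attribute instead, so adapt the lookup to the version you have installed.

```python
from pathlib import Path


def write_segments(segments, out_dir="captions"):
    """Write each segment's text to cc1.txt, cc2.txt, ... inside out_dir.

    `segments` is assumed to be a list of dicts with a "text" key.
    Returns the list of paths that were written, in order.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for number, segment in enumerate(segments, start=1):
        target = out_path / f"cc{number}.txt"
        # UTF-8 keeps non-ASCII caption text intact.
        target.write_text(segment["text"].strip(), encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_segments([{"text": "Welcome to this tutorial"}])` would create `captions/cc1.txt` containing that one line of text.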
### 2. `enhanced_yt_transcript.py` (Recommended)
Full-featured script with command-line interface and error handling.
## 🚀 Usage
### Quick Start (Fixed Original Script)
```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```
### Advanced Usage (Enhanced Script)
```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ
# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions
# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr
# Get help
python enhanced_yt_transcript.py --help
```
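A command line like the one above could be defined with `argparse` roughly as follows. This is a sketch under assumptions: the `--output` and `--languages` flags come from the examples, but the positional argument name `video`, the help strings, and the default language list `["en"]` are guesses, and `build_parser` is a hypothetical helper rather than a function from the script.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Save each caption segment of a YouTube video to its own file."
    )
    # Accepts either a bare video ID or a full YouTube URL.
    parser.add_argument("video", help="YouTube video ID or full URL")
    parser.add_argument(
        "--output",
        default="captions",
        help="directory that receives the cc#.txt files",
    )
    # nargs="+" lets several language codes follow the flag: --languages en es fr
    parser.add_argument(
        "--languages",
        nargs="+",
        default=["en"],  # assumed default, not confirmed by the script
        help="preferred transcript languages, in priority order",
    )
    return parser
```

With this parser, `build_parser().parse_args(["dQw4w9WgXcQ", "--languages", "en", "es"])` yields `languages == ["en", "es"]` and leaves `output` at `"captions"`.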
## 📊 Output Structure
After running, you'll get:
```
captions/
├── cc1.txt # First caption segment
├── cc2.txt # Second caption segment
├── cc3.txt # Third caption segment
├── ...
├── cc150.txt # Last segment (example)
└── summary.txt # Summary information
```
Each `cc#.txt` file contains just the text from that caption segment.
## 🔧 Features
### Fixed Original Script
- **Separate files** for each caption segment
- **Sequential numbering** (`cc1.txt`, `cc2.txt`, etc.)
- **UTF-8 encoding** for international characters
- **Progress feedback** showing what's being written
### Enhanced Script
- **Command-line interface** - no need to edit code
- **URL parsing** - accepts YouTube URLs or video IDs
- **Language selection** - prefer specific languages
- **Error handling** - graceful failures with helpful messages
- **Progress tracking** - shows processing status
- **Summary file** - metadata about the download
- **Directory cleanup** - removes old files before new download
## 📋 Requirements
Install the required package:
```bash
pip install youtube-transcript-api
```
## 💡 Usage Examples
### Example 1: Educational Video
```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```
### Example 2: Multi-language Content
```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```
### Example 3: Quick Processing
```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.
```
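The examples above pass either a full URL or a bare video ID. One way to normalize both forms to an ID is sketched below, using only the standard library. The function name `extract_video_id` and the set of accepted hosts are assumptions for illustration; the enhanced script may parse URLs differently.

```python
from urllib.parse import parse_qs, urlparse

# Hosts treated as YouTube watch pages in this sketch (an assumed list).
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(value):
    """Return the video ID from a watch URL, a youtu.be link, or a bare ID."""
    value = value.strip()
    if "://" not in value:
        # No scheme present: assume the caller already passed a bare ID.
        return value

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the first path component.
        return parsed.path.lstrip("/").split("/")[0]

    if host in WATCH_HOSTS:
        # Watch URLs carry the ID in the "v" query parameter.
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]

    raise ValueError(f"Could not find a video ID in: {value}")
```

Raising `ValueError` for unrecognized input lets the caller print a helpful message instead of sending a malformed ID to the transcript request.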
## 🔍 Output Preview
When the script runs, you'll see:
```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
📄 cc10.txt: Never gonna give you up, never gonna let you...
📄 cc20.txt: We've known each other for so long...
📄 cc30.txt: Your heart's been aching but you're too shy...
🎉 Success!
📊 Total segments: 156
📁 Files saved in: /full/path/to/captions/
📋 Summary saved to: captions/summary.txt
```
## 🛠️ Troubleshooting
### Common Issues:
1. **"No transcript found"**
   - The video might not have captions or transcripts
   - Try a different video with confirmed captions
2. **"Transcripts are disabled"**
   - The video owner disabled transcripts
   - Try a different video
3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```
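The issues listed above can be turned into friendly messages with a small lookup keyed on the exception's class name. This is only a sketch: `NoTranscriptFound` and `TranscriptsDisabled` are exception names used by `youtube-transcript-api`, but the `HINTS` table, the wording of each hint, and the `explain_error` helper are hypothetical and not taken from the scripts.

```python
# Maps an exception class name to a troubleshooting hint (wording is illustrative).
HINTS = {
    "NoTranscriptFound": "No transcript found: try a video with confirmed captions.",
    "TranscriptsDisabled": "Transcripts are disabled: the owner turned them off, try a different video.",
    "ModuleNotFoundError": "Missing package: run pip install youtube-transcript-api",
}


def explain_error(exc):
    """Return a short hint for a known failure, or a generic message otherwise."""
    name = type(exc).__name__
    return HINTS.get(name, f"Unexpected error ({name}): {exc}")
```

Matching on the class name keeps the helper importable even when the library itself is not installed, which is exactly the situation the third hint covers. A caller would wrap the transcript request in `try`/`except Exception as exc` and print `explain_error(exc)`.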
### Testing:
Use the test script to verify everything works:
```bash
python test_transcript.py
```
Replace `TEST_VIDEO_ID` with a real video ID before running the test.
## 📝 File Contents Example
**cc1.txt:**
```
Welcome to this tutorial
```
**cc2.txt:**
```
Today we'll be learning about
```
**cc3.txt:**
```
the basics of programming
```
**summary.txt:**
```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: cc1.txt to cc156.txt
Generated: enhanced_yt_transcript.py
```
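A summary file in the format shown above could be produced with a helper along these lines. The four field labels are copied from the example; the function name `write_summary` and its parameters are assumptions, and the real script may record additional metadata.

```python
from pathlib import Path


def write_summary(out_dir, video_id, total_segments,
                  generator="enhanced_yt_transcript.py"):
    """Write summary.txt into out_dir and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    lines = [
        f"YouTube Video ID: {video_id}",
        f"Total segments: {total_segments}",
        f"Files: cc1.txt to cc{total_segments}.txt",
        f"Generated: {generator}",
    ]
    summary_path = out_path / "summary.txt"
    summary_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return summary_path
```

For example, `write_summary("captions", "dQw4w9WgXcQ", 156)` writes the four lines shown in the `summary.txt` sample.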
## 🎯 Perfect For:
- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually
---
**Each caption segment is now written to its own `cc#.txt` file.** 🎉