Saravana Dhandapani b690719ff1 feat: transcript generation

2025-09-16 21:38:17 -07:00

🎬 YouTube Transcript to Numbered Files

Scripts that download a YouTube video's transcript and save each caption segment to its own numbered file (cc1.txt, cc2.txt, cc3.txt, etc.).

✅ What Was Fixed

The original script wrote all captions to a single captions.txt file. Now it:

Creates separate files for each caption segment
Numbers files sequentially: cc1.txt, cc2.txt, cc3.txt, etc.
Organizes output in a dedicated directory
Handles errors gracefully
Shows progress during processing
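
The core of the fix can be sketched roughly as follows. This is a minimal illustration, not the actual code from either script: the function name `write_segments` and its parameters are hypothetical, and it assumes the caption texts have already been fetched into a list of strings.

```python
from pathlib import Path


def write_segments(texts, out_dir="captions"):
    """Write each caption text to its own cc<N>.txt file, numbered from 1."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # dedicated output directory
    written = []
    for number, text in enumerate(texts, start=1):
        target = out_path / f"cc{number}.txt"
        target.write_text(text, encoding="utf-8")  # UTF-8 for international characters
        print(f"Wrote {target.name}")  # simple progress feedback
        written.append(target)
    return written
```

Numbering starts at 1 so the file names match the cc1.txt, cc2.txt, ... scheme described above.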

📁 Available Scripts

1. `transcribe_yt_video.py` (Fixed Original)

The minimal fixed version of the original script.

# Just change the video ID and run
video_id = "dQw4w9WgXcQ"  # Replace with your video ID

2. `enhanced_yt_transcript.py` (Recommended)

Full-featured script with command-line interface and error handling.

🚀 Usage

Quick Start (Fixed Original Script)

# Edit the video_id in the script, then run:
python transcribe_yt_video.py

Advanced Usage (Enhanced Script)

# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
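
For orientation, a command-line interface with these options could be declared with `argparse` along the following lines. This is a sketch only: the flag names come from the commands above, but the defaults (`captions`, `en`), the help texts, and the `build_parser` name are assumptions rather than the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser matching the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Save each caption segment of a YouTube video to a numbered file."
    )
    parser.add_argument("video", help="YouTube video ID or full URL")
    # Default directory name is an assumption based on the example output.
    parser.add_argument("--output", default="captions", help="output directory")
    # nargs="+" accepts one or more language codes, e.g. --languages en es fr
    parser.add_argument("--languages", nargs="+", default=["en"],
                        help="preferred languages, in priority order")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["dQw4w9WgXcQ", "--languages", "en", "es", "fr"])
print(args.video, args.output, args.languages)
```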

📊 Output Structure

After running, you'll get:

captions/
├── cc1.txt          # First caption segment
├── cc2.txt          # Second caption segment  
├── cc3.txt          # Third caption segment
├── ...
├── cc150.txt        # Last segment (example)
└── summary.txt      # Summary information

Each cc#.txt file contains just the text from that caption segment.
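
When reading the files back, note that a plain alphabetical sort puts cc10.txt before cc2.txt. The sketch below orders the segments by their number instead; `read_segments` is a hypothetical helper, not part of either script.

```python
import re
from pathlib import Path


def read_segments(out_dir="captions"):
    """Return caption texts ordered by the number in cc<N>.txt."""
    pattern = re.compile(r"cc(\d+)\.txt")
    numbered = []
    for path in Path(out_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:  # skips summary.txt and any unrelated files
            numbered.append((int(match.group(1)), path))
    return [path.read_text(encoding="utf-8") for _, path in sorted(numbered)]
```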

🔧 Features

Fixed Original Script

✅ Separate files for each caption segment
✅ Sequential numbering (cc1.txt, cc2.txt, etc.)
✅ UTF-8 encoding for international characters
✅ Progress feedback showing what's being written

Enhanced Script

✅ Command-line interface - no need to edit code
✅ URL parsing - accepts YouTube URLs or video IDs
✅ Language selection - prefer specific languages
✅ Error handling - graceful failures with helpful messages
✅ Progress tracking - shows processing status
✅ Summary file - metadata about the download
✅ Directory cleanup - removes old files before new download
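
As an illustration of the URL parsing feature, the sketch below accepts either a bare video ID or a YouTube URL and returns the ID. The function name `extract_video_id` is hypothetical, and support for the short `youtu.be` form is an assumption; the enhanced script may handle more or fewer URL shapes.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(value):
    """Return the video ID from a bare ID or a YouTube URL."""
    parsed = urlparse(value)
    if not parsed.scheme:
        return value  # no scheme: treat the input as a bare video ID
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/")  # short links carry the ID in the path
    if host.endswith("youtube.com"):
        ids = parse_qs(parsed.query).get("v")  # watch URLs carry it in ?v=
        if ids:
            return ids[0]
    raise ValueError(f"Could not find a video ID in: {value}")
```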

📋 Requirements

Install the required package:

pip install youtube-transcript-api
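
Once the package is installed, fetching the caption texts might look roughly like this. Treat it as a hedged sketch: the helper names are hypothetical, and which call is available depends on the installed release of youtube-transcript-api (newer releases expose an instance method `fetch`, older ones a static `get_transcript`), so check the version you have.

```python
def text_of(item):
    """Return the caption text from a dict entry or an object with a .text attribute."""
    return item["text"] if isinstance(item, dict) else item.text


def fetch_segment_texts(video_id, languages=("en",)):
    """Fetch a transcript and return its caption texts as a list of strings."""
    # Imported lazily so text_of() works even without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    if hasattr(YouTubeTranscriptApi, "fetch"):
        # Newer releases: instance method that yields snippet objects.
        transcript = YouTubeTranscriptApi().fetch(video_id, languages=list(languages))
    else:
        # Older releases: static method that returns a list of dicts.
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return [text_of(item) for item in transcript]
```

Calling `fetch_segment_texts("dQw4w9WgXcQ")` requires network access and a video that has captions.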

💡 Usage Examples

Example 1: Educational Video

python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes

Example 2: Multi-language Content

python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions

Example 3: Quick Processing

python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.

🔍 Output Preview

When the script runs, you'll see:

🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
  📄 cc10.txt: Never gonna give you up, never gonna let you...
  📄 cc20.txt: We've known each other for so long...
  📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Files saved in: /full/path/to/captions/
📋 Summary saved to: captions/summary.txt

🛠️ Troubleshooting

Common Issues:

"No transcript found"
- The video might not have captions or transcripts.
- Try a different video with confirmed captions.

"Transcripts are disabled"
- The video owner disabled transcripts.
- Try a different video.

"Module not found"
- Install the required package:
```
pip install youtube-transcript-api
```
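
One way to turn these failures into helpful messages is to map the exception type to a hint, as sketched below. The `HINTS` table and `explain` function are hypothetical. The names `NoTranscriptFound` and `TranscriptsDisabled` are assumed to match the exception classes raised by youtube-transcript-api, and matching on the class name is a simplification that keeps the sketch free of that dependency.

```python
# Hints keyed by exception class name (assumed names, see the note above).
HINTS = {
    "NoTranscriptFound": "No transcript found: try a video with confirmed captions.",
    "TranscriptsDisabled": "Transcripts are disabled by the video owner: try a different video.",
    "ModuleNotFoundError": "Missing package: run 'pip install youtube-transcript-api'.",
}


def explain(exc):
    """Return a short, user-facing hint for a caught exception."""
    return HINTS.get(type(exc).__name__, f"Unexpected error: {exc}")
```

A caller would wrap the download in `try`/`except Exception as exc` and print `explain(exc)` instead of a traceback.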

Testing:

Use the test script to verify everything works:

python test_transcript.py

(Remember to replace TEST_VIDEO_ID with a real video ID)

📝 File Contents Example

cc1.txt:

Welcome to this tutorial

cc2.txt:

Today we'll be learning about

cc3.txt:

the basics of programming

summary.txt:

YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: cc1.txt to cc156.txt
Generated: enhanced_yt_transcript.py
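
A summary in this format could be produced with a few lines like the following. This is a sketch that only mirrors the four fields shown above; the `write_summary` name and its parameters are assumptions, not the enhanced script's actual code.

```python
from pathlib import Path


def write_summary(out_dir, video_id, total_segments,
                  generator="enhanced_yt_transcript.py"):
    """Write summary.txt with the four fields shown in the example."""
    lines = [
        f"YouTube Video ID: {video_id}",
        f"Total segments: {total_segments}",
        f"Files: cc1.txt to cc{total_segments}.txt",
        f"Generated: {generator}",
    ]
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    summary = out_path / "summary.txt"
    summary.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return summary
```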

🎯 Perfect For:

Content analysis - process each caption separately
AI training data - individual text segments
Research projects - granular transcript analysis
Content creation - extract specific quotes/segments
Translation work - process segments individually

Both scripts now write each caption segment to its own cc#.txt file. 🎉
