🎬 YouTube Transcript to Numbered Files

Scripts that download a YouTube video's transcript and save each caption segment to its own numbered file (cc1.txt, cc2.txt, cc3.txt, etc.).

What Was Fixed

The original script wrote all captions to a single captions.txt file. Now it:

  • Creates separate files for each caption segment
  • Numbers files sequentially: cc1.txt, cc2.txt, cc3.txt, etc.
  • Organizes output in a dedicated directory
  • Handles errors gracefully
  • Shows progress during processing
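The core of the fix can be sketched as follows. This is a minimal illustration, not the script itself: the function name `write_segments` and its parameters are hypothetical, and it assumes the caption texts have already been fetched into a list of strings.

```python
from pathlib import Path


def write_segments(texts, output_dir="captions"):
    """Write each caption text to cc1.txt, cc2.txt, ... inside output_dir."""
    out = Path(output_dir)
    # Create the dedicated output directory if it does not exist yet.
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(texts, start=1):
        path = out / f"cc{number}.txt"
        # UTF-8 keeps international characters intact.
        path.write_text(text, encoding="utf-8")
        print(f"Wrote {path.name}: {text[:40]}")
        paths.append(path)
    return paths
```

Calling `write_segments(["Welcome to this tutorial", "Today we'll be learning about"])` would create captions/cc1.txt and captions/cc2.txt.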

📁 Available Scripts

1. transcribe_yt_video.py (Fixed Original)

The minimal fixed version of the original script.

# Just change the video ID and run
video_id = "dQw4w9WgXcQ"  # Replace with your video ID

2. enhanced_yt_transcript.py (Enhanced)

Full-featured script with a command-line interface and error handling.

🚀 Usage

Quick Start (Fixed Original Script)

# Edit the video_id in the script, then run:
python transcribe_yt_video.py

Advanced Usage (Enhanced Script)

# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
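For reference, a command line like the one above could be defined with `argparse` roughly as follows. This is a sketch under assumptions, not the script's actual code: the helper name `build_parser`, the argument name `video`, and the default language `en` are illustrative; the `captions` default follows the examples in this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="enhanced_yt_transcript.py",
        description="Save each caption segment of a YouTube video to cc#.txt files.",
    )
    parser.add_argument("video", help="YouTube video ID or full URL")
    parser.add_argument("--output", default="captions",
                        help="directory for the cc#.txt files")
    # nargs="+" accepts one or more language codes, e.g. --languages en es fr
    parser.add_argument("--languages", nargs="+", default=["en"],
                        help="preferred languages, in order (default is an assumption)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["dQw4w9WgXcQ", "--languages", "en", "es", "fr"])
print(args.video, args.output, args.languages)
```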

📊 Output Structure

After running, you'll get:

captions/
├── cc1.txt          # First caption segment
├── cc2.txt          # Second caption segment  
├── cc3.txt          # Third caption segment
├── ...
├── cc150.txt        # Last segment (example)
└── summary.txt      # Summary information

Each cc#.txt file contains just the text from that caption segment.
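If an earlier run produced more segments than the current one, stale cc#.txt files would otherwise remain in the directory. One way to clear them before writing is sketched below; the function name `clear_old_captions` is hypothetical, and the choice to delete only cc#.txt and summary.txt (leaving any other files alone) is an assumption about what cleanup should do.

```python
import re
from pathlib import Path

# Matches cc1.txt, cc25.txt, ... but not other files in the directory.
CAPTION_FILE = re.compile(r"cc\d+\.txt")


def clear_old_captions(output_dir="captions"):
    """Delete cc#.txt and summary.txt from output_dir; return removed names."""
    out = Path(output_dir)
    removed = []
    if not out.is_dir():
        return removed  # nothing to clean on a first run
    for path in sorted(out.iterdir()):
        is_caption = CAPTION_FILE.fullmatch(path.name) is not None
        if path.is_file() and (is_caption or path.name == "summary.txt"):
            path.unlink()
            removed.append(path.name)
    return removed
```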

🔧 Features

Fixed Original Script

  • Separate files for each caption segment
  • Sequential numbering (cc1.txt, cc2.txt, etc.)
  • UTF-8 encoding for international characters
  • Progress feedback showing what's being written

Enhanced Script

  • Command-line interface - no need to edit code
  • URL parsing - accepts YouTube URLs or video IDs
  • Language selection - prefer specific languages
  • Error handling - graceful failures with helpful messages
  • Progress tracking - shows processing status
  • Summary file - metadata about the download
  • Directory cleanup - removes old files before new download
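The URL parsing feature can be approximated with the standard library. The sketch below is an illustration only: `extract_video_id` is a hypothetical name, and the set of URL shapes it accepts (watch, youtu.be, embed, shorts, live) is an assumption rather than a description of what the enhanced script supports.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(value):
    """Return the video ID from a YouTube URL, or value itself if it is a bare ID."""
    parsed = urlparse(value)
    if not parsed.scheme:
        return value  # no scheme: treat the input as a video ID
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path component.
        return parsed.path.lstrip("/").split("/")[0]
    if host == "youtube.com" or host.endswith(".youtube.com"):
        query_id = parse_qs(parsed.query).get("v")
        if query_id:
            return query_id[0]
        parts = [part for part in parsed.path.split("/") if part]
        if len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
            return parts[1]
    raise ValueError(f"Could not find a video ID in: {value}")


print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```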

📋 Requirements

Install the required package:

pip install youtube-transcript-api

💡 Usage Examples

Example 1: Educational Video

python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes

Example 2: Multi-language Content

python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions

Example 3: Quick Processing

python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.

🔍 Output Preview

When the script runs, you'll see:

🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
  📄 cc10.txt: Never gonna give you up, never gonna let you...
  📄 cc20.txt: We've known each other for so long...
  📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Files saved in: /full/path/to/captions/
📋 Summary saved to: captions/summary.txt

🛠️ Troubleshooting

Common Issues:

  1. "No transcript found"

    • Video might not have captions/transcripts
    • Try a different video with confirmed captions
  2. "Transcripts are disabled"

    • Video owner disabled transcripts
    • Try a different video
  3. Module not found

    pip install youtube-transcript-api
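The three issues above map onto exceptions that a script can catch and translate into hints. The following is a hedged sketch, not the enhanced script's code: `fetch_segment_texts` is a hypothetical name, the `hasattr` check is my own way of coping with the difference between older releases of youtube-transcript-api (static `get_transcript`) and newer ones (instance method `fetch`), and the default language `en` is an assumption.

```python
def fetch_segment_texts(video_id, languages=("en",)):
    """Return caption texts for video_id, or raise RuntimeError with a hint."""
    try:
        # Imported lazily so a missing package becomes a readable hint.
        from youtube_transcript_api import (
            NoTranscriptFound,
            TranscriptsDisabled,
            YouTubeTranscriptApi,
        )
    except ImportError as exc:
        raise RuntimeError(
            "Module not found: run pip install youtube-transcript-api"
        ) from exc

    try:
        if hasattr(YouTubeTranscriptApi, "get_transcript"):
            # Older releases: static helper returning a list of dicts.
            segments = YouTubeTranscriptApi.get_transcript(
                video_id, languages=list(languages)
            )
            return [segment["text"] for segment in segments]
        # Newer releases: instance method returning snippet objects.
        fetched = YouTubeTranscriptApi().fetch(video_id, languages=list(languages))
        return [snippet.text for snippet in fetched]
    except TranscriptsDisabled as exc:
        raise RuntimeError(
            "Transcripts are disabled: the video owner turned them off"
        ) from exc
    except NoTranscriptFound as exc:
        raise RuntimeError(
            "No transcript found: try a video with confirmed captions"
        ) from exc
```

A caller can then print the `RuntimeError` message and exit instead of showing a traceback.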
    

Testing:

Use the test script to verify everything works:

python test_transcript.py

(Remember to replace TEST_VIDEO_ID with a real video ID)

📝 File Contents Example

cc1.txt:

Welcome to this tutorial

cc2.txt:

Today we'll be learning about

cc3.txt:

the basics of programming

summary.txt:

YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: cc1.txt to cc156.txt
Generated: enhanced_yt_transcript.py
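
A summary file in this shape could be produced along these lines. This is a sketch that reproduces the four fields shown above; the function name `write_summary` and its parameters are hypothetical, and the real script may record additional metadata.

```python
from pathlib import Path


def write_summary(output_dir, video_id, total_segments,
                  generator="enhanced_yt_transcript.py"):
    """Write summary.txt with the fields shown in the example above."""
    lines = [
        f"YouTube Video ID: {video_id}",
        f"Total segments: {total_segments}",
        f"Files: cc1.txt to cc{total_segments}.txt",
        f"Generated: {generator}",
    ]
    path = Path(output_dir) / "summary.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```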

🎯 Perfect For:

  • Content analysis - process each caption separately
  • AI training data - individual text segments
  • Research projects - granular transcript analysis
  • Content creation - extract specific quotes/segments
  • Translation work - process segments individually

Both scripts now write each caption segment to its own cc#.txt file. 🎉