feat: transcript generation

2025-09-16 21:38:17 -07:00 · 2025-09-16 21:38:17 -07:00 · b690719ff1
commit b690719ff1
parent 467e213b2e
7 changed files with 457 additions and 8 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,29 @@
 # YouTube transcript chunks
 cc*.txt
 # Python cache files
 __pycache__/
 *.py[cod]
 *$py.class
 # Virtual environment
 venv/
 env/
 ENV/
 # IDE files
 .vscode/
 .idea/
 *.swp
 *.swo
 # OS files
 .DS_Store
 Thumbs.db
 # Log files
 *.log
 # Temporary files
 *.tmp
 *.temp
--- a/youtube/README_youtube_transcripts.md
+++ b/youtube/README_youtube_transcripts.md
@ -0,0 +1,187 @@
 # 🎬 YouTube Transcript to Numbered Files
 Fixed scripts to download YouTube video transcripts and save each caption segment to separate numbered files (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).
 ## ✅ What Was Fixed
 The original script wrote all captions to a single `captions.txt` file. Now it:
 - **Creates separate files** for each caption segment
 - **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
 - **Organizes output** in a dedicated directory
 - **Handles errors** gracefully
 - **Shows progress** during processing
 ## 📁 Available Scripts
 ### 1. `transcribe_yt_video.py` (Fixed Original)
 The minimal fixed version of your original script.
 ```python
 # Just change the video ID and run
 video_id = "dQw4w9WgXcQ"  # Replace with your video ID
 ```
 ### 2. `enhanced_yt_transcript.py` (Recommended)
 Full-featured script with command-line interface and error handling.
 ## 🚀 Usage
 ### Quick Start (Fixed Original Script)
 ```bash
 # Edit the video_id in the script, then run:
 python transcribe_yt_video.py
 ```
 ### Advanced Usage (Enhanced Script)
 ```bash
 # Using video ID
 python enhanced_yt_transcript.py dQw4w9WgXcQ
 # Using full YouTube URL
 python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
 # Custom output directory
 python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions
 # Specify preferred languages
 python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr
 # Get help
 python enhanced_yt_transcript.py --help
 ```
 ## 📊 Output Structure
 After running, you'll get:
 ```
 captions/
 ├── chunks/
 │   ├── cc1.txt      # First caption segment
 │   ├── cc2.txt      # Second caption segment
 │   ├── cc3.txt      # Third caption segment
 │   ├── ...
 │   └── cc150.txt    # Last segment (example)
 ├── <video_id>_complete_transcript.txt  # All segments combined
 └── summary.txt      # Summary information
 ```
 Each `cc#.txt` file contains just the text from that caption segment.
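The per-segment layout can be reproduced without any network access. The following is a minimal sketch, assuming the segments are already available as a list of strings. `write_chunks` is a hypothetical helper name used only for illustration, not a function from the scripts in this commit; it mirrors their behaviour of writing into a `chunks/` subdirectory.

```python
import os


def write_chunks(segments, output_dir="captions"):
    """Write each text segment to chunks/cc<N>.txt (1-based) and return the paths."""
    chunks_dir = os.path.join(output_dir, "chunks")
    os.makedirs(chunks_dir, exist_ok=True)  # also creates output_dir if it is missing
    paths = []
    for i, text in enumerate(segments, 1):
        path = os.path.join(chunks_dir, f"cc{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

Using `os.makedirs(..., exist_ok=True)` avoids the separate existence checks, at the cost of not reporting whether the directory was newly created.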
 ## 🔧 Features
 ### Fixed Original Script
 - ✅ **Separate files** for each caption segment
 - ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
 - ✅ **UTF-8 encoding** for international characters
 - ✅ **Progress feedback** showing what's being written
 ### Enhanced Script  
 - ✅ **Command-line interface** - no need to edit code
 - ✅ **URL parsing** - accepts YouTube URLs or video IDs
 - ✅ **Language selection** - prefer specific languages
 - ✅ **Error handling** - graceful failures with helpful messages
 - ✅ **Progress tracking** - shows processing status
 - ✅ **Summary file** - metadata about the download
 - ✅ **Directory cleanup** - removes old files before new download
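The URL parsing feature can be sketched roughly as follows. This is an illustration rather than the script's exact code: `parse_video_id` and `VIDEO_ID_RE` are hypothetical names, and the pattern assumes video IDs are 11 characters drawn from letters, digits, `-` and `_` (which covers IDs such as `c7bbO_KSLPI`).

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: video IDs are 11 characters from letters, digits, "-" and "_".
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def parse_video_id(url_or_id):
    """Return the ID from a bare ID, a watch URL, or a youtu.be URL; None otherwise."""
    if VIDEO_ID_RE.fullmatch(url_or_id):
        return url_or_id
    parsed = urlparse(url_or_id)
    host = parsed.netloc.lower()
    if host.endswith("youtube.com"):
        # Watch URLs carry the ID in the "v" query parameter
        candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif host.endswith("youtu.be"):
        # Short URLs carry the ID as the path
        candidate = parsed.path.lstrip("/")
    else:
        return None
    if candidate and VIDEO_ID_RE.fullmatch(candidate):
        return candidate
    return None
```

Validating the extracted candidate against the same pattern means a malformed URL yields `None` instead of a string that would only fail later, at download time.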
 ## 📋 Requirements
 Install the required package:
 ```bash
 pip install youtube-transcript-api
 ```
 ## 💡 Usage Examples
 ### Example 1: Educational Video
 ```bash
 python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
 ```
 ### Example 2: Multi-language Content
 ```bash
 python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
 ```
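The `--languages` values are tried in the order given, and the first one that has a transcript wins. As a sketch of that ordering on plain data (no API calls; `pick_language` and its arguments are hypothetical names for illustration):

```python
def pick_language(available, preferred):
    """Return the first code in `preferred` that appears in `available`, else None."""
    available_set = set(available)
    for code in preferred:
        if code in available_set:
            return code
    return None
```

For example, with `available=["ko", "en"]` and `preferred=["es", "en"]`, the function returns `"en"`, because `es` is not available and `en` is the next preference.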
 ### Example 3: Quick Processing
 ```bash
 python enhanced_yt_transcript.py VIDEO_ID
 # Creates captions/chunks/cc1.txt, captions/chunks/cc2.txt, etc.
 ```
 ## 🔍 Output Preview
 When the script runs, you'll see:
 ```
 🎬 Processing video ID: dQw4w9WgXcQ
 ✅ Found auto-generated or default transcript
 📁 Created directory: captions
 📁 Created chunks directory: captions/chunks
 📝 Writing 156 segments...
  📄 cc10.txt: Never gonna give you up, never gonna let you...
  📄 cc20.txt: We've known each other for so long...
  📄 cc30.txt: Your heart's been aching but you're too shy...
 🎉 Success!
 📊 Total segments: 156
 📁 Individual files saved in: /full/path/to/captions/chunks/
 📄 Complete transcript saved as: dQw4w9WgXcQ_complete_transcript.txt
 📋 Summary saved to: captions/summary.txt
 ```
 ## 🛠️ Troubleshooting
 ### Common Issues:
 1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions
 2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video
 3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```
 ### Testing:
 Use the test script to verify everything works:
 ```bash
 python test_transcript.py
 ```
 (Remember to replace `TEST_VIDEO_ID` with a real video ID)
 ## 📝 File Contents Example
 **cc1.txt:**
 ```
 Welcome to this tutorial
 ```
 **cc2.txt:** 
 ```
 Today we'll be learning about
 ```
 **cc3.txt:**
 ```
 the basics of programming
 ```
 **summary.txt:**
 ```
 YouTube Video ID: dQw4w9WgXcQ
 Total segments: 156
 Files: chunks/cc1.txt to chunks/cc156.txt
 Complete transcript: dQw4w9WgXcQ_complete_transcript.txt
 Generated: enhanced_yt_transcript.py
 ```
 ## 🎯 Perfect For:
 - **Content analysis** - process each caption separately
 - **AI training data** - individual text segments
 - **Research projects** - granular transcript analysis
 - **Content creation** - extract specific quotes/segments
 - **Translation work** - process segments individually
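For per-segment processing, note that a plain alphabetical sort puts `cc10.txt` before `cc2.txt`. Below is a minimal sketch of reading the chunk files back in numeric order, assuming the `cc<N>.txt` naming used here; `read_chunks` is a hypothetical helper, not part of the scripts.

```python
import os
import re

CHUNK_NAME_RE = re.compile(r"cc(\d+)\.txt")


def read_chunks(chunks_dir):
    """Return the text of every cc<N>.txt file in chunks_dir, ordered by N."""
    numbered = []
    for name in os.listdir(chunks_dir):
        match = CHUNK_NAME_RE.fullmatch(name)
        if match:  # skips summary.txt and any other unrelated files
            numbered.append((int(match.group(1)), name))
    texts = []
    for _, name in sorted(numbered):
        with open(os.path.join(chunks_dir, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts
```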
 ---
 **The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉
--- a/youtube/captions/c7bbO_KSLPI_complete_transcript.txt
+++ b/youtube/captions/c7bbO_KSLPI_complete_transcript.txt
--- a/youtube/captions/complete_transcript.txt
+++ b/youtube/captions/complete_transcript.txt
--- a/youtube/enhanced_yt_transcript.py
+++ b/youtube/enhanced_yt_transcript.py
@ -0,0 +1,174 @@
 #!/usr/bin/env python3
 """
 Enhanced YouTube Transcript Downloader
 Downloads YouTube video transcripts and saves each segment to separate numbered files (cc1.txt, cc2.txt, etc.)
 """
 import os
 import sys
 import argparse
 from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
 from urllib.parse import urlparse, parse_qs
 def extract_video_id(url_or_id):
    """Extract video ID from YouTube URL or return ID if already provided"""
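    # Video IDs may contain "-" and "_" (e.g. c7bbO_KSLPI), which str.isalnum() rejects
    if len(url_or_id) == 11 and all(c.isalnum() or c in "-_" for c in url_or_id):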
        return url_or_id
    # Parse YouTube URL
    parsed_url = urlparse(url_or_id)
    if 'youtube.com' in parsed_url.netloc:
        return parse_qs(parsed_url.query).get('v', [None])[0]
    elif 'youtu.be' in parsed_url.netloc:
        return parsed_url.path[1:]
    return None
 def download_transcript(video_id, output_dir="captions", language_codes=None):
    """Download transcript and save to numbered files"""
    try:
        # Get available transcripts
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        # Try to get transcript in preferred language or auto-generated
        transcript = None
        if language_codes:
            for lang in language_codes:
                try:
                    transcript = transcript_list.find_transcript([lang]).fetch()
                    print(f"✅ Found transcript in language: {lang}")
                    break
                except NoTranscriptFound:
                    continue
        if not transcript:
            # Fall back to the library's default lookup (English unless languages are passed)
            try:
                transcript = YouTubeTranscriptApi.get_transcript(video_id)
                print("✅ Found auto-generated or default transcript")
            except NoTranscriptFound:
                print("❌ No transcript found for this video")
                return False
        # Create output directory and chunks subdirectory
        chunks_dir = os.path.join(output_dir, "chunks")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
            print(f"📁 Created directory: {output_dir}")
        if not os.path.exists(chunks_dir):
            os.makedirs(chunks_dir)
            print(f"📁 Created chunks directory: {chunks_dir}")
        # Clear existing files in chunks directory
        for filename in os.listdir(chunks_dir):
            if filename.startswith("cc") and filename.endswith(".txt"):
                os.remove(os.path.join(chunks_dir, filename))
        # Write each segment to separate files
        print(f"📝 Writing {len(transcript)} segments...")
        for i, entry in enumerate(transcript, 1):
            filename = f"cc{i}.txt"
            filepath = os.path.join(chunks_dir, filename)
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(entry['text'])
            # Show progress for every 10th file or if text is interesting
            if i % 10 == 0 or len(entry['text']) > 50:
                preview = entry['text'][:50] + "..." if len(entry['text']) > 50 else entry['text']
                print(f"  📄 {filename}: {preview}")
        # Create complete transcript file with YouTube ID in filename
        complete_filename = f"{video_id}_complete_transcript.txt"
        complete_filepath = os.path.join(output_dir, complete_filename)
        # Combine all chunks into single file
        with open(complete_filepath, "w", encoding="utf-8") as f:
            for i in range(1, len(transcript) + 1):
                chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
                if os.path.exists(chunk_file):
                    with open(chunk_file, "r", encoding="utf-8") as chunk_f:
                        # Newline-terminate each segment so adjacent captions do not run together
                        f.write(chunk_f.read() + "\n")
        print(f"\n🎉 Success!")
        print(f"📊 Total segments: {len(transcript)}")
        print(f"📁 Individual files saved in: {os.path.abspath(chunks_dir)}/")
        print(f"📄 Complete transcript saved as: {complete_filename}")
        # Create a summary file
        summary_path = os.path.join(output_dir, "summary.txt")
        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(f"YouTube Video ID: {video_id}\n")
            f.write(f"Total segments: {len(transcript)}\n")
            f.write(f"Files: chunks/cc1.txt to chunks/cc{len(transcript)}.txt\n")
            f.write(f"Complete transcript: {complete_filename}\n")
            f.write(f"Generated: {os.path.basename(__file__)}\n")
        print(f"📋 Summary saved to: {summary_path}")
        return True
    except TranscriptsDisabled:
        print("❌ Transcripts are disabled for this video")
        return False
    except NoTranscriptFound:
        print("❌ No transcript found for this video")
        return False
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return False
 def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube transcripts to numbered caption files",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
  %(prog)s dQw4w9WgXcQ
  %(prog)s "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  %(prog)s dQw4w9WgXcQ --output my_captions --languages en es fr
        """
    )
    parser.add_argument(
        "video",
        help="YouTube video ID or URL"
    )
    parser.add_argument(
        "--output", "-o",
        default="captions",
        help="Output directory for caption files (default: captions)"
    )
    parser.add_argument(
        "--languages", "-l",
        nargs="*",
        default=["en"],
        help="Preferred language codes (e.g., en es fr) - default: en"
    )
    args = parser.parse_args()
    # Extract video ID
    video_id = extract_video_id(args.video)
    if not video_id:
        print("❌ Invalid YouTube URL or video ID")
        print("Example formats:")
        print("  Video ID: dQw4w9WgXcQ")
        print("  Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ")
        print("  Short URL: https://youtu.be/dQw4w9WgXcQ")
        sys.exit(1)
    print(f"🎬 Processing video ID: {video_id}")
    # Download transcript
    success = download_transcript(video_id, args.output, args.languages)
    if not success:
        sys.exit(1)
 if __name__ == "__main__":
    main()
--- a/youtube/test_transcript.py
+++ b/youtube/test_transcript.py
@ -0,0 +1,24 @@
 #!/usr/bin/env python3
 """
 Test script to demonstrate the fixed CC text functionality
 Replace 'TEST_VIDEO_ID' with an actual YouTube video ID that has captions
 """
 from enhanced_yt_transcript import download_transcript
 # Example video ID - replace with a real video ID that has captions
 video_id = "TEST_VIDEO_ID"  # Replace this with actual video ID
 print("🎬 Testing YouTube transcript downloader...")
 print(f"📹 Video ID: {video_id}")
 print("📝 This will create cc1.txt, cc2.txt, cc3.txt, etc.")
 print()
 # Test the function
 success = download_transcript(video_id, output_dir="test_captions")
 if success:
    print("\n✅ Test completed successfully!")
    print("Check the 'test_captions/chunks' directory for cc#.txt files")
 else:
    print("\n❌ Test failed - make sure to use a valid video ID with captions")
--- a/youtube/transcribe_yt_video.py
+++ b/youtube/transcribe_yt_video.py
@ -1,24 +1,57 @@
 from youtube_transcript_api import YouTubeTranscriptApi
 import os
-video_id = "VIDEO_ID_HERE"  # e.g., 'dQw4w9WgXcQ'
+video_id = "c7bbO_KSLPI"  # Video ID from the provided URL
-transcript = YouTubeTranscriptApi.get_transcript(video_id)
-# Create output directory if it doesn't exist
+# Create API instance and get transcript
 ytt_api = YouTubeTranscriptApi()
 transcript_list = ytt_api.list(video_id)
 # Try to get transcript in Korean (available for this video)
 try:
    transcript = transcript_list.find_transcript(['ko']).fetch()
    print("✅ Found Korean transcript")
 except Exception:
    # Fall back to the auto-generated Korean transcript
    transcript = transcript_list.find_generated_transcript(['ko']).fetch()
    print("✅ Found Korean auto-generated transcript")
 # Create output directory and chunks subdirectory if they don't exist
 output_dir = "captions"
 chunks_dir = os.path.join(output_dir, "chunks")
 if not os.path.exists(output_dir):
    os.makedirs(output_dir)
 if not os.path.exists(chunks_dir):
    os.makedirs(chunks_dir)
-# Write each caption segment to separate numbered files
+# Clear existing files in chunks directory
 for filename in os.listdir(chunks_dir):
    if filename.startswith("cc") and filename.endswith(".txt"):
        os.remove(os.path.join(chunks_dir, filename))
 # Write each caption segment to separate numbered files in chunks folder
 for i, entry in enumerate(transcript, 1):
    filename = f"cc{i}.txt"
-    filepath = os.path.join(output_dir, filename)
+    filepath = os.path.join(chunks_dir, filename)
    with open(filepath, "w", encoding="utf-8") as f:
-        f.write(entry['text'])
+        f.write(entry.text)
-    print(f"Written: {filename} - {entry['text'][:50]}...")
+    print(f"Written: {filename} - {entry.text[:50]}...")
 # Create complete transcript file with YouTube ID in filename
 complete_filename = f"{video_id}_complete_transcript.txt"
 complete_filepath = os.path.join(output_dir, complete_filename)
 # Combine all chunks into single file
 with open(complete_filepath, "w", encoding="utf-8") as f:
    for i in range(1, len(transcript) + 1):
        chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
        if os.path.exists(chunk_file):
            with open(chunk_file, "r", encoding="utf-8") as chunk_f:
                # Newline-terminate each segment so adjacent captions do not run together
                f.write(chunk_f.read() + "\n")
 print(f"\nTotal segments: {len(transcript)}")
-print(f"Files saved in: {output_dir}/")
+print(f"Individual files saved in: {chunks_dir}/")
 print(f"Complete transcript saved as: {complete_filename}")