feat: transcript generation

2025-09-16 21:38:17 -07:00 · 2025-09-16 21:38:17 -07:00 · b690719ff1
commit b690719ff1
parent 467e213b2e
7 changed files with 457 additions and 8 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,29 @@
+# YouTube transcript chunks
+cc*.txt
+
+# Python cache files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Virtual environment
+venv/
+env/
+ENV/
+
+# IDE files
+.vscode/
+.idea/
+*.swp
+*.swo
+
+# OS files
+.DS_Store
+Thumbs.db
+
+# Log files
+*.log
+
+# Temporary files
+*.tmp
+*.temp
--- a/youtube/README_youtube_transcripts.md
+++ b/youtube/README_youtube_transcripts.md
@@ -0,0 +1,187 @@
+# 🎬 YouTube Transcript to Numbered Files
+
+Scripts that download a YouTube video's transcript and save each caption segment to its own numbered file (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).
+
+## ✅ What Was Fixed
+
+The original script wrote all captions to a single `captions.txt` file. Now it:
+- **Creates separate files** for each caption segment
+- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
+- **Organizes output** in a dedicated directory
+- **Handles errors** gracefully
+- **Shows progress** during processing
+
+## 📁 Available Scripts
+
+### 1. `transcribe_yt_video.py` (Fixed Original)
+The minimal fixed version of the original script.
+
+```python
+# Just change the video ID and run
+video_id = "dQw4w9WgXcQ"  # Replace with your video ID
+```
+
+### 2. `enhanced_yt_transcript.py` (Recommended)
+Full-featured script with command-line interface and error handling.
+
+## 🚀 Usage
+
+### Quick Start (Fixed Original Script)
+```bash
+# Edit the video_id in the script, then run:
+python transcribe_yt_video.py
+```
+
+### Advanced Usage (Enhanced Script)
+```bash
+# Using video ID
+python enhanced_yt_transcript.py dQw4w9WgXcQ
+
+# Using full YouTube URL
+python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+# Custom output directory
+python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions
+
+# Specify preferred languages
+python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr
+
+# Get help
+python enhanced_yt_transcript.py --help
+```
+
+## 📊 Output Structure
+
+After running, you'll get:
+```
+captions/
+├── chunks/
+│   ├── cc1.txt      # First caption segment
+│   ├── cc2.txt      # Second caption segment
+│   ├── ...
+│   └── cc150.txt    # Last segment (example)
+├── <video_id>_complete_transcript.txt   # All segments combined
+└── summary.txt      # Summary information
+
+Each `cc#.txt` file contains just the text from that caption segment.
+
+## 🔧 Features
+
+### Fixed Original Script
+- ✅ **Separate files** for each caption segment
+- ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
+- ✅ **UTF-8 encoding** for international characters
+- ✅ **Progress feedback** showing what's being written
+
+### Enhanced Script  
+- ✅ **Command-line interface** - no need to edit code
+- ✅ **URL parsing** - accepts YouTube URLs or video IDs
+- ✅ **Language selection** - prefer specific languages
+- ✅ **Error handling** - graceful failures with helpful messages
+- ✅ **Progress tracking** - shows processing status
+- ✅ **Summary file** - metadata about the download
+- ✅ **Directory cleanup** - removes old files before new download
+
+## 📋 Requirements
+
+Install the required package:
+```bash
+pip install youtube-transcript-api
+```
+
+## 💡 Usage Examples
+
+### Example 1: Educational Video
+```bash
+python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
+```
+
+### Example 2: Multi-language Content
+```bash
+python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
+```
+
+### Example 3: Quick Processing
+```bash
+python enhanced_yt_transcript.py VIDEO_ID
+# Creates captions/chunks/cc1.txt, captions/chunks/cc2.txt, etc.
+```
+
+## 🔍 Output Preview
+
+When the script runs, you'll see:
+```
+🎬 Processing video ID: dQw4w9WgXcQ
+✅ Found auto-generated or default transcript
+📁 Created directory: captions
+📝 Writing 156 segments...
+  📄 cc10.txt: Never gonna give you up, never gonna let you...
+  📄 cc20.txt: We've known each other for so long...
+  📄 cc30.txt: Your heart's been aching but you're too shy...
+
+🎉 Success!
+📊 Total segments: 156
+📁 Individual files saved in: /full/path/to/captions/chunks/
+📋 Summary saved to: captions/summary.txt
+```
+
+## 🛠️ Troubleshooting
+
+### Common Issues:
+
+1. **"No transcript found"**
+   - Video might not have captions/transcripts
+   - Try a different video with confirmed captions
+
+2. **"Transcripts are disabled"**
+   - Video owner disabled transcripts
+   - Try a different video
+
+3. **Module not found**
+   ```bash
+   pip install youtube-transcript-api
+   ```
+
+### Testing:
+Use the test script to verify everything works:
+```bash
+python test_transcript.py
+```
+(Remember to replace `TEST_VIDEO_ID` with a real video ID)
+
+## 📝 File Contents Example
+
+**cc1.txt:**
+```
+Welcome to this tutorial
+```
+
+**cc2.txt:** 
+```
+Today we'll be learning about
+```
+
+**cc3.txt:**
+```
+the basics of programming
+```
+
+**summary.txt:**
+```
+YouTube Video ID: dQw4w9WgXcQ
+Total segments: 156
+Files: chunks/cc1.txt to chunks/cc156.txt
+Generated: enhanced_yt_transcript.py
+```
+
+## 🎯 Perfect For:
+
+- **Content analysis** - process each caption separately
+- **AI training data** - individual text segments
+- **Research projects** - granular transcript analysis
+- **Content creation** - extract specific quotes/segments
+- **Translation work** - process segments individually
+
+---
+
+**The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉
--- a/youtube/captions/c7bbO_KSLPI_complete_transcript.txt
+++ b/youtube/captions/c7bbO_KSLPI_complete_transcript.txt
--- a/youtube/captions/complete_transcript.txt
+++ b/youtube/captions/complete_transcript.txt
--- a/youtube/enhanced_yt_transcript.py
+++ b/youtube/enhanced_yt_transcript.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+"""
+Enhanced YouTube Transcript Downloader
+Downloads YouTube video transcripts and saves each segment to separate numbered files (cc1.txt, cc2.txt, etc.)
+"""
+
+import os
+import sys
+import argparse
+from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
+from urllib.parse import urlparse, parse_qs
+
+def extract_video_id(url_or_id):
+    """Extract video ID from YouTube URL or return ID if already provided"""
+    if len(url_or_id) == 11 and url_or_id.replace('-', '').replace('_', '').isalnum():
+        return url_or_id
+    
+    # Parse YouTube URL
+    parsed_url = urlparse(url_or_id)
+    
+    if 'youtube.com' in parsed_url.netloc:
+        return parse_qs(parsed_url.query).get('v', [None])[0]
+    elif 'youtu.be' in parsed_url.netloc:
+        return parsed_url.path[1:]
+    
+    return None
+
+def download_transcript(video_id, output_dir="captions", language_codes=None):
+    """Download transcript and save to numbered files"""
+    
+    try:
+        # Get available transcripts (instance API, as in transcribe_yt_video.py)
+        transcript_list = YouTubeTranscriptApi().list(video_id)
+        
+        # Try to get transcript in preferred language or auto-generated
+        transcript = None
+        if language_codes:
+            for lang in language_codes:
+                try:
+                    transcript = transcript_list.find_transcript([lang]).fetch()
+                    print(f"✅ Found transcript in language: {lang}")
+                    break
+                except NoTranscriptFound:
+                    continue
+        
+        if transcript is None:
+            # Fall back to the library's default transcript lookup
+            try:
+                transcript = YouTubeTranscriptApi().fetch(video_id)
+                print("✅ Found auto-generated or default transcript")
+            except NoTranscriptFound:
+                print("❌ No transcript found for this video")
+                return False
+        
+        # Create output directory and chunks subdirectory
+        chunks_dir = os.path.join(output_dir, "chunks")
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+            print(f"📁 Created directory: {output_dir}")
+        if not os.path.exists(chunks_dir):
+            os.makedirs(chunks_dir)
+            print(f"📁 Created chunks directory: {chunks_dir}")
+        
+        # Clear existing files in chunks directory
+        for filename in os.listdir(chunks_dir):
+            if filename.startswith("cc") and filename.endswith(".txt"):
+                os.remove(os.path.join(chunks_dir, filename))
+        
+        # Write each segment to separate files
+        print(f"📝 Writing {len(transcript)} segments...")
+        
+        for i, entry in enumerate(transcript, 1):
+            filename = f"cc{i}.txt"
+            filepath = os.path.join(chunks_dir, filename)
+            
+            with open(filepath, "w", encoding="utf-8") as f:
+                f.write(entry.text)
+            
+            # Show progress for every 10th file or when the text is longer than 50 characters
+            if i % 10 == 0 or len(entry.text) > 50:
+                preview = entry.text[:50] + "..." if len(entry.text) > 50 else entry.text
+                print(f"  📄 {filename}: {preview}")
+        
+        # Create complete transcript file with YouTube ID in filename
+        complete_filename = f"{video_id}_complete_transcript.txt"
+        complete_filepath = os.path.join(output_dir, complete_filename)
+        
+        # Combine all chunks into single file
+        with open(complete_filepath, "w", encoding="utf-8") as f:
+            for i in range(1, len(transcript) + 1):
+                chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
+                if os.path.exists(chunk_file):
+                    with open(chunk_file, "r", encoding="utf-8") as chunk_f:
+                        f.write(chunk_f.read() + "\n")
+        
+        print(f"\n🎉 Success!")
+        print(f"📊 Total segments: {len(transcript)}")
+        print(f"📁 Individual files saved in: {os.path.abspath(chunks_dir)}/")
+        print(f"📄 Complete transcript saved as: {complete_filename}")
+        
+        # Create a summary file
+        summary_path = os.path.join(output_dir, "summary.txt")
+        with open(summary_path, "w", encoding="utf-8") as f:
+            f.write(f"YouTube Video ID: {video_id}\n")
+            f.write(f"Total segments: {len(transcript)}\n")
+            f.write(f"Files: chunks/cc1.txt to chunks/cc{len(transcript)}.txt\n")
+            f.write(f"Complete transcript: {complete_filename}\n")
+            f.write(f"Generated: {os.path.basename(__file__)}\n")
+        
+        print(f"📋 Summary saved to: {summary_path}")
+        return True
+        
+    except TranscriptsDisabled:
+        print("❌ Transcripts are disabled for this video")
+        return False
+    except NoTranscriptFound:
+        print("❌ No transcript found for this video")
+        return False
+    except Exception as e:
+        print(f"❌ Error: {str(e)}")
+        return False
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Download YouTube transcripts to numbered caption files",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  %(prog)s dQw4w9WgXcQ
+  %(prog)s "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+  %(prog)s dQw4w9WgXcQ --output my_captions --languages en es fr
+        """
+    )
+    
+    parser.add_argument(
+        "video",
+        help="YouTube video ID or URL"
+    )
+    
+    parser.add_argument(
+        "--output", "-o",
+        default="captions",
+        help="Output directory for caption files (default: captions)"
+    )
+    
+    parser.add_argument(
+        "--languages", "-l",
+        nargs="*",
+        default=["en"],
+        help="Preferred language codes (e.g., en es fr) - default: en"
+    )
+    
+    args = parser.parse_args()
+    
+    # Extract video ID
+    video_id = extract_video_id(args.video)
+    if not video_id:
+        print("❌ Invalid YouTube URL or video ID")
+        print("Example formats:")
+        print("  Video ID: dQw4w9WgXcQ")
+        print("  Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ")
+        print("  Short URL: https://youtu.be/dQw4w9WgXcQ")
+        sys.exit(1)
+    
+    print(f"🎬 Processing video ID: {video_id}")
+    
+    # Download transcript
+    success = download_transcript(video_id, args.output, args.languages)
+    
+    if not success:
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
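
Note on `extract_video_id` above: the video ID this commit targets, `c7bbO_KSLPI`, contains an underscore, so a purely alphanumeric check would reject it. The following is a minimal sketch of a stricter parser. It assumes that video IDs are 11 characters drawn from letters, digits, `-`, and `_`. The name `parse_video_id` and the host allow-list are hypothetical and not part of this commit.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: a video ID is 11 characters of letters, digits, '-' or '_'.
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")

# Hypothetical allow-lists; extend them if other hosts need to be accepted.
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
SHORT_HOSTS = {"youtu.be"}


def parse_video_id(url_or_id):
    """Return the video ID from a bare ID, a watch URL, or a youtu.be URL, else None."""
    if VIDEO_ID_RE.fullmatch(url_or_id):
        return url_or_id

    parsed = urlparse(url_or_id)
    host = parsed.netloc.lower()
    if host in WATCH_HOSTS:
        # Watch URLs carry the ID in the "v" query parameter.
        candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif host in SHORT_HOSTS:
        # Short URLs carry the ID as the path.
        candidate = parsed.path.lstrip("/")
    else:
        return None

    # Validate whatever was extracted so malformed URLs do not leak through.
    if candidate and VIDEO_ID_RE.fullmatch(candidate):
        return candidate
    return None
```

Validating the extracted candidate as well as the bare input means an empty or malformed `v=` parameter returns `None` rather than being passed on to the transcript request.
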
--- a/youtube/test_transcript.py
+++ b/youtube/test_transcript.py
@@ -0,0 +1,24 @@
+#!/usr/bin/env python3
+"""
+Test script to demonstrate the fixed CC text functionality
+Replace 'TEST_VIDEO_ID' with an actual YouTube video ID that has captions
+"""
+
+from enhanced_yt_transcript import download_transcript
+
+# Example video ID - replace with a real video ID that has captions
+video_id = "TEST_VIDEO_ID"  # Replace this with actual video ID
+
+print("🎬 Testing YouTube transcript downloader...")
+print(f"📹 Video ID: {video_id}")
+print("📝 This will create cc1.txt, cc2.txt, cc3.txt, etc.")
+print()
+
+# Test the function
+success = download_transcript(video_id, output_dir="test_captions")
+
+if success:
+    print("\n✅ Test completed successfully!")
+    print("Check the 'test_captions/chunks' directory for cc#.txt files")
+else:
+    print("\n❌ Test failed - make sure to use a valid video ID with captions")
--- a/youtube/transcribe_yt_video.py
+++ b/youtube/transcribe_yt_video.py
@@ -1,24 +1,57 @@
 from youtube_transcript_api import YouTubeTranscriptApi
 import os

-video_id = "VIDEO_ID_HERE"  # e.g., 'dQw4w9WgXcQ'
-transcript = YouTubeTranscriptApi.get_transcript(video_id)
+video_id = "c7bbO_KSLPI"  # Video ID from the provided URL

-# Create output directory if it doesn't exist
+# Create API instance and get transcript
+ytt_api = YouTubeTranscriptApi()
+transcript_list = ytt_api.list(video_id)
+
+# Try to get a Korean transcript (available for this video)
+try:
+    transcript = transcript_list.find_transcript(['ko']).fetch()
+    print("✅ Found Korean transcript")
+except Exception:
+    # Fall back to the auto-generated Korean transcript
+    transcript = transcript_list.find_generated_transcript(['ko']).fetch()
+    print("✅ Found Korean auto-generated transcript")
+
+# Create output directory and chunks subdirectory if they don't exist
 output_dir = "captions"
+chunks_dir = os.path.join(output_dir, "chunks")
 if not os.path.exists(output_dir):
    os.makedirs(output_dir)
+if not os.path.exists(chunks_dir):
+    os.makedirs(chunks_dir)

-# Write each caption segment to separate numbered files
+# Clear existing files in chunks directory
+for filename in os.listdir(chunks_dir):
+    if filename.startswith("cc") and filename.endswith(".txt"):
+        os.remove(os.path.join(chunks_dir, filename))
+
+# Write each caption segment to separate numbered files in chunks folder
 for i, entry in enumerate(transcript, 1):
    filename = f"cc{i}.txt"
-    filepath = os.path.join(output_dir, filename)
+    filepath = os.path.join(chunks_dir, filename)
    
    with open(filepath, "w", encoding="utf-8") as f:
-        f.write(entry['text'])
+        f.write(entry.text)
    
-    print(f"Written: {filename} - {entry['text'][:50]}...")
+    print(f"Written: {filename} - {entry.text[:50]}...")
+
+# Create complete transcript file with YouTube ID in filename
+complete_filename = f"{video_id}_complete_transcript.txt"
+complete_filepath = os.path.join(output_dir, complete_filename)
+
+# Combine all chunks into single file
+with open(complete_filepath, "w", encoding="utf-8") as f:
+    for i in range(1, len(transcript) + 1):
+        chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
+        if os.path.exists(chunk_file):
+            with open(chunk_file, "r", encoding="utf-8") as chunk_f:
+                f.write(chunk_f.read() + "\n")

 print(f"\nTotal segments: {len(transcript)}")
-print(f"Files saved in: {output_dir}/")
+print(f"Individual files saved in: {chunks_dir}/")
+print(f"Complete transcript saved as: {complete_filename}")
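
Note for consumers of the `chunks/` directory: the files are named `cc1.txt`, `cc2.txt`, ..., so a plain lexicographic sort of the directory listing places `cc10.txt` before `cc2.txt`. The following is a minimal sketch of reading the chunks back in segment order. The helper name `read_chunks` is hypothetical and not part of this commit.

```python
import os
import re

# Matches the chunk naming scheme used by both scripts: cc<N>.txt
CHUNK_RE = re.compile(r"cc(\d+)\.txt")


def read_chunks(chunks_dir):
    """Return the caption texts from cc<N>.txt files, ordered by N."""
    numbered = []
    for name in os.listdir(chunks_dir):
        match = CHUNK_RE.fullmatch(name)
        if match:
            # Keep the numeric index so the sort is numeric, not alphabetical.
            numbered.append((int(match.group(1)), name))

    texts = []
    for _, name in sorted(numbered):
        with open(os.path.join(chunks_dir, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts
```

For example, `"\n".join(read_chunks("captions/chunks"))` would rebuild a single transcript string with one segment per line, ignoring any other files in the directory.
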