feat: transcript generation

Saravana Dhandapani 2025-09-16 21:38:17 -07:00
parent 467e213b2e
commit b690719ff1
7 changed files with 457 additions and 8 deletions

29
.gitignore vendored Normal file

@ -0,0 +1,29 @@
# YouTube transcript chunks
cc*.txt
# Python cache files
__pycache__/
*.py[cod]
*$py.class
# Virtual environment
venv/
env/
ENV/
# IDE files
.vscode/
.idea/
*.swp
*.swo
# OS files
.DS_Store
Thumbs.db
# Log files
*.log
# Temporary files
*.tmp
*.temp


@ -0,0 +1,187 @@
# 🎬 YouTube Transcript to Numbered Files
Fixed scripts to download YouTube video transcripts and save each caption segment to separate numbered files (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).
## ✅ What Was Fixed
The original script wrote all captions to a single `captions.txt` file. Now it:
- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing
## 📁 Available Scripts
### 1. `transcribe_yt_video.py` (Fixed Original)
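The minimal fixed version of your original script. The video ID and the transcript language code (currently `ko`) are hard-coded in the file, so adjust both for your video.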
```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ" # Replace with your video ID
```
### 2. `enhanced_yt_transcript.py` (Recommended)
Full-featured script with command-line interface and error handling.
## 🚀 Usage
### Quick Start (Fixed Original Script)
```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```
### Advanced Usage (Enhanced Script)
```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ
# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions
# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr
# Get help
python enhanced_yt_transcript.py --help
```
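Because the enhanced script accepts either a bare ID or a full URL, it has to normalize its input first. The following is a minimal sketch of that normalization, not the script's exact code: the helper name `to_video_id` is hypothetical, and it assumes video IDs are 11 characters drawn from letters, digits, `-` and `_`.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: video IDs are 11 characters drawn from letters, digits, '-' and '_'.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def to_video_id(url_or_id):
    """Return the video ID for a bare ID or a YouTube URL, or None if neither matches."""
    if _ID_PATTERN.fullmatch(url_or_id):
        return url_or_id
    parsed = urlparse(url_or_id)
    host = parsed.netloc.lower()
    if host.endswith("youtube.com"):
        # Standard watch URLs carry the ID in the "v" query parameter.
        candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif host.endswith("youtu.be"):
        # Short URLs carry the ID as the path.
        candidate = parsed.path.lstrip("/")
    else:
        return None
    if candidate and _ID_PATTERN.fullmatch(candidate):
        return candidate
    return None
```

Matching against an explicit character set rather than `str.isalnum()` matters for IDs such as `c7bbO_KSLPI`, which contain an underscore.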
## 📊 Output Structure
After running, you'll get:
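```
captions/
├── chunks/
│   ├── cc1.txt      # First caption segment
│   ├── cc2.txt      # Second caption segment
│   ├── cc3.txt      # Third caption segment
│   ├── ...
│   └── cc150.txt    # Last segment (example)
├── <video_id>_complete_transcript.txt   # All segments combined
└── summary.txt      # Summary information (enhanced script only)
```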
Each `cc#.txt` file contains just the text from that caption segment.
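When reading the segments back, a plain alphabetical sort puts `cc10.txt` before `cc2.txt`. The sketch below shows one way to load them in numeric order. The helper name `read_chunks_in_order` is hypothetical, and the directory to pass in depends on where the script wrote the files (the scripts in this commit use a `chunks` subdirectory of the output directory).

```python
import re
from pathlib import Path


def read_chunks_in_order(chunks_dir):
    """Return the text of every cc<N>.txt file in chunks_dir, ordered by N."""
    numbered = []
    for path in Path(chunks_dir).glob("cc*.txt"):
        match = re.fullmatch(r"cc(\d+)\.txt", path.name)
        if match:  # skip files that merely start with "cc"
            numbered.append((int(match.group(1)), path))
    # Sort by the integer index so cc2.txt comes before cc10.txt.
    return [path.read_text(encoding="utf-8") for _, path in sorted(numbered)]
```

For example, `read_chunks_in_order("captions/chunks")` would return a list of strings, one per segment.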
## 🔧 Features
### Fixed Original Script
- ✅ **Separate files** for each caption segment
- ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
- ✅ **UTF-8 encoding** for international characters
- ✅ **Progress feedback** showing what's being written
### Enhanced Script
- ✅ **Command-line interface** - no need to edit code
- ✅ **URL parsing** - accepts YouTube URLs or video IDs
- ✅ **Language selection** - prefer specific languages
- ✅ **Error handling** - graceful failures with helpful messages
- ✅ **Progress tracking** - shows processing status
- ✅ **Summary file** - metadata about the download
- ✅ **Directory cleanup** - removes old files before new download
## 📋 Requirements
Install the required package:
```bash
pip install youtube-transcript-api
```
## 💡 Usage Examples
### Example 1: Educational Video
```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```
### Example 2: Multi-language Content
```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```
### Example 3: Quick Processing
```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.
```
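To process several videos with the same options, the command lines above can be generated programmatically. This is a small sketch under stated assumptions: `build_command` is a hypothetical helper, it only uses the `--output` and `--languages` flags shown in this README, and it assumes `enhanced_yt_transcript.py` is in the current directory.

```python
import sys


def build_command(video, output=None, languages=None):
    """Build the argv list for one enhanced_yt_transcript.py run."""
    command = [sys.executable, "enhanced_yt_transcript.py", video]
    if output:
        command += ["--output", output]
    if languages:
        # --languages takes any number of space-separated codes.
        command += ["--languages", *languages]
    return command
```

Each returned list can be passed to `subprocess.run(...)`; giving every video its own output directory avoids one run clearing another run's chunk files.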
## 🔍 Output Preview
When the script runs, you'll see:
```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📁 Created chunks directory: captions/chunks
📝 Writing 156 segments...
📄 cc10.txt: Never gonna give you up, never gonna let you...
📄 cc20.txt: We've known each other for so long...
📄 cc30.txt: Your heart's been aching but you're too shy...
🎉 Success!
📊 Total segments: 156
📁 Individual files saved in: /full/path/to/captions/chunks/
📄 Complete transcript saved as: dQw4w9WgXcQ_complete_transcript.txt
📋 Summary saved to: captions/summary.txt
```
## 🛠️ Troubleshooting
### Common Issues:
1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions
2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video
3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```
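The "module not found" case can also be detected before running either script. A minimal sketch using only the standard library follows; `module_available` is a hypothetical helper name, and `youtube_transcript_api` is the import name the scripts use.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not module_available("youtube_transcript_api"):
    print("Missing dependency. Install it with: pip install youtube-transcript-api")
```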
### Testing:
Use the test script to verify everything works:
```bash
python test_transcript.py
```
(Remember to replace `TEST_VIDEO_ID` with a real video ID)
## 📝 File Contents Example
**cc1.txt:**
```
Welcome to this tutorial
```
**cc2.txt:**
```
Today we'll be learning about
```
**cc3.txt:**
```
the basics of programming
```
**summary.txt:**
```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: chunks/cc1.txt to chunks/cc156.txt
Complete transcript: dQw4w9WgXcQ_complete_transcript.txt
Generated: enhanced_yt_transcript.py
```
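Since `summary.txt` is a list of `Key: value` lines, downstream tools can read it back with a few lines of code. The sketch below assumes that format holds for every line; `parse_summary` is a hypothetical helper, not part of the scripts.

```python
def parse_summary(text):
    """Parse 'Key: value' lines from summary.txt into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            fields[key.strip()] = value.strip()
    return fields
```

Values are returned as strings, so a caller that needs the segment count would convert it with `int(fields["Total segments"])`.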
## 🎯 Perfect For:
- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually
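
As one illustration of per-segment content analysis, the sketch below counts word occurrences across a list of segment strings. It is only an example of what the separate files make easy: `word_counts` is a hypothetical helper, and the tokenization (lowercased runs of word characters and apostrophes) is a deliberately simple assumption.

```python
import re
from collections import Counter


def word_counts(segments):
    """Count lowercase word occurrences across a list of caption segment strings."""
    counts = Counter()
    for text in segments:
        counts.update(re.findall(r"[\w']+", text.lower()))
    return counts
```

Calling `word_counts(...).most_common(10)` on the loaded segments gives the ten most frequent words.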
---
**The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

174
youtube/enhanced_yt_transcript.py Executable file

@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Enhanced YouTube Transcript Downloader
Downloads YouTube video transcripts and saves each segment to separate numbered files (cc1.txt, cc2.txt, etc.)
"""
import argparse
import os
import re
import sys
from urllib.parse import urlparse, parse_qs

from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound


def extract_video_id(url_or_id):
    """Extract video ID from YouTube URL or return ID if already provided"""
    # IDs may contain '-' and '_' (e.g. c7bbO_KSLPI), which str.isalnum() rejects
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", url_or_id):
        return url_or_id

    # Parse YouTube URL
    parsed_url = urlparse(url_or_id)
    if 'youtube.com' in parsed_url.netloc:
        return parse_qs(parsed_url.query).get('v', [None])[0]
    elif 'youtu.be' in parsed_url.netloc:
        # An empty path means the URL carried no ID
        return parsed_url.path[1:] or None
    return None
def download_transcript(video_id, output_dir="captions", language_codes=None):
    """Download transcript and save to numbered files"""
    try:
        # Get available transcripts. This uses the instance API (the same calls as
        # transcribe_yt_video.py), assuming a release of youtube-transcript-api
        # that provides YouTubeTranscriptApi().list().
        transcript_list = YouTubeTranscriptApi().list(video_id)

        # Try to get transcript in preferred language or auto-generated
        transcript = None
        if language_codes:
            for lang in language_codes:
                try:
                    transcript = transcript_list.find_transcript([lang]).fetch()
                    print(f"✅ Found transcript in language: {lang}")
                    break
                except NoTranscriptFound:
                    continue

        if transcript is None:
            # Fall back to the first transcript the video offers, in any language
            available = next(iter(transcript_list), None)
            if available is None:
                print("❌ No transcript found for this video")
                return False
            transcript = available.fetch()
            print("✅ Found auto-generated or default transcript")

        # Create output directory and chunks subdirectory
        chunks_dir = os.path.join(output_dir, "chunks")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
            print(f"📁 Created directory: {output_dir}")
        if not os.path.exists(chunks_dir):
            os.makedirs(chunks_dir)
            print(f"📁 Created chunks directory: {chunks_dir}")

        # Clear existing files in chunks directory
        for filename in os.listdir(chunks_dir):
            if filename.startswith("cc") and filename.endswith(".txt"):
                os.remove(os.path.join(chunks_dir, filename))

        # Write each segment to separate files
        print(f"📝 Writing {len(transcript)} segments...")
        for i, entry in enumerate(transcript, 1):
            filename = f"cc{i}.txt"
            filepath = os.path.join(chunks_dir, filename)
            text = entry.text  # fetched snippets expose their text as an attribute

            with open(filepath, "w", encoding="utf-8") as f:
                f.write(text)

            # Show progress for every 10th file or if text is interesting
            if i % 10 == 0 or len(text) > 50:
                preview = text[:50] + "..." if len(text) > 50 else text
                print(f"  📄 {filename}: {preview}")

        # Create complete transcript file with YouTube ID in filename
        complete_filename = f"{video_id}_complete_transcript.txt"
        complete_filepath = os.path.join(output_dir, complete_filename)

        # Combine all chunks into single file
        with open(complete_filepath, "w", encoding="utf-8") as f:
            for i in range(1, len(transcript) + 1):
                chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
                if os.path.exists(chunk_file):
                    with open(chunk_file, "r", encoding="utf-8") as chunk_f:
                        # Newline between segments so words do not run together
                        f.write(chunk_f.read() + "\n")

        print("\n🎉 Success!")
        print(f"📊 Total segments: {len(transcript)}")
        print(f"📁 Individual files saved in: {os.path.abspath(chunks_dir)}/")
        print(f"📄 Complete transcript saved as: {complete_filename}")

        # Create a summary file
        summary_path = os.path.join(output_dir, "summary.txt")
        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(f"YouTube Video ID: {video_id}\n")
            f.write(f"Total segments: {len(transcript)}\n")
            f.write(f"Files: chunks/cc1.txt to chunks/cc{len(transcript)}.txt\n")
            f.write(f"Complete transcript: {complete_filename}\n")
            f.write(f"Generated: {os.path.basename(__file__)}\n")
        print(f"📋 Summary saved to: {summary_path}")
        return True

    except TranscriptsDisabled:
        print("❌ Transcripts are disabled for this video")
        return False
    except NoTranscriptFound:
        print("❌ No transcript found for this video")
        return False
    except Exception as e:
        print(f"❌ Error: {e}")
        return False
def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube transcripts to numbered caption files",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s dQw4w9WgXcQ
  %(prog)s "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  %(prog)s dQw4w9WgXcQ --output my_captions --languages en es fr
"""
    )
    parser.add_argument(
        "video",
        help="YouTube video ID or URL"
    )
    parser.add_argument(
        "--output", "-o",
        default="captions",
        help="Output directory for caption files (default: captions)"
    )
    parser.add_argument(
        "--languages", "-l",
        nargs="*",
        default=["en"],
        help="Preferred language codes (e.g., en es fr) - default: en"
    )
    args = parser.parse_args()

    # Extract video ID
    video_id = extract_video_id(args.video)
    if not video_id:
        print("❌ Invalid YouTube URL or video ID")
        print("Example formats:")
        print("  Video ID: dQw4w9WgXcQ")
        print("  Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ")
        print("  Short URL: https://youtu.be/dQw4w9WgXcQ")
        sys.exit(1)

    print(f"🎬 Processing video ID: {video_id}")

    # Download transcript
    success = download_transcript(video_id, args.output, args.languages)
    if not success:
        sys.exit(1)


if __name__ == "__main__":
    main()


@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""
Test script to demonstrate the fixed CC text functionality
Replace 'TEST_VIDEO_ID' with an actual YouTube video ID that has captions
"""
from enhanced_yt_transcript import download_transcript
# Example video ID - replace with a real video ID that has captions
video_id = "TEST_VIDEO_ID" # Replace this with actual video ID
print("🎬 Testing YouTube transcript downloader...")
print(f"📹 Video ID: {video_id}")
print("📝 This will create cc1.txt, cc2.txt, cc3.txt, etc.")
print()
# Test the function
success = download_transcript(video_id, output_dir="test_captions")
if success:
    print("\n✅ Test completed successfully!")
    print("Check the 'test_captions/chunks' directory for cc#.txt files")
else:
    print("\n❌ Test failed - make sure to use a valid video ID with captions")


@ -1,24 +1,57 @@
 from youtube_transcript_api import YouTubeTranscriptApi
 import os
-video_id = "VIDEO_ID_HERE"  # e.g., 'dQw4w9WgXcQ'
-transcript = YouTubeTranscriptApi.get_transcript(video_id)
+video_id = "c7bbO_KSLPI"  # Video ID from the provided URL
-# Create output directory if it doesn't exist
+# Create API instance and get transcript
+ytt_api = YouTubeTranscriptApi()
+transcript_list = ytt_api.list(video_id)
+# Try to get transcript in Korean (available for this video)
+try:
+    transcript = transcript_list.find_transcript(['ko']).fetch()
+    print("✅ Found Korean transcript")
+except:
+    # Get any available transcript
+    transcript = transcript_list.find_generated_transcript(['ko']).fetch()
+    print("✅ Found Korean auto-generated transcript")
+# Create output directory and chunks subdirectory if they don't exist
 output_dir = "captions"
+chunks_dir = os.path.join(output_dir, "chunks")
 if not os.path.exists(output_dir):
     os.makedirs(output_dir)
+if not os.path.exists(chunks_dir):
+    os.makedirs(chunks_dir)
-# Write each caption segment to separate numbered files
+# Clear existing files in chunks directory
+for filename in os.listdir(chunks_dir):
+    if filename.startswith("cc") and filename.endswith(".txt"):
+        os.remove(os.path.join(chunks_dir, filename))
+# Write each caption segment to separate numbered files in chunks folder
 for i, entry in enumerate(transcript, 1):
     filename = f"cc{i}.txt"
-    filepath = os.path.join(output_dir, filename)
+    filepath = os.path.join(chunks_dir, filename)
     with open(filepath, "w", encoding="utf-8") as f:
-        f.write(entry['text'])
+        f.write(entry.text)
-    print(f"Written: {filename} - {entry['text'][:50]}...")
+    print(f"Written: {filename} - {entry.text[:50]}...")
+# Create complete transcript file with YouTube ID in filename
+complete_filename = f"{video_id}_complete_transcript.txt"
+complete_filepath = os.path.join(output_dir, complete_filename)
+# Combine all chunks into single file
+with open(complete_filepath, "w", encoding="utf-8") as f:
+    for i in range(1, len(transcript) + 1):
+        chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
+        if os.path.exists(chunk_file):
+            with open(chunk_file, "r", encoding="utf-8") as chunk_f:
+                f.write(chunk_f.read())
 print(f"\nTotal segments: {len(transcript)}")
-print(f"Files saved in: {output_dir}/")
+print(f"Individual files saved in: {chunks_dir}/")
+print(f"Complete transcript saved as: {complete_filename}")