feat: transcript generation
parent 467e213b2e
commit b690719ff1
.gitignore (29 lines changed, vendored, Normal file)
@@ -0,0 +1,29 @@
# YouTube transcript chunks
cc*.txt

# Python cache files
__pycache__/
*.py[cod]
*$py.class

# Virtual environment
venv/
env/
ENV/

# IDE files
.vscode/
.idea/
*.swp
*.swo

# OS files
.DS_Store
Thumbs.db

# Log files
*.log

# Temporary files
*.tmp
*.temp
youtube/README_youtube_transcripts.md (187 lines changed, Normal file)
@@ -0,0 +1,187 @@
# 🎬 YouTube Transcript to Numbered Files

Fixed scripts to download YouTube video transcripts and save each caption segment to separate numbered files (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).

## ✅ What Was Fixed

The original script wrote all captions to a single `captions.txt` file. Now it:
- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing

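As a rough illustration of the change listed above, the sketch below writes a list of caption strings to sequentially numbered files. `write_segments` and its parameters are hypothetical names used only for this example; the scripts in this repository first fetch the caption text with `youtube-transcript-api`.

```python
import os


def write_segments(texts, output_dir):
    """Write each caption string to cc1.txt, cc2.txt, ... and return the paths."""
    # Hypothetical helper: takes plain strings instead of fetched transcript entries.
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, 1):  # numbering starts at 1, as in the scripts
        path = os.path.join(output_dir, f"cc{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

Calling `write_segments(["first", "second"], "captions")` would create `captions/cc1.txt` and `captions/cc2.txt`.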
## 📁 Available Scripts

### 1. `transcribe_yt_video.py` (Fixed Original)
The minimal fixed version of your original script.

```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ" # Replace with your video ID
```

### 2. `enhanced_yt_transcript.py` (Recommended)
Full-featured script with command-line interface and error handling.

## 🚀 Usage

### Quick Start (Fixed Original Script)
```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```

### Advanced Usage (Enhanced Script)
```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
```

## 📊 Output Structure

After running, you'll get:
```
captions/
├── chunks/
│   ├── cc1.txt      # First caption segment
│   ├── cc2.txt      # Second caption segment
│   ├── cc3.txt      # Third caption segment
│   ├── ...
│   └── cc150.txt    # Last segment (example)
├── VIDEO_ID_complete_transcript.txt  # All segments combined
└── summary.txt      # Summary information
```

Each `cc#.txt` file contains just the text from that caption segment.

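When reading the segments back, note that a plain alphabetical sort places `cc10.txt` before `cc2.txt`. The sketch below loads the files in numeric order instead. `read_chunks_in_order` is a hypothetical helper, not part of the scripts in this repository.

```python
import os
import re


def read_chunks_in_order(chunks_dir):
    """Return the text of every cc<N>.txt in chunks_dir, ordered by N."""
    pattern = re.compile(r"^cc(\d+)\.txt$")
    numbered = []
    for name in os.listdir(chunks_dir):
        match = pattern.match(name)
        if match:  # skips summary.txt and any other file
            numbered.append((int(match.group(1)), name))
    texts = []
    for _, name in sorted(numbered):  # sort by the integer, not the string
        with open(os.path.join(chunks_dir, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts
```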
## 🔧 Features

### Fixed Original Script
- ✅ **Separate files** for each caption segment
- ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
- ✅ **UTF-8 encoding** for international characters
- ✅ **Progress feedback** showing what's being written

### Enhanced Script
- ✅ **Command-line interface** - no need to edit code
- ✅ **URL parsing** - accepts YouTube URLs or video IDs
- ✅ **Language selection** - prefer specific languages
- ✅ **Error handling** - graceful failures with helpful messages
- ✅ **Progress tracking** - shows processing status
- ✅ **Summary file** - metadata about the download
- ✅ **Directory cleanup** - removes old files before new download

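The URL parsing feature can be sketched roughly as follows. This is an illustrative version rather than the script's exact code: `parse_video_id` is a hypothetical name, and the pattern assumes video IDs are 11 characters drawn from letters, digits, `-` and `_`.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: video IDs are 11 characters from letters, digits, '-' and '_'.
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def parse_video_id(url_or_id):
    """Return the video ID from a bare ID, watch URL, or youtu.be URL, else None."""
    if _ID_RE.match(url_or_id):
        return url_or_id
    parsed = urlparse(url_or_id)
    host = parsed.netloc.lower()
    if host.endswith("youtube.com"):
        # Watch URLs carry the ID in the "v" query parameter.
        candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif host.endswith("youtu.be"):
        # Short URLs carry the ID as the path.
        candidate = parsed.path.lstrip("/")
    else:
        return None
    return candidate if candidate and _ID_RE.match(candidate) else None
```

Validating the extracted candidate against the same pattern means a malformed URL yields `None` instead of a string that would fail later during the download.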
## 📋 Requirements

Install the required package:
```bash
pip install youtube-transcript-api
```

## 💡 Usage Examples

### Example 1: Educational Video
```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```

### Example 2: Multi-language Content
```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```

### Example 3: Quick Processing
```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/chunks/cc1.txt, captions/chunks/cc2.txt, etc.
```

## 🔍 Output Preview

When the script runs, you'll see:
```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📁 Created chunks directory: captions/chunks
📝 Writing 156 segments...
 📄 cc10.txt: Never gonna give you up, never gonna let you...
 📄 cc20.txt: We've known each other for so long...
 📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Individual files saved in: /full/path/to/captions/chunks/
📄 Complete transcript saved as: dQw4w9WgXcQ_complete_transcript.txt
📋 Summary saved to: captions/summary.txt
```

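The enhanced script prints a progress line for every tenth segment, or for any segment longer than 50 characters, and shortens long text to a 50-character preview. A minimal sketch of those two rules follows; `preview` and `should_report` are hypothetical names introduced here for illustration.

```python
def preview(text, limit=50):
    """Shorten text to `limit` characters, adding '...' only when it was cut."""
    return text[:limit] + "..." if len(text) > limit else text


def should_report(index, text, every=10, long_text=50):
    """Report every `every`-th segment, or any segment longer than `long_text`."""
    return index % every == 0 or len(text) > long_text
```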
## 🛠️ Troubleshooting

### Common Issues:

1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions

2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video

3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```

### Testing:
Use the test script to verify everything works:
```bash
python test_transcript.py
```
(Remember to replace `TEST_VIDEO_ID` with a real video ID)

## 📝 File Contents Example

**cc1.txt:**
```
Welcome to this tutorial
```

**cc2.txt:**
```
Today we'll be learning about
```

**cc3.txt:**
```
the basics of programming
```

**summary.txt:**
```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: chunks/cc1.txt to chunks/cc156.txt
Complete transcript: dQw4w9WgXcQ_complete_transcript.txt
Generated: enhanced_yt_transcript.py
```

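Because `summary.txt` is a list of `Key: value` lines, it is easy to read back in other tooling. The sketch below is one possible way to do that; `parse_summary` is a hypothetical helper and not part of the scripts.

```python
def parse_summary(text):
    """Parse 'Key: value' lines into a dict, splitting on the first ': ' only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # ignore blank lines or lines without a separator
            fields[key.strip()] = value.strip()
    return fields
```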
## 🎯 Perfect For:

- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually

---

**The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉
youtube/captions/c7bbO_KSLPI_complete_transcript.txt (1 line changed, Normal file)
File diff suppressed because one or more lines are too long
youtube/captions/complete_transcript.txt (1 line changed, Normal file)
File diff suppressed because one or more lines are too long
youtube/enhanced_yt_transcript.py (174 lines changed, Executable file)
@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Enhanced YouTube Transcript Downloader
Downloads YouTube video transcripts and saves each segment to separate numbered files (cc1.txt, cc2.txt, etc.)
"""

import os
import sys
import argparse
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
from urllib.parse import urlparse, parse_qs

def extract_video_id(url_or_id):
    """Extract video ID from YouTube URL or return ID if already provided"""
    # Video IDs may contain '-' and '_' (e.g. c7bbO_KSLPI), which isalnum() alone rejects
    if len(url_or_id) == 11 and all(c.isalnum() or c in "-_" for c in url_or_id):
        return url_or_id

    # Parse YouTube URL
    parsed_url = urlparse(url_or_id)

    if 'youtube.com' in parsed_url.netloc:
        return parse_qs(parsed_url.query).get('v', [None])[0]
    elif 'youtu.be' in parsed_url.netloc:
        return parsed_url.path[1:]

    return None

def download_transcript(video_id, output_dir="captions", language_codes=None):
    """Download transcript and save to numbered files"""

    try:
        # Get available transcripts
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to get transcript in preferred language or auto-generated
        transcript = None
        if language_codes:
            for lang in language_codes:
                try:
                    transcript = transcript_list.find_transcript([lang]).fetch()
                    print(f"✅ Found transcript in language: {lang}")
                    break
                except NoTranscriptFound:
                    continue

        if not transcript:
            # Get any available transcript
            try:
                transcript = YouTubeTranscriptApi.get_transcript(video_id)
                print("✅ Found auto-generated or default transcript")
            except NoTranscriptFound:
                print("❌ No transcript found for this video")
                return False

        # Create output directory and chunks subdirectory
        chunks_dir = os.path.join(output_dir, "chunks")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
            print(f"📁 Created directory: {output_dir}")
        if not os.path.exists(chunks_dir):
            os.makedirs(chunks_dir)
            print(f"📁 Created chunks directory: {chunks_dir}")

        # Clear existing files in chunks directory
        for filename in os.listdir(chunks_dir):
            if filename.startswith("cc") and filename.endswith(".txt"):
                os.remove(os.path.join(chunks_dir, filename))

        # Write each segment to separate files
        print(f"📝 Writing {len(transcript)} segments...")

        for i, entry in enumerate(transcript, 1):
            filename = f"cc{i}.txt"
            filepath = os.path.join(chunks_dir, filename)

            with open(filepath, "w", encoding="utf-8") as f:
                f.write(entry['text'])

            # Show progress for every 10th file or if text is interesting
            if i % 10 == 0 or len(entry['text']) > 50:
                preview = entry['text'][:50] + "..." if len(entry['text']) > 50 else entry['text']
                print(f" 📄 {filename}: {preview}")

        # Create complete transcript file with YouTube ID in filename
        complete_filename = f"{video_id}_complete_transcript.txt"
        complete_filepath = os.path.join(output_dir, complete_filename)

        # Combine all chunks into single file
        with open(complete_filepath, "w", encoding="utf-8") as f:
            for i in range(1, len(transcript) + 1):
                chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
                if os.path.exists(chunk_file):
                    with open(chunk_file, "r", encoding="utf-8") as chunk_f:
                        f.write(chunk_f.read())

        print(f"\n🎉 Success!")
        print(f"📊 Total segments: {len(transcript)}")
        print(f"📁 Individual files saved in: {os.path.abspath(chunks_dir)}/")
        print(f"📄 Complete transcript saved as: {complete_filename}")

        # Create a summary file
        summary_path = os.path.join(output_dir, "summary.txt")
        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(f"YouTube Video ID: {video_id}\n")
            f.write(f"Total segments: {len(transcript)}\n")
            f.write(f"Files: chunks/cc1.txt to chunks/cc{len(transcript)}.txt\n")
            f.write(f"Complete transcript: {complete_filename}\n")
            f.write(f"Generated: {os.path.basename(__file__)}\n")

        print(f"📋 Summary saved to: {summary_path}")
        return True

    except TranscriptsDisabled:
        print("❌ Transcripts are disabled for this video")
        return False
    except NoTranscriptFound:
        print("❌ No transcript found for this video")
        return False
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return False

def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube transcripts to numbered caption files",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s dQw4w9WgXcQ
  %(prog)s "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  %(prog)s dQw4w9WgXcQ --output my_captions --languages en es fr
"""
    )

    parser.add_argument(
        "video",
        help="YouTube video ID or URL"
    )

    parser.add_argument(
        "--output", "-o",
        default="captions",
        help="Output directory for caption files (default: captions)"
    )

    parser.add_argument(
        "--languages", "-l",
        nargs="*",
        default=["en"],
        help="Preferred language codes (e.g., en es fr) - default: en"
    )

    args = parser.parse_args()

    # Extract video ID
    video_id = extract_video_id(args.video)
    if not video_id:
        print("❌ Invalid YouTube URL or video ID")
        print("Example formats:")
        print(" Video ID: dQw4w9WgXcQ")
        print(" Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ")
        print(" Short URL: https://youtu.be/dQw4w9WgXcQ")
        sys.exit(1)

    print(f"🎬 Processing video ID: {video_id}")

    # Download transcript
    success = download_transcript(video_id, args.output, args.languages)

    if not success:
        sys.exit(1)

if __name__ == "__main__":
    main()
youtube/test_transcript.py (24 lines changed, Normal file)
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""
Test script to demonstrate the fixed CC text functionality
Replace 'TEST_VIDEO_ID' with an actual YouTube video ID that has captions
"""

from enhanced_yt_transcript import download_transcript

# Example video ID - replace with a real video ID that has captions
video_id = "TEST_VIDEO_ID" # Replace this with actual video ID

print("🎬 Testing YouTube transcript downloader...")
print(f"📹 Video ID: {video_id}")
print("📝 This will create cc1.txt, cc2.txt, cc3.txt, etc.")
print()

# Test the function
success = download_transcript(video_id, output_dir="test_captions")

if success:
    print("\n✅ Test completed successfully!")
    print("Check the 'test_captions' directory for cc#.txt files")
else:
    print("\n❌ Test failed - make sure to use a valid video ID with captions")
@@ -1,24 +1,57 @@
 from youtube_transcript_api import YouTubeTranscriptApi
 import os

-video_id = "VIDEO_ID_HERE" # e.g., 'dQw4w9WgXcQ'
-transcript = YouTubeTranscriptApi.get_transcript(video_id)
+video_id = "c7bbO_KSLPI" # Video ID from the provided URL

-# Create output directory if it doesn't exist
+# Create API instance and get transcript
+ytt_api = YouTubeTranscriptApi()
+transcript_list = ytt_api.list(video_id)
+
+# Try to get transcript in Korean (available for this video)
+try:
+    transcript = transcript_list.find_transcript(['ko']).fetch()
+    print("✅ Found Korean transcript")
+except:
+    # Get any available transcript
+    transcript = transcript_list.find_generated_transcript(['ko']).fetch()
+    print("✅ Found Korean auto-generated transcript")
+
+# Create output directory and chunks subdirectory if they don't exist
 output_dir = "captions"
+chunks_dir = os.path.join(output_dir, "chunks")
 if not os.path.exists(output_dir):
     os.makedirs(output_dir)
+if not os.path.exists(chunks_dir):
+    os.makedirs(chunks_dir)

-# Write each caption segment to separate numbered files
+# Clear existing files in chunks directory
+for filename in os.listdir(chunks_dir):
+    if filename.startswith("cc") and filename.endswith(".txt"):
+        os.remove(os.path.join(chunks_dir, filename))
+
+# Write each caption segment to separate numbered files in chunks folder
 for i, entry in enumerate(transcript, 1):
     filename = f"cc{i}.txt"
-    filepath = os.path.join(output_dir, filename)
+    filepath = os.path.join(chunks_dir, filename)

     with open(filepath, "w", encoding="utf-8") as f:
-        f.write(entry['text'])
+        f.write(entry.text)

-    print(f"Written: {filename} - {entry['text'][:50]}...")
+    print(f"Written: {filename} - {entry.text[:50]}...")

+# Create complete transcript file with YouTube ID in filename
+complete_filename = f"{video_id}_complete_transcript.txt"
+complete_filepath = os.path.join(output_dir, complete_filename)
+
+# Combine all chunks into single file
+with open(complete_filepath, "w", encoding="utf-8") as f:
+    for i in range(1, len(transcript) + 1):
+        chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
+        if os.path.exists(chunk_file):
+            with open(chunk_file, "r", encoding="utf-8") as chunk_f:
+                f.write(chunk_f.read())
+
 print(f"\nTotal segments: {len(transcript)}")
-print(f"Files saved in: {output_dir}/")
+print(f"Individual files saved in: {chunks_dir}/")
+print(f"Complete transcript saved as: {complete_filename}")