feat: transcript generation
parent 467e213b2e
commit b690719ff1
29
.gitignore
vendored
Normal file
@@ -0,0 +1,29 @@
# YouTube transcript chunks
cc*.txt

# Python cache files
__pycache__/
*.py[cod]
*$py.class

# Virtual environment
venv/
env/
ENV/

# IDE files
.vscode/
.idea/
*.swp
*.swo

# OS files
.DS_Store
Thumbs.db

# Log files
*.log

# Temporary files
*.tmp
*.temp
187
youtube/README_youtube_transcripts.md
Normal file
@@ -0,0 +1,187 @@
# 🎬 YouTube Transcript to Numbered Files

Fixed scripts to download YouTube video transcripts and save each caption segment to separate numbered files (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).

## ✅ What Was Fixed

The original script wrote all captions to a single `captions.txt` file. Now it:
- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing
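
The core of the fix, one file per segment with sequential numbering, can be sketched without the transcript library. This is an illustration only: `write_segments` is a hypothetical helper, and it assumes the caption texts have already been fetched into a plain list of strings.

```python
from pathlib import Path


def write_segments(texts, output_dir="captions"):
    """Write each caption text to cc1.txt, cc2.txt, ... and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    paths = []
    for index, text in enumerate(texts, start=1):  # numbering starts at 1
        path = out / f"cc{index}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_segments(["Welcome to this tutorial", "Today we'll be learning about"])` would create `captions/cc1.txt` and `captions/cc2.txt`.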

## 📁 Available Scripts

### 1. `transcribe_yt_video.py` (Fixed Original)
The minimal fixed version of your original script.

```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ"  # Replace with your video ID
```

### 2. `enhanced_yt_transcript.py` (Recommended)
Full-featured script with command-line interface and error handling.

## 🚀 Usage

### Quick Start (Fixed Original Script)
```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```

### Advanced Usage (Enhanced Script)
```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
```

## 📊 Output Structure

After running, you'll get:
```
captions/
├── chunks/
│   ├── cc1.txt          # First caption segment
│   ├── cc2.txt          # Second caption segment
│   ├── ...
│   └── cc150.txt        # Last segment (example)
└── summary.txt          # Summary information

```

Each `chunks/cc#.txt` file contains just the text from that caption segment. The scripts also write a combined `<video_id>_complete_transcript.txt` into the output directory.
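
Because the files are numbered without zero padding, a plain alphabetical sort puts `cc10.txt` before `cc2.txt`. A minimal sketch of reading the segments back in caption order, assuming the `cc#.txt` naming described here; `read_segments` is a hypothetical helper and not part of either script:

```python
import re
from pathlib import Path

CHUNK_NAME = re.compile(r"cc(\d+)\.txt")


def read_segments(directory):
    """Return caption texts from cc#.txt files, ordered by their number."""
    numbered = []
    for path in Path(directory).iterdir():
        match = CHUNK_NAME.fullmatch(path.name)
        if match:  # skip summary.txt and anything else that is not a chunk
            numbered.append((int(match.group(1)), path))
    # Sort on the integer index so cc2.txt comes before cc10.txt.
    return [path.read_text(encoding="utf-8") for _, path in sorted(numbered)]
```

Pass whichever directory holds the `cc#.txt` files; the result is a list of strings in caption order.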

## 🔧 Features

### Fixed Original Script
- ✅ **Separate files** for each caption segment
- ✅ **Sequential numbering** (cc1.txt, cc2.txt, etc.)
- ✅ **UTF-8 encoding** for international characters
- ✅ **Progress feedback** showing what's being written

### Enhanced Script
- ✅ **Command-line interface** - no need to edit code
- ✅ **URL parsing** - accepts YouTube URLs or video IDs
- ✅ **Language selection** - prefer specific languages
- ✅ **Error handling** - graceful failures with helpful messages
- ✅ **Progress tracking** - shows processing status
- ✅ **Summary file** - metadata about the download
- ✅ **Directory cleanup** - removes old files before new download
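
The URL parsing feature can be sketched with the standard library alone. This version is an illustration rather than the code shipped in `enhanced_yt_transcript.py`: the name `parse_video_id` is hypothetical, and it assumes video IDs are exactly 11 characters drawn from letters, digits, `-` and `_`.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: an ID is exactly 11 characters from this alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def parse_video_id(url_or_id):
    """Return the video ID from a bare ID, watch URL, or youtu.be URL, else None."""
    if VIDEO_ID.fullmatch(url_or_id):
        return url_or_id
    parsed = urlparse(url_or_id)
    host = parsed.hostname or ""
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/")
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    # Reject anything that does not look like an ID after extraction.
    return candidate if VIDEO_ID.fullmatch(candidate) else None
```

Matching on the full host name instead of a substring avoids accepting look-alike domains, and allowing `_` and `-` keeps IDs such as `c7bbO_KSLPI` valid.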

## 📋 Requirements

Install the required package:
```bash
pip install youtube-transcript-api
```

## 💡 Usage Examples

### Example 1: Educational Video
```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```

### Example 2: Multi-language Content
```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```

### Example 3: Quick Processing
```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/chunks/cc1.txt, captions/chunks/cc2.txt, etc.
```

## 🔍 Output Preview

When the script runs, you'll see:
```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
 📄 cc10.txt: Never gonna give you up, never gonna let you...
 📄 cc20.txt: We've known each other for so long...
 📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Individual files saved in: /full/path/to/captions/chunks/
📋 Summary saved to: captions/summary.txt
```

## 🛠️ Troubleshooting

### Common Issues:

1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions

2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video

3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```

### Testing:
Use the test script to verify everything works:
```bash
python test_transcript.py
```
(Remember to replace `TEST_VIDEO_ID` with a real video ID)

## 📝 File Contents Example

**cc1.txt:**
```
Welcome to this tutorial
```

**cc2.txt:**
```
Today we'll be learning about
```

**cc3.txt:**
```
the basics of programming
```

**summary.txt:**
```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: chunks/cc1.txt to chunks/cc156.txt
Generated: enhanced_yt_transcript.py
```
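
Since `summary.txt` is a list of `Key: value` lines, it is easy to load back into a dictionary. A small sketch, assuming the format shown in the example; `parse_summary` is a hypothetical helper and not part of the scripts:

```python
def parse_summary(text):
    """Parse 'Key: value' lines from summary.txt into a dict of strings."""
    summary = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:  # ignore blank or malformed lines
            summary[key.strip()] = value.strip()
    return summary
```

All values stay strings, so a caller that needs the segment count would convert it with `int(summary["Total segments"])`.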

## 🎯 Perfect For:

- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually

---

**The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉
1
youtube/captions/c7bbO_KSLPI_complete_transcript.txt
Normal file
File diff suppressed because one or more lines are too long
1
youtube/captions/complete_transcript.txt
Normal file
File diff suppressed because one or more lines are too long
174
youtube/enhanced_yt_transcript.py
Executable file
@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Enhanced YouTube Transcript Downloader
Downloads YouTube video transcripts and saves each segment to separate numbered files (cc1.txt, cc2.txt, etc.)
"""

import os
import sys
import argparse
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
from urllib.parse import urlparse, parse_qs

def extract_video_id(url_or_id):
    """Extract video ID from YouTube URL or return ID if already provided"""
    if len(url_or_id) == 11 and all(c.isalnum() or c in "-_" for c in url_or_id):
        return url_or_id

    # Parse YouTube URL
    parsed_url = urlparse(url_or_id)

    if 'youtube.com' in parsed_url.netloc:
        return parse_qs(parsed_url.query).get('v', [None])[0]
    elif 'youtu.be' in parsed_url.netloc:
        return parsed_url.path[1:]

    return None

def download_transcript(video_id, output_dir="captions", language_codes=None):
    """Download transcript and save to numbered files"""

    try:
        # Get available transcripts
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to get transcript in preferred language or auto-generated
        transcript = None
        if language_codes:
            for lang in language_codes:
                try:
                    transcript = transcript_list.find_transcript([lang]).fetch()
                    print(f"✅ Found transcript in language: {lang}")
                    break
                except NoTranscriptFound:
                    continue

        if not transcript:
            # Get any available transcript
            try:
                transcript = YouTubeTranscriptApi.get_transcript(video_id)
                print("✅ Found auto-generated or default transcript")
            except NoTranscriptFound:
                print("❌ No transcript found for this video")
                return False

        # Create output directory and chunks subdirectory
        chunks_dir = os.path.join(output_dir, "chunks")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
            print(f"📁 Created directory: {output_dir}")
        if not os.path.exists(chunks_dir):
            os.makedirs(chunks_dir)
            print(f"📁 Created chunks directory: {chunks_dir}")

        # Clear existing files in chunks directory
        for filename in os.listdir(chunks_dir):
            if filename.startswith("cc") and filename.endswith(".txt"):
                os.remove(os.path.join(chunks_dir, filename))

        # Write each segment to separate files
        print(f"📝 Writing {len(transcript)} segments...")

        for i, entry in enumerate(transcript, 1):
            filename = f"cc{i}.txt"
            filepath = os.path.join(chunks_dir, filename)

            with open(filepath, "w", encoding="utf-8") as f:
                f.write(entry['text'])

            # Show progress for every 10th file or if text is interesting
            if i % 10 == 0 or len(entry['text']) > 50:
                preview = entry['text'][:50] + "..." if len(entry['text']) > 50 else entry['text']
                print(f" 📄 {filename}: {preview}")

        # Create complete transcript file with YouTube ID in filename
        complete_filename = f"{video_id}_complete_transcript.txt"
        complete_filepath = os.path.join(output_dir, complete_filename)

        # Combine all chunks into single file
        with open(complete_filepath, "w", encoding="utf-8") as f:
            for i in range(1, len(transcript) + 1):
                chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
                if os.path.exists(chunk_file):
                    with open(chunk_file, "r", encoding="utf-8") as chunk_f:
                        f.write(chunk_f.read())

        print(f"\n🎉 Success!")
        print(f"📊 Total segments: {len(transcript)}")
        print(f"📁 Individual files saved in: {os.path.abspath(chunks_dir)}/")
        print(f"📄 Complete transcript saved as: {complete_filename}")

        # Create a summary file
        summary_path = os.path.join(output_dir, "summary.txt")
        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(f"YouTube Video ID: {video_id}\n")
            f.write(f"Total segments: {len(transcript)}\n")
            f.write(f"Files: chunks/cc1.txt to chunks/cc{len(transcript)}.txt\n")
            f.write(f"Complete transcript: {complete_filename}\n")
            f.write(f"Generated: {os.path.basename(__file__)}\n")

        print(f"📋 Summary saved to: {summary_path}")
        return True

    except TranscriptsDisabled:
        print("❌ Transcripts are disabled for this video")
        return False
    except NoTranscriptFound:
        print("❌ No transcript found for this video")
        return False
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return False

def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube transcripts to numbered caption files",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s dQw4w9WgXcQ
  %(prog)s "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  %(prog)s dQw4w9WgXcQ --output my_captions --languages en es fr
        """
    )

    parser.add_argument(
        "video",
        help="YouTube video ID or URL"
    )

    parser.add_argument(
        "--output", "-o",
        default="captions",
        help="Output directory for caption files (default: captions)"
    )

    parser.add_argument(
        "--languages", "-l",
        nargs="*",
        default=["en"],
        help="Preferred language codes (e.g., en es fr) - default: en"
    )

    args = parser.parse_args()

    # Extract video ID
    video_id = extract_video_id(args.video)
    if not video_id:
        print("❌ Invalid YouTube URL or video ID")
        print("Example formats:")
        print(" Video ID: dQw4w9WgXcQ")
        print(" Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ")
        print(" Short URL: https://youtu.be/dQw4w9WgXcQ")
        sys.exit(1)

    print(f"🎬 Processing video ID: {video_id}")

    # Download transcript
    success = download_transcript(video_id, args.output, args.languages)

    if not success:
        sys.exit(1)

if __name__ == "__main__":
    main()
24
youtube/test_transcript.py
Normal file
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""
Test script to demonstrate the fixed CC text functionality
Replace 'TEST_VIDEO_ID' with an actual YouTube video ID that has captions
"""

from enhanced_yt_transcript import download_transcript

# Example video ID - replace with a real video ID that has captions
video_id = "TEST_VIDEO_ID"  # Replace this with actual video ID

print("🎬 Testing YouTube transcript downloader...")
print(f"📹 Video ID: {video_id}")
print("📝 This will create cc1.txt, cc2.txt, cc3.txt, etc.")
print()

# Test the function
success = download_transcript(video_id, output_dir="test_captions")

if success:
    print("\n✅ Test completed successfully!")
    print("Check the 'test_captions/chunks' directory for cc#.txt files")
else:
    print("\n❌ Test failed - make sure to use a valid video ID with captions")
@@ -1,24 +1,57 @@
 from youtube_transcript_api import YouTubeTranscriptApi
 import os

-video_id = "VIDEO_ID_HERE"  # e.g., 'dQw4w9WgXcQ'
-transcript = YouTubeTranscriptApi.get_transcript(video_id)
+video_id = "c7bbO_KSLPI"  # Video ID from the provided URL

-# Create output directory if it doesn't exist
+# Create API instance and get transcript
+ytt_api = YouTubeTranscriptApi()
+transcript_list = ytt_api.list(video_id)
+
+# Try to get transcript in Korean (available for this video)
+try:
+    transcript = transcript_list.find_transcript(['ko']).fetch()
+    print("✅ Found Korean transcript")
+except:
+    # Get any available transcript
+    transcript = transcript_list.find_generated_transcript(['ko']).fetch()
+    print("✅ Found Korean auto-generated transcript")
+
+# Create output directory and chunks subdirectory if they don't exist
 output_dir = "captions"
+chunks_dir = os.path.join(output_dir, "chunks")
 if not os.path.exists(output_dir):
     os.makedirs(output_dir)
+if not os.path.exists(chunks_dir):
+    os.makedirs(chunks_dir)

-# Write each caption segment to separate numbered files
+# Clear existing files in chunks directory
+for filename in os.listdir(chunks_dir):
+    if filename.startswith("cc") and filename.endswith(".txt"):
+        os.remove(os.path.join(chunks_dir, filename))
+
+# Write each caption segment to separate numbered files in chunks folder
 for i, entry in enumerate(transcript, 1):
     filename = f"cc{i}.txt"
-    filepath = os.path.join(output_dir, filename)
+    filepath = os.path.join(chunks_dir, filename)

     with open(filepath, "w", encoding="utf-8") as f:
-        f.write(entry['text'])
+        f.write(entry.text)

-    print(f"Written: {filename} - {entry['text'][:50]}...")
+    print(f"Written: {filename} - {entry.text[:50]}...")

+# Create complete transcript file with YouTube ID in filename
+complete_filename = f"{video_id}_complete_transcript.txt"
+complete_filepath = os.path.join(output_dir, complete_filename)
+
+# Combine all chunks into single file
+with open(complete_filepath, "w", encoding="utf-8") as f:
+    for i in range(1, len(transcript) + 1):
+        chunk_file = os.path.join(chunks_dir, f"cc{i}.txt")
+        if os.path.exists(chunk_file):
+            with open(chunk_file, "r", encoding="utf-8") as chunk_f:
+                f.write(chunk_f.read())
+
 print(f"\nTotal segments: {len(transcript)}")
-print(f"Files saved in: {output_dir}/")
+print(f"Individual files saved in: {chunks_dir}/")
+print(f"Complete transcript saved as: {complete_filename}")