# 🎬 YouTube Transcript to Numbered Files

Fixed scripts to download YouTube video transcripts and save each caption segment to separate numbered files (`cc1.txt`, `cc2.txt`, `cc3.txt`, etc.).

## ✅ What Was Fixed

The original script wrote all captions to a single `captions.txt` file. Now it:

- **Creates separate files** for each caption segment
- **Numbers files sequentially**: `cc1.txt`, `cc2.txt`, `cc3.txt`, etc.
- **Organizes output** in a dedicated directory
- **Handles errors** gracefully
- **Shows progress** during processing

## 📁 Available Scripts

### 1. `transcribe_yt_video.py` (Fixed Original)

The minimal fixed version of your original script.

```python
# Just change the video ID and run
video_id = "dQw4w9WgXcQ"  # Replace with your video ID
```

### 2. `enhanced_yt_transcript.py` (Recommended)

Full-featured script with command-line interface and error handling.

## 🚀 Usage

### Quick Start (Fixed Original Script)

```bash
# Edit the video_id in the script, then run:
python transcribe_yt_video.py
```

### Advanced Usage (Enhanced Script)

```bash
# Using video ID
python enhanced_yt_transcript.py dQw4w9WgXcQ

# Using full YouTube URL
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output directory
python enhanced_yt_transcript.py dQw4w9WgXcQ --output my_captions

# Specify preferred languages
python enhanced_yt_transcript.py dQw4w9WgXcQ --languages en es fr

# Get help
python enhanced_yt_transcript.py --help
```

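
The source of `enhanced_yt_transcript.py` is not reproduced in this README. As a rough sketch only, assuming the script is built on `argparse`, the options used in the commands above could be declared as follows. The function name `build_parser` and both default values are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Save each YouTube caption segment to a numbered file."
    )
    # Positional argument: either a bare video ID or a full YouTube URL.
    parser.add_argument("video", help="YouTube video ID or full URL")
    # Assumed default: the "captions" directory used elsewhere in this README.
    parser.add_argument("--output", default="captions", help="output directory")
    # nargs="+" collects one or more language codes into a list.
    # The ["en"] default is an assumption, not taken from the real script.
    parser.add_argument(
        "--languages", nargs="+", default=["en"],
        help="preferred languages, in priority order",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.video, args.output, args.languages)
```

With a parser shaped like this, the video ID or URL should come before `--languages`. Because `nargs="+"` consumes every following value, an ID placed after the language codes would be read as one more language.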
## 📊 Output Structure

After running, you'll get:

```
captions/
├── cc1.txt      # First caption segment
├── cc2.txt      # Second caption segment
├── cc3.txt      # Third caption segment
├── ...
├── cc150.txt    # Last segment (example)
└── summary.txt  # Summary information
```

Each `cc#.txt` file contains just the text from that caption segment.

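
The numbered-file layout can be produced with a short loop. The sketch below is not the scripts' actual code: `write_segments` is a hypothetical helper, and it assumes each segment is a dict with a `"text"` key, as returned by older releases of `youtube-transcript-api` (newer releases may expose the text differently). The sample data stands in for a downloaded transcript.

```python
import tempfile
from pathlib import Path


def write_segments(segments, output_dir="captions"):
    """Write one caption segment per file: cc1.txt, cc2.txt, ..."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # start=1 so the first file is cc1.txt rather than cc0.txt.
    for index, segment in enumerate(segments, start=1):
        path = out / f"cc{index}.txt"
        # UTF-8 keeps non-ASCII caption text intact.
        path.write_text(segment["text"], encoding="utf-8")
        written.append(path)
    return written


# Hypothetical sample data; a real run would use the fetched transcript.
sample = [
    {"text": "Welcome to this tutorial"},
    {"text": "Today we'll be learning about"},
    {"text": "the basics of programming"},
]

with tempfile.TemporaryDirectory() as tmp:
    files = write_segments(sample, tmp)
    print([p.name for p in files])  # ['cc1.txt', 'cc2.txt', 'cc3.txt']
```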
## 🔧 Features

### Fixed Original Script

- ✅ **Separate files** for each caption segment
- ✅ **Sequential numbering** (`cc1.txt`, `cc2.txt`, etc.)
- ✅ **UTF-8 encoding** for international characters
- ✅ **Progress feedback** showing what's being written

### Enhanced Script

- ✅ **Command-line interface** - no need to edit code
- ✅ **URL parsing** - accepts YouTube URLs or video IDs
- ✅ **Language selection** - prefer specific languages
- ✅ **Error handling** - graceful failures with helpful messages
- ✅ **Progress tracking** - shows processing status
- ✅ **Summary file** - metadata about the download
- ✅ **Directory cleanup** - removes old files before new download

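
To illustrate the URL parsing feature, here is one way an input that may be either a URL or a bare ID could be normalized. This is a sketch under assumptions, not the enhanced script's real logic: `extract_video_id` is a hypothetical name, and support for `youtu.be` short links goes beyond what this README states.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(value: str) -> str:
    """Return the video ID from a watch URL, a youtu.be link, or a bare ID."""
    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    # Short links carry the ID in the path: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/")
    # Watch URLs carry the ID in the "v" query parameter.
    if host == "youtube.com" or host.endswith(".youtube.com"):
        query = parse_qs(parsed.query)
        if "v" in query:
            return query["v"][0]
    # Anything else is assumed to already be a bare video ID.
    return value


print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(extract_video_id("dQw4w9WgXcQ"))  # dQw4w9WgXcQ
```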
## 📋 Requirements

Install the required package:

```bash
pip install youtube-transcript-api
```

## 💡 Usage Examples

### Example 1: Educational Video

```bash
python enhanced_yt_transcript.py "https://www.youtube.com/watch?v=VIDEO_ID" --output lecture_notes
```

### Example 2: Multi-language Content

```bash
python enhanced_yt_transcript.py VIDEO_ID --languages en es --output multilang_captions
```

### Example 3: Quick Processing

```bash
python enhanced_yt_transcript.py VIDEO_ID
# Creates captions/cc1.txt, captions/cc2.txt, etc.
```

## 🔍 Output Preview

When the script runs, you'll see:

```
🎬 Processing video ID: dQw4w9WgXcQ
✅ Found auto-generated or default transcript
📁 Created directory: captions
📝 Writing 156 segments...
📄 cc10.txt: Never gonna give you up, never gonna let you...
📄 cc20.txt: We've known each other for so long...
📄 cc30.txt: Your heart's been aching but you're too shy...

🎉 Success!
📊 Total segments: 156
📁 Files saved in: /full/path/to/captions/
📋 Summary saved to: captions/summary.txt
```

## 🛠️ Troubleshooting

### Common Issues:

1. **"No transcript found"**
   - Video might not have captions/transcripts
   - Try a different video with confirmed captions

2. **"Transcripts are disabled"**
   - Video owner disabled transcripts
   - Try a different video

3. **Module not found**
   ```bash
   pip install youtube-transcript-api
   ```

### Testing:

Use the test script to verify everything works:

```bash
python test_transcript.py
```

(Remember to replace `TEST_VIDEO_ID` with a real video ID)

## 📝 File Contents Example

**cc1.txt:**

```
Welcome to this tutorial
```

**cc2.txt:**

```
Today we'll be learning about
```

**cc3.txt:**

```
the basics of programming
```

**summary.txt:**

```
YouTube Video ID: dQw4w9WgXcQ
Total segments: 156
Files: cc1.txt to cc156.txt
Generated: enhanced_yt_transcript.py
```

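
How the enhanced script builds `summary.txt` is not shown in this README. A minimal sketch that reproduces the four lines of the example above, with `write_summary` as a hypothetical helper name and the generator name passed in as a plain string:

```python
import tempfile
from pathlib import Path


def write_summary(output_dir, video_id, total_segments,
                  generator="enhanced_yt_transcript.py"):
    """Write summary.txt in the four-line layout shown above."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = [
        f"YouTube Video ID: {video_id}",
        f"Total segments: {total_segments}",
        f"Files: cc1.txt to cc{total_segments}.txt",
        f"Generated: {generator}",
    ]
    path = out / "summary.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    summary = write_summary(tmp, "dQw4w9WgXcQ", 156)
    print(summary.read_text(encoding="utf-8"))
```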
## 🎯 Perfect For:

- **Content analysis** - process each caption separately
- **AI training data** - individual text segments
- **Research projects** - granular transcript analysis
- **Content creation** - extract specific quotes/segments
- **Translation work** - process segments individually

---

**The script is now fixed to write each caption segment to separate `cc#.txt` files as requested!** 🎉