YouTubeDownloader

class scikitplot.corpus.YouTubeDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', mode='transcript', language='en', include_auto_generated=True)

YouTube content downloader.

Downloads a transcript, audio track, or video from a single YouTube video URL. The mode parameter selects what is fetched.

Parameters:

input_url : str

YouTube video URL. Accepted forms:

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/shorts/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
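
The four accepted forms above all carry the same VIDEO_ID in different places. As a minimal sketch of how such URLs could be normalised, assuming only the standard library: the helper name extract_video_id is hypothetical and is not part of scikitplot.corpus, and the class's actual parsing logic may differ.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Hypothetical helper: return VIDEO_ID from a single-video YouTube URL."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]

    # https://youtu.be/VIDEO_ID
    if host == "youtu.be" and segments:
        return segments[0]

    if host in ("www.youtube.com", "youtube.com"):
        # https://www.youtube.com/watch?v=VIDEO_ID
        if parts.path == "/watch":
            ids = parse_qs(parts.query).get("v", [])
            if ids:
                return ids[0]
        # https://www.youtube.com/shorts/VIDEO_ID and /embed/VIDEO_ID
        if len(segments) == 2 and segments[0] in ("shorts", "embed"):
            return segments[1]

    # Channels, playlists, and non-YouTube hosts end up here.
    raise ValueError(f"Not a recognised YouTube single-video URL: {url!r}")
```

A playlist URL such as https://www.youtube.com/playlist?list=... falls through to the ValueError branch in this sketch, which mirrors the documented behaviour for unsupported URLs.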

mode : {"transcript", "audio", "video"}, optional

What to download. Default: "transcript".

language : str, optional

BCP-47 language code for transcript fetching (e.g. "en", "fr", "de"). Falls back to auto-generated captions when the requested language is not available. Only used for mode="transcript". Default: "en".

include_auto_generated : bool, optional

When True, include auto-generated transcripts as a fallback when no human-reviewed captions exist. Default: True.

output_path : pathlib.Path or None, optional

Directory for the downloaded file. Default: None (a temporary directory is used).

timeout : float, optional

HTTP timeout in seconds (transcript fetch and yt-dlp). Default: 30.0.

max_bytes : int, optional

Download size cap in bytes (audio/video modes only; transcripts are always small). Default: 104857600 (100 MiB).

Raises:

ValueError: If the URL is not a recognised YouTube single-video URL.
ValueError: If mode is not one of "transcript", "audio", "video".

Parameters:

input_url (str)
output_path (Path | None)
timeout (float)
max_bytes (int)
verify_ssl (bool)
block_private_ips (bool)
max_redirects (int)
user_agent (str)
mode (Literal['transcript', 'audio', 'video'])
language (str)
include_auto_generated (bool)

Notes

Transcript mode uses youtube-transcript-api (pip-installable, lightweight, no browser). It writes a plain .txt file with one caption segment per line.
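
The one-segment-per-line layout can be illustrated with a small sketch. It assumes each caption segment is a dict with a "text" key, and the helper write_transcript and the VIDEO_ID.txt file name are hypothetical; the source does not state how the output file is named or how segments are represented internally.

```python
from pathlib import Path
from typing import Dict, Iterable


def write_transcript(segments: Iterable[Dict[str, str]], out_dir: Path, video_id: str) -> Path:
    """Hypothetical helper: write caption segments to a .txt file, one per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{video_id}.txt"  # file name is an assumption
    # Collapse newlines inside a segment so each segment stays on a single line.
    lines = [" ".join(seg["text"].split()) for seg in segments]
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_file
```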

Audio/video modes require yt-dlp (pip install yt-dlp). They invoke yt-dlp programmatically via its Python API.

Channels and playlists are not supported; pass a single video URL.

SSRF prevention is always applied in audio/video modes. Network calls made by yt-dlp go to the YouTube CDN, which is public, so the check is a defence-in-depth measure. Transcript mode makes its own HTTP calls, which are also validated.
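
The source does not describe how the SSRF check is implemented. As a rough sketch of the general technique, assuming the check operates on an already-resolved IP address, the standard-library ipaddress module can classify addresses that should not be reached. The function name is_blocked_address is hypothetical.

```python
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """Hypothetical check: True if the address is not a public, routable one."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16
        or ip.is_loopback    # 127.0.0.0/8, ::1
        or ip.is_link_local  # 169.254.0.0/16 (cloud metadata endpoints live here)
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified  # 0.0.0.0, ::
    )
```

A real check would also need to resolve host names and re-validate after each redirect; this sketch covers only the address classification step.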

Examples

Transcript (default):

>>> dl = YouTubeDownloader("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> result = dl.download()
>>> result.suffix
'.txt'

Audio download:

>>> dl = YouTubeDownloader(
...     "https://youtu.be/rwPISgZcYIk",
...     mode="audio",
... )
>>> result = dl.download()
>>> result.suffix in (".mp3", ".m4a", ".webm")
True

block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type: None
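
The ownership rule described for cleanup() (remove only a directory the instance created itself, and tolerate repeated calls) can be sketched as follows. The class TempDirOwner and its attribute names are hypothetical and are not taken from the library.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempDirOwner:
    """Hypothetical sketch of instance-owned temp dir handling."""

    def __init__(self, output_path: Optional[Path] = None) -> None:
        self.output_path = output_path
        self._tmp_dir: Optional[Path] = None  # set only if we create a temp dir

    def target_dir(self) -> Path:
        # A caller-supplied directory is used as-is and never deleted by cleanup().
        if self.output_path is not None:
            return self.output_path
        if self._tmp_dir is None:
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="yt_dl_"))
        return self._tmp_dir

    def cleanup(self) -> None:
        # No-op when nothing is owned, so repeated calls are safe.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None
```

Calling cleanup() in a try/finally block around download() is one way to make sure a temporary directory does not outlive the work that needed it.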

download()

Download the requested content and return a DownloadResult.

Dispatches to _download_transcript, _download_audio, or _download_video based on self.mode.

Returns:

DownloadResult: Populated result with the local file path, extension, and source URL.

Raises:

ImportError: If youtube-transcript-api (transcript mode) or yt-dlp (audio/video modes) is not installed.
ValueError: If the transcript is not available for the given video/language.

Return type:

DownloadResult
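
The mode validation at construction time and the dispatch performed by download() can be pictured with a small stand-in. This is a sketch under assumptions: the class ModeDispatchSketch is hypothetical, and its handlers return labels instead of performing any download or building a DownloadResult.

```python
from typing import Callable, Dict

VALID_MODES = ("transcript", "audio", "video")


class ModeDispatchSketch:
    """Hypothetical stand-in for the downloader; handlers only return labels."""

    def __init__(self, mode: str = "transcript") -> None:
        # Reject unknown modes early, as the documented ValueError suggests.
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        self.mode = mode

    def download(self) -> str:
        # Map each mode to its handler and call the one selected by self.mode.
        handlers: Dict[str, Callable[[], str]] = {
            "transcript": self._download_transcript,
            "audio": self._download_audio,
            "video": self._download_video,
        }
        return handlers[self.mode]()

    def _download_transcript(self) -> str:
        return "transcript handler called"

    def _download_audio(self) -> str:
        return "audio handler called"

    def _download_video(self) -> str:
        return "video handler called"
```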

include_auto_generated: bool = True

input_url: str

language: str = 'en'

max_bytes: int = 104857600

max_redirects: int = 5

mode: Literal['transcript', 'audio', 'video'] = 'transcript'

output_path: Path | None = None

timeout: float = 30.0

user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'

verify_ssl: bool = True