BaseDownloader
- class scikitplot.corpus.BaseDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)')
Abstract base class for all format-specific URL downloaders.
Mirrors the DocumentReader design: a @dataclass ABC, so all parameters are explicit and subclasses add only what they specialise.
- Parameters:
- input_url : str
Fully-qualified HTTP/HTTPS URL to download. Validated in __post_init__.
- output_path : pathlib.Path or None, optional
Directory to write the downloaded file into. If None, a fresh temporary directory is created on the first download call and owned by this instance (cleaned up on cleanup / context-manager exit). Default: None.
- timeout : float, optional
HTTP connection + read timeout in seconds. Default: 30.0.
- max_bytes : int, optional
Maximum acceptable download size in bytes. Downloads that exceed this limit are aborted and the partial file is deleted. Default: 100 * 1024 * 1024 (100 MiB).
- verify_ssl : bool, optional
Verify TLS/SSL certificates. Never set to False in production: doing so silently disables MITM protection. Default: True.
- block_private_ips : bool, optional
Resolve the hostname before connecting and refuse to connect if any resolved address is RFC 1918 private, loopback, link-local, or reserved. This is the primary SSRF defence. Default: True.
- max_redirects : int, optional
Maximum number of HTTP 3xx redirects to follow. Default: 5.
- user_agent : str, optional
Value for the User-Agent HTTP request header. Default: scikitplot corpus bot string.
- Attributes:
- _tmp_dir : pathlib.Path or None
Temporary directory created by this instance, if any. None when output_path was supplied by the caller.
See also
scikitplot.corpus._downloader._web.WebDownloader
Generic HTTP/HTTPS downloader.
scikitplot.corpus._downloader._github.GitHubDownloader
GitHub blob / raw URL downloader with automatic normalisation.
scikitplot.corpus._downloader._gdrive.GoogleDriveDownloader
Google Drive share-link downloader.
scikitplot.corpus._downloader._youtube.YouTubeDownloader
YouTube transcript downloader.
scikitplot.corpus._downloader._downloader.AnyDownloader
Auto-dispatching downloader that routes to the correct specialist.
scikitplot.corpus._downloader._downloader.CustomDownloader
User-supplied callable as a downloader.
Notes
Subclassing contract:
- Decorate the subclass with @dataclass.
- Call super().__post_init__() explicitly (or rely on the MRO if using cooperative multiple inheritance).
- Override download and call self._resolve_dest_dir() to obtain the write destination before streaming bytes to disk.
- Never log credentials (tokens, passwords) at any log level.
Security checklist enforced in __post_init__:
- Scheme must be http or https; no file://, ftp://, etc.
- Hostname must not be empty.
- (At download time) hostname is resolved and checked against private ranges when block_private_ips=True.
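The checklist above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: validate_url, is_blocked_address, and check_host are hypothetical helper names, and the choice of ipaddress flags is an assumption about what "private, loopback, link-local, or reserved" covers.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Construction-time checks: allowed scheme and non-empty hostname."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    return parsed.hostname


def is_blocked_address(addr: str) -> bool:
    """Return True if addr is private, loopback, link-local, or reserved."""
    ip = ipaddress.ip_address(addr)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def check_host(hostname: str) -> None:
    """Download-time check: reject the host if ANY resolved address is blocked."""
    for info in socket.getaddrinfo(hostname, None):
        # sockaddr is info[4]; its first item is the address string.
        # Strip an IPv6 scope id such as "%eth0" before parsing.
        addr = info[4][0].split("%", 1)[0]
        if is_blocked_address(addr):
            raise ValueError(f"{hostname} resolves to blocked address {addr}")
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to a mix of public and internal addresses.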
Examples
Subclassing (minimal):
>>> from dataclasses import dataclass
>>> from scikitplot.corpus import BaseDownloader
>>> @dataclass
... class EchoDownloader(BaseDownloader):
...     def download(self) -> DownloadResult:
...         dest = self._resolve_dest_dir() / "echo.txt"
...         dest.write_text(self.input_url)
...         return DownloadResult(
...             input_url=self.input_url, output_path=dest, suffix=".txt"
...         )
Context-manager usage (automatic cleanup):
>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     reader = DocumentReader.create(result.path)
- cleanup()
Remove the temporary directory owned by this instance, if any.
Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.
- Return type:
None
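The ownership rule described for cleanup can be illustrated with a small stand-alone class. This is a sketch under stated assumptions, not the actual BaseDownloader code: TempDirOwner is a hypothetical name, and the "corpus-dl-" prefix is invented for illustration. Only the behaviour documented above is modelled (lazy creation of the temporary directory, idempotent cleanup, no-op for a caller-owned directory).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempDirOwner:
    """Hypothetical stand-in for the temp-directory ownership logic."""

    def __init__(self, output_path: Optional[Path] = None) -> None:
        self.output_path = output_path
        self._tmp_dir: Optional[Path] = None  # set only if we create a dir

    def _resolve_dest_dir(self) -> Path:
        # A caller-supplied directory is used as-is and never owned by us.
        if self.output_path is not None:
            return self.output_path
        # Otherwise create the temporary directory lazily, exactly once.
        if self._tmp_dir is None:
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="corpus-dl-"))
        return self._tmp_dir

    def cleanup(self) -> None:
        # Only a directory created by this instance is removed.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None

    def __enter__(self) -> "TempDirOwner":
        return self

    def __exit__(self, *exc_info) -> None:
        self.cleanup()
```

Resetting _tmp_dir to None after removal is what makes repeated cleanup calls harmless in this sketch.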
- abstractmethod download()
Download the resource and return a DownloadResult.
- Returns:
- DownloadResult
Populated result object. result.path is a readable local file; the caller must not delete it while using it.
- Raises:
- ValueError
On an SSRF violation, when the download exceeds max_bytes, or when the URL scheme is unsupported.
- OSError
On filesystem errors (no space, permission denied).
- urllib.error.URLError
On network errors.
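A subclass implementing download typically needs to enforce max_bytes while streaming. The sketch below shows one way to do that with the documented behaviour (abort with ValueError and delete the partial file). It is an illustration only: stream_to_file and its chunk_size parameter are hypothetical, and an in-memory stream stands in for the HTTP response body.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO


def stream_to_file(
    src: BinaryIO, dest: Path, max_bytes: int, chunk_size: int = 64 * 1024
) -> int:
    """Copy src to dest in chunks; abort and delete dest if max_bytes is exceeded."""
    written = 0
    try:
        with dest.open("wb") as out:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError(f"download exceeds max_bytes={max_bytes}")
                out.write(chunk)
    except BaseException:
        # The file is already closed here; remove the partial download.
        dest.unlink(missing_ok=True)
        raise
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "payload.bin"
        try:
            stream_to_file(io.BytesIO(b"x" * 32), target, max_bytes=16)
        except ValueError as exc:
            print(exc, "| partial file kept:", target.exists())
```

Counting bytes as they arrive, instead of trusting a Content-Length header, keeps the limit effective even when the server omits or misreports the size.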
- Return type: