BaseDownloader

class scikitplot.corpus.BaseDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)')

Abstract base class for all format-specific URL downloaders.

Mirrors the DocumentReader design: the class is a @dataclass ABC, so all parameters are explicit and subclasses add only the fields they specialise.

Parameters:

input_url (str): Fully qualified HTTP/HTTPS URL to download. Validated in __post_init__.
output_path (pathlib.Path or None, optional): Directory to write the downloaded file into. If None, a fresh temporary directory is created on the first download call and owned by this instance; it is removed by cleanup or on context-manager exit. Default: None.
timeout (float, optional): HTTP connection and read timeout in seconds. Default: 30.0.
max_bytes (int, optional): Maximum acceptable download size in bytes. Downloads that exceed this limit are aborted and the partial file is deleted. Default: 100 * 1024 * 1024 (100 MiB).
verify_ssl (bool, optional): Verify TLS/SSL certificates. Never set to False in production; doing so silently disables MITM protection. Default: True.
block_private_ips (bool, optional): Resolve the hostname before connecting and refuse to connect if any resolved address is RFC 1918 private, loopback, link-local, or reserved. This is the primary SSRF defence. Default: True.
max_redirects (int, optional): Maximum number of HTTP 3xx redirects to follow. Default: 5.
user_agent (str, optional): Value for the User-Agent HTTP request header. Default: the scikitplot-corpus bot string shown in the signature above.
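
The max_bytes behaviour described above (abort the transfer and delete the partial file) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's implementation: copy_with_limit and its chunk_size parameter are hypothetical names, and any binary stream stands in for the HTTP response body.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO


def copy_with_limit(
    src: BinaryIO, dest: Path, max_bytes: int, chunk_size: int = 8192
) -> int:
    """Stream src to dest; abort and delete dest if max_bytes is exceeded.

    Hypothetical helper written for this example only.
    """
    written = 0
    try:
        with dest.open("wb") as fh:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError(f"download exceeds max_bytes={max_bytes}")
                fh.write(chunk)
    except BaseException:
        # The file is already closed here; remove the partial file, then re-raise.
        dest.unlink(missing_ok=True)
        raise
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "small.bin"
        print(copy_with_limit(io.BytesIO(b"x" * 100), target, max_bytes=1024))
```

Counting bytes while streaming, rather than trusting a Content-Length header, means the cap still holds when the server omits or misreports the size.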

Attributes:

_tmp_dir (pathlib.Path or None): Temporary directory created by this instance, if any. None when output_path was supplied by the caller.

See also

scikitplot.corpus._downloader._web.WebDownloader: Generic HTTP/HTTPS downloader.
scikitplot.corpus._downloader._github.GitHubDownloader: GitHub blob / raw URL downloader with automatic normalisation.
scikitplot.corpus._downloader._gdrive.GoogleDriveDownloader: Google Drive share-link downloader.
scikitplot.corpus._downloader._youtube.YouTubeDownloader: YouTube transcript downloader.
scikitplot.corpus._downloader._downloader.AnyDownloader: Auto-dispatching downloader that routes to the correct specialist.
scikitplot.corpus._downloader._downloader.CustomDownloader: User-supplied callable as a downloader.

Notes

Subclassing contract:

Decorate the subclass with @dataclass.
Call super().__post_init__() explicitly (or rely on the MRO if using cooperative multiple inheritance).
Override download and call self._resolve_dest_dir() to obtain the write destination before streaming bytes to disk.
Never log credentials (tokens, passwords) at any log level.
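
The contract above can be illustrated with a self-contained stand-in. SketchBase and TokenDownloader are hypothetical classes written for this example; they imitate the dataclass-ABC pattern only and are not scikitplot code. The token field is an assumed credential, used to show one way of keeping secrets out of repr output and therefore out of logs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SketchBase(ABC):
    """Hypothetical stand-in for a dataclass ABC that validates in __post_init__."""

    input_url: str
    timeout: float = 30.0

    def __post_init__(self) -> None:
        parts = urlparse(self.input_url)
        if parts.scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {parts.scheme!r}")

    @abstractmethod
    def download(self) -> str:
        """Subclasses implement the actual transfer."""


@dataclass
class TokenDownloader(SketchBase):
    """Subclass that adds one field and keeps the base validation."""

    # repr=False keeps the credential out of repr() output.
    token: str = field(default="", repr=False)

    def __post_init__(self) -> None:
        super().__post_init__()  # run the base-class URL checks first
        self.token = self.token.strip()

    def download(self) -> str:
        return f"would fetch {self.input_url}"


if __name__ == "__main__":
    dl = TokenDownloader("https://example.com/a.txt", token=" secret ")
    print(dl)  # the token does not appear in the repr
```

A subclass that defines no __post_init__ of its own inherits the base one automatically; the explicit super() call matters only once the subclass overrides it.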

Security checklist enforced in __post_init__:

Scheme must be http or https; file://, ftp://, and other schemes are rejected.
Hostname must not be empty.
(At download time) hostname is resolved and checked against private ranges when block_private_ips=True.
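
The three checks listed above can be sketched with the standard library alone. The function names below are hypothetical and the logic is an approximation of the described behaviour, not the code scikitplot actually runs.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the hostname if the URL passes the scheme and hostname checks."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no hostname")
    return parts.hostname


def is_blocked_address(addr: str) -> bool:
    """True for private, loopback, link-local, or reserved addresses."""
    # Drop an IPv6 zone identifier such as "%eth0" before parsing.
    ip = ipaddress.ip_address(addr.split("%", 1)[0])
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def assert_public_host(hostname: str) -> None:
    """Resolve the hostname and refuse if any resolved address is blocked."""
    for info in socket.getaddrinfo(hostname, None):
        addr = info[4][0]  # sockaddr tuple; the first element is the IP string
        if is_blocked_address(addr):
            raise ValueError(f"{hostname} resolves to blocked address {addr}")


if __name__ == "__main__":
    host = validate_url("https://127.0.0.1/admin")
    try:
        assert_public_host(host)
    except ValueError as exc:
        print(exc)
```

Checking every resolved address, not just the first, matches the "any resolved address" wording of block_private_ips.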

Examples

Subclassing (minimal):

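>>> from dataclasses import dataclass
>>> from scikitplot.corpus import BaseDownloader
>>> # DownloadResult is assumed to be importable from the same package.
>>> @dataclass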
... class EchoDownloader(BaseDownloader):
...     def download(self) -> DownloadResult:
...         dest = self._resolve_dest_dir() / "echo.txt"
...         dest.write_text(self.input_url)
...         return DownloadResult(
...             input_url=self.input_url, output_path=dest, suffix=".txt"
...         )

Context-manager usage (automatic cleanup):

>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     reader = DocumentReader.create(result.path)

block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type: None
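
The ownership rule described for cleanup (remove only a directory this instance created, and tolerate repeated calls) might look roughly like the sketch below. TempDirOwner and resolve_dest_dir are hypothetical names for this example; creating a caller-supplied directory when it is missing is also an assumption, not documented behaviour.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempDirOwner:
    """Hypothetical helper: creates a temp dir lazily, removes only what it created."""

    def __init__(self, output_path: Optional[Path] = None) -> None:
        self.output_path = output_path
        self._tmp_dir: Optional[Path] = None

    def resolve_dest_dir(self) -> Path:
        if self.output_path is not None:
            # Assumption: a caller-owned directory is created if missing.
            self.output_path.mkdir(parents=True, exist_ok=True)
            return self.output_path
        if self._tmp_dir is None:
            # Created on first use and owned by this instance.
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="dl-"))
        return self._tmp_dir

    def cleanup(self) -> None:
        # No-op when nothing was created or when already cleaned up.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None

    def __enter__(self) -> "TempDirOwner":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.cleanup()


if __name__ == "__main__":
    with TempDirOwner() as owner:
        dest = owner.resolve_dest_dir()
        print(dest.is_dir())
    print(dest.exists())
```

Resetting _tmp_dir to None after removal is what makes a second cleanup call harmless.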

abstractmethod download()

Download the resource and return a DownloadResult.

Returns:

DownloadResult: Populated result object. result.path is a readable local file; the caller must not delete it while it is still in use.

Raises:

ValueError: On an SSRF violation, when the download exceeds max_bytes, or when the URL scheme is unsupported.
OSError: On filesystem errors (no space, permission denied).
urllib.error.URLError: On network errors.

Return type:

DownloadResult
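
When handling the exceptions listed under Raises, note that urllib.error.URLError is a subclass of OSError, so its except clause has to come first or network errors will be reported as filesystem errors. A minimal caller-side sketch, where try_download is a hypothetical wrapper that accepts any object exposing a download() method:

```python
import urllib.error
from typing import Any, Optional, Tuple


def try_download(downloader: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Call downloader.download() and return (result, error_description)."""
    try:
        return downloader.download(), None
    except urllib.error.URLError as exc:
        # Must precede OSError: URLError is a subclass of OSError.
        return None, f"network: {exc.reason}"
    except OSError as exc:
        return None, f"filesystem: {exc}"
    except ValueError as exc:
        return None, f"rejected: {exc}"


if __name__ == "__main__":
    class FailingDownloader:
        """Stand-in object used only to exercise the wrapper."""

        def download(self) -> None:
            raise urllib.error.URLError("connection refused")

    print(try_download(FailingDownloader()))
```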

input_url: str

max_bytes: int = 104857600

max_redirects: int = 5

output_path: Path | None = None

timeout: float = 30.0

user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'

verify_ssl: bool = True