BaseDownloader

class scikitplot.corpus.BaseDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)')

Abstract base class for all format-specific URL downloaders.

Mirrors the DocumentReader design: the class is a @dataclass ABC, so all parameters are explicit and subclasses add only the fields they specialise.

Parameters:
input_url : str

Fully-qualified HTTP/HTTPS URL to download. Validated in __post_init__.

output_path : pathlib.Path or None, optional

Directory to write the downloaded file into. If None, a fresh temporary directory is created on the first download call and owned by this instance; it is removed by cleanup() or on context-manager exit. Default: None.

timeout : float, optional

HTTP connection and read timeout, in seconds. Default: 30.0.

max_bytes : int, optional

Maximum acceptable download size in bytes. Downloads that exceed this limit are aborted and the partial file is deleted. Default: 100 * 1024 * 1024 (104857600 bytes, 100 MiB).

verify_ssl : bool, optional

Verify TLS/SSL certificates. Never set to False in production; doing so silently disables protection against man-in-the-middle (MITM) attacks. Default: True.

block_private_ips : bool, optional

Resolve the hostname before connecting and refuse the connection if any resolved address is RFC 1918 private, loopback, link-local, or reserved. This is the primary SSRF defence. Default: True.

max_redirects : int, optional

Maximum number of HTTP 3xx redirects to follow. Default: 5.

user_agent : str, optional

Value for the User-Agent HTTP request header. Default: the scikitplot-corpus string shown in the class signature.
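
The max_bytes behaviour described above (abort on overflow, delete the partial file) can be illustrated with a small standard-library sketch. This is not the library's implementation: stream_with_limit and CHUNK_SIZE are hypothetical names chosen for the example, and any binary file-like object stands in for the HTTP response.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not taken from the library


def stream_with_limit(source: BinaryIO, dest: Path, max_bytes: int) -> int:
    """Copy ``source`` to ``dest``; abort and delete ``dest`` past ``max_bytes``."""
    written = 0
    try:
        with dest.open("wb") as fh:
            while True:
                chunk = source.read(CHUNK_SIZE)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    # Check before writing so the file never exceeds the cap.
                    raise ValueError(f"download exceeds max_bytes={max_bytes}")
                fh.write(chunk)
    except BaseException:
        # The file handle is already closed here; remove the partial file.
        dest.unlink(missing_ok=True)
        raise
    return written
```

Counting bytes while streaming, rather than trusting a Content-Length header, keeps the limit effective even when the server omits or misreports the size.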

Attributes:
_tmp_dir : pathlib.Path or None

Temporary directory created by this instance, if any. None when output_path was supplied by the caller.

See also

scikitplot.corpus._downloader._web.WebDownloader

Generic HTTP/HTTPS downloader.

scikitplot.corpus._downloader._github.GitHubDownloader

GitHub blob / raw URL downloader with automatic normalisation.

scikitplot.corpus._downloader._gdrive.GoogleDriveDownloader

Google Drive share-link downloader.

scikitplot.corpus._downloader._youtube.YouTubeDownloader

YouTube transcript downloader.

scikitplot.corpus._downloader._downloader.AnyDownloader

Auto-dispatching downloader — routes to the correct specialist.

scikitplot.corpus._downloader._downloader.CustomDownloader

User-supplied callable as a downloader.

Notes

Subclassing contract:

  1. Decorate the subclass with @dataclass.

  2. If the subclass defines its own __post_init__, call super().__post_init__() explicitly so the base-class URL validation still runs (or rely on the MRO if using cooperative multiple inheritance).

  3. Override download and call self._resolve_dest_dir() to obtain the write destination before streaming bytes to disk.

  4. Never log credentials (tokens, passwords) at any log level.

Security checklist enforced in __post_init__:

  • Scheme must be http or https — no file://, ftp://, etc.

  • Hostname must not be empty.

  • (At download time) hostname is resolved and checked against private ranges when block_private_ips=True.
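
The checklist above can be sketched with the standard library alone. The function names below (validate_url, is_blocked_address, assert_public_host) are hypothetical and chosen for illustration; the library's actual checks may differ in detail.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def validate_url(url: str) -> str:
    """Return the hostname if the URL passes the scheme and hostname checks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no hostname")
    return parts.hostname


def is_blocked_address(address: str) -> bool:
    """True for private, loopback, link-local, or reserved addresses."""
    # Drop an IPv6 zone identifier such as "%eth0" before parsing.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def assert_public_host(hostname: str, port: int = 443) -> None:
    """Resolve ``hostname`` and reject it if ANY resolved address is blocked."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if is_blocked_address(sockaddr[0]):
            raise ValueError(
                f"{hostname} resolves to blocked address {sockaddr[0]}"
            )
```

Checking every resolved address, not just the first, matters because a hostname can return a mix of public and private records.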

Examples

Subclassing (minimal):

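>>> from dataclasses import dataclass
>>> from scikitplot.corpus import BaseDownloader
>>> @dataclass
... class EchoDownloader(BaseDownloader):
...     def download(self) -> DownloadResult:
...         # Obtain the write destination before writing any bytes.
...         dest = self._resolve_dest_dir() / "echo.txt"
...         dest.write_text(self.input_url)
...         return DownloadResult(
...             input_url=self.input_url, output_path=dest, suffix=".txt"
...         )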

Context-manager usage (automatic cleanup):

>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     reader = DocumentReader.create(result.path)

block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type:

None
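
The ownership rule behind cleanup can be sketched as follows. TempDirOwner is a hypothetical stand-in written for this example, not the library class; it only mirrors the documented behaviour that a temporary directory is created lazily, removed by cleanup, and that a caller-supplied output_path is never touched.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class TempDirOwner:
    """Hypothetical stand-in illustrating the ownership rule only."""

    output_path: Optional[Path] = None
    _tmp_dir: Optional[Path] = field(default=None, init=False, repr=False)

    def _resolve_dest_dir(self) -> Path:
        # Caller-owned directory: use it and never register it for cleanup.
        if self.output_path is not None:
            return self.output_path
        # Instance-owned directory: create it once, on first use.
        if self._tmp_dir is None:
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="dl-"))
        return self._tmp_dir

    def cleanup(self) -> None:
        # No-op when nothing is owned, so repeated calls are safe.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None

    def __enter__(self) -> "TempDirOwner":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.cleanup()
```

Resetting _tmp_dir to None after removal is what makes a second cleanup call harmless.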

abstractmethod download()

Download the resource and return a DownloadResult.

Returns:
DownloadResult

Populated result object. result.path is a readable local file; the caller must not delete it while using it.

Raises:
ValueError

On an SSRF violation, when max_bytes is exceeded, or for an unsupported URL scheme.

OSError

On filesystem errors (no space, permission denied).

urllib.error.URLError

On network errors.

Return type:

DownloadResult
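
A caller handling the three documented exception types needs to order the except clauses with care, because urllib.error.URLError is a subclass of OSError. The sketch below shows one way to do this; try_download is a hypothetical helper that accepts any object exposing a download() method.

```python
import urllib.error
from typing import Any, Optional, Tuple


def try_download(downloader: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Return ``(result, None)`` on success or ``(None, message)`` on failure."""
    try:
        return downloader.download(), None
    except ValueError as exc:
        # SSRF violation, size limit exceeded, or unsupported scheme.
        return None, f"rejected: {exc}"
    except urllib.error.URLError as exc:
        # Must precede OSError: URLError is an OSError subclass.
        return None, f"network: {exc.reason}"
    except OSError as exc:
        # Filesystem problems such as no space or permission denied.
        return None, f"filesystem: {exc}"
```

If the OSError clause came first, network failures would be reported as filesystem errors.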

input_url: str
max_bytes: int = 104857600
max_redirects: int = 5
output_path: Path | None = None
timeout: float = 30.0
user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'
verify_ssl: bool = True