WebDownloader

class scikitplot.corpus.WebDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', max_retries=3, retry_backoff=1.0, headers=None)

Generic HTTP/HTTPS file downloader.

Delegates all network I/O, SSRF prevention, retry logic, and extension inference to download_url. The extra parameters on this class expose the full download_url surface area as explicit, named, introspectable attributes.

Parameters:
input_url : str

HTTP/HTTPS URL to download.

output_path : pathlib.Path or None, optional

Directory for the downloaded file. When None, an instance-owned temporary directory is created (and removed by cleanup()). Default: None.

timeout : float, optional

HTTP timeout in seconds. Default: 30.0.

max_bytes : int, optional

Download size cap in bytes. Default: 104857600 (100 MB).

verify_ssl : bool, optional

Verify TLS certificates. Default: True.

block_private_ips : bool, optional

SSRF prevention: block private and reserved IP ranges. Default: True.

max_redirects : int, optional

Maximum number of HTTP redirects to follow. Default: 5.

user_agent : str, optional

User-Agent header value. Default: the scikitplot UA string shown in the signature.

max_retries : int, optional

Maximum retry attempts for transient HTTP errors (429, 500, 502, 503, 504). Set to 0 to disable retries. Default: 3.

retry_backoff : float, optional

Base delay in seconds for exponential back-off between retries. The wait before attempt n (0-indexed) is retry_backoff * 2**n. Default: 1.0.

headers : dict or None, optional

Additional HTTP request headers merged with the default User-Agent; useful for Authorization, Accept, and similar headers. Default: None.

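The retry schedule implied by max_retries and retry_backoff can be sketched as a small helper. This illustrates the documented formula only; it is not the library's actual implementation:

```python
def backoff_waits(max_retries: int, retry_backoff: float) -> list[float]:
    """Per-attempt wait times (seconds) implied by the documented formula."""
    # Attempt n (0-indexed) waits retry_backoff * 2**n seconds before retrying.
    return [retry_backoff * 2 ** n for n in range(max_retries)]
```

With the defaults (max_retries=3, retry_backoff=1.0) this yields waits of 1.0, 2.0, and 4.0 seconds.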

Notes

When to use this vs AnyDownloader:

  • Use WebDownloader when you know the URL is a plain HTTP/HTTPS file (not GitHub blob, not GDrive, not YouTube) and want to control all parameters explicitly.

  • Use AnyDownloader when you receive an arbitrary URL and want automatic routing to the correct specialist.

SSL verification: Setting verify_ssl=False disables certificate validation entirely. This silently exposes the connection to MITM attacks. Only disable in controlled, trusted environments (e.g. local test servers with self-signed certs).
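The block_private_ips guard can be approximated with the standard-library ipaddress module. This is a minimal sketch of the idea, not the actual check performed inside download_url:

```python
import ipaddress
import socket


def is_blocked(hostname: str) -> bool:
    """Reject hosts resolving to private, reserved, loopback, or link-local IPs."""
    # gethostbyname accepts both hostnames and IPv4 literals.
    addr = ipaddress.ip_address(socket.gethostbyname(hostname))
    return (addr.is_private or addr.is_reserved
            or addr.is_loopback or addr.is_link_local)
```

A production-grade guard would also cover IPv6, re-check every address a name resolves to, and re-validate after redirects, as the Notes on download() describe.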

Examples

Simple download:

>>> dl = WebDownloader("https://example.com/paper.pdf")
>>> result = dl.download()
>>> result.suffix
'.pdf'

With custom timeout and size cap:

>>> dl = WebDownloader(
...     "https://example.com/bigfile.zip",
...     timeout=120.0,
...     max_bytes=500 * 1024 * 1024,
...     max_retries=5,
... )
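The headers parameter is merged with the default User-Agent. A sketch of plausible merge semantics, assuming explicit headers win on key collision (the real precedence is decided inside download_url), with a placeholder token:

```python
# Hypothetical illustration of header merging; the actual merge
# happens inside download_url and may differ in precedence.
default_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; scikitplot-corpus/1.0; "
                  "+https://github.com/scikit-plots/scikit-plots)",
}
extra = {"Authorization": "Bearer <token>", "Accept": "application/pdf"}
merged = {**default_headers, **extra}  # later dict wins on duplicate keys
```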

Context-manager (auto-cleanup of temp dir):

>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     data = result.output_path.read_bytes()
block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type:

None

download()

Download the URL to a local file and return a DownloadResult.

Returns:
DownloadResult

Populated result with output_path, suffix, source_url, content_type, and suggested_filename.

Raises:
ValueError

If SSRF check fails, or download exceeds max_bytes.

urllib.error.URLError

If all retry attempts fail due to network errors.

OSError

If the destination directory cannot be created or written.


Notes

The SSRF check is applied before connecting. After a redirect chain, the final URL is re-validated against private IP ranges (this guard lives inside download_url, on the requests code path). Extension inference order:

  1. URL path extension (cheapest).

  2. Content-Disposition filename (RFC 5987 + plain form).

  3. Content-Type MIME mapping.

  4. Magic-byte detection on the downloaded file.

  5. .bin fallback.
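The five-step inference order above can be sketched with stdlib tools. This is a simplified illustration; the real logic in download_url also handles RFC 5987 encoded filenames and a larger magic-byte table:

```python
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Tiny illustrative magic-byte table (the real one is larger).
MAGIC = {b"%PDF": ".pdf", b"PK\x03\x04": ".zip", b"\x1f\x8b": ".gz"}


def infer_suffix(url, content_disposition=None, content_type=None, head=b""):
    # 1. URL path extension (cheapest).
    suffix = PurePosixPath(urlparse(url).path).suffix
    if suffix:
        return suffix
    # 2. Content-Disposition filename (plain form only in this sketch).
    if content_disposition and "filename=" in content_disposition:
        name = content_disposition.split("filename=")[-1].strip('"; ')
        if PurePosixPath(name).suffix:
            return PurePosixPath(name).suffix
    # 3. Content-Type MIME mapping.
    if content_type:
        guess = mimetypes.guess_extension(content_type.split(";")[0].strip())
        if guess:
            return guess
    # 4. Magic-byte detection on the first bytes of the downloaded file.
    for magic, ext in MAGIC.items():
        if head.startswith(magic):
            return ext
    # 5. Fallback.
    return ".bin"
```

Each step only runs if every cheaper step before it failed to produce an extension.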

headers: dict | None = None
input_url: str
max_bytes: int = 104857600
max_redirects: int = 5
max_retries: int = 3
output_path: Path | None = None
retry_backoff: float = 1.0
timeout: float = 30.0
user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'
verify_ssl: bool = True