WebDownloader

class scikitplot.corpus.WebDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', max_retries=3, retry_backoff=1.0, headers=None)

Generic HTTP/HTTPS file downloader.

Delegates all network I/O, SSRF prevention, retry logic, and extension inference to download_url. The extra parameters on this class expose the full download_url surface area as explicit, named, introspectable attributes.

Parameters:
input_url : str

HTTP/HTTPS URL to download.

output_path : pathlib.Path or None, optional

Directory for the downloaded file. When None, an instance-owned temporary directory is created (and removed by cleanup()). Default: None.

timeout : float, optional

HTTP timeout in seconds. Default: 30.0.

max_bytes : int, optional

Download size cap in bytes. Default: 104857600 (100 MB).

verify_ssl : bool, optional

Verify TLS certificates. Default: True.

block_private_ips : bool, optional

SSRF prevention: block private and reserved IP ranges. Default: True.

max_redirects : int, optional

Maximum number of HTTP redirects to follow. Default: 5.

user_agent : str, optional

User-Agent header value. Default: the scikitplot UA string shown in the signature.

max_retries : int, optional

Maximum retry attempts for transient HTTP errors (429, 500, 502, 503, 504). Set to 0 to disable retries. Default: 3.

retry_backoff : float, optional

Base delay in seconds for exponential back-off between retries. The wait before attempt n (0-indexed) is retry_backoff * 2**n. Default: 1.0.

headers : dict or None, optional

Additional HTTP request headers merged with the default User-Agent; useful for Authorization, Accept, and similar headers. Default: None.

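The retry schedule implied by max_retries and retry_backoff can be sketched as a small helper. This illustrates the documented formula only; it is not the library's actual implementation:

```python
def backoff_waits(max_retries: int, retry_backoff: float) -> list[float]:
    """Per-attempt wait times (seconds) implied by the documented formula."""
    # Attempt n (0-indexed) waits retry_backoff * 2**n seconds before retrying.
    return [retry_backoff * 2 ** n for n in range(max_retries)]
```

With the defaults (max_retries=3, retry_backoff=1.0) this yields waits of 1.0, 2.0, and 4.0 seconds.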

Notes

When to use this vs AnyDownloader:

  • Use WebDownloader when you know the URL is a plain HTTP/HTTPS file (not GitHub blob, not GDrive, not YouTube) and want to control all parameters explicitly.

  • Use AnyDownloader when you receive an arbitrary URL and want automatic routing to the correct specialist.

SSL verification: Setting verify_ssl=False disables certificate validation entirely. This silently exposes the connection to MITM attacks. Only disable in controlled, trusted environments (e.g. local test servers with self-signed certs).
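The block_private_ips guard can be approximated with the standard-library ipaddress module. This is a minimal sketch of the idea, not the actual check performed inside download_url:

```python
import ipaddress
import socket


def is_blocked(hostname: str) -> bool:
    """Reject hosts resolving to private, reserved, loopback, or link-local IPs."""
    # gethostbyname accepts both hostnames and IPv4 literals.
    addr = ipaddress.ip_address(socket.gethostbyname(hostname))
    return (addr.is_private or addr.is_reserved
            or addr.is_loopback or addr.is_link_local)
```

A production-grade guard would also cover IPv6, re-check every address a name resolves to, and re-validate after redirects, as the Notes on download() describe.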

Examples

Simple download:

>>> dl = WebDownloader("https://example.com/paper.pdf")
>>> result = dl.download()
>>> result.suffix
'.pdf'

With custom timeout and size cap:

>>> dl = WebDownloader(
...     "https://example.com/bigfile.zip",
...     timeout=120.0,
...     max_bytes=500 * 1024 * 1024,
...     max_retries=5,
... )
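The headers parameter is merged with the default User-Agent. A sketch of plausible merge semantics, assuming explicit headers win on key collision (the real precedence is decided inside download_url), with a placeholder token:

```python
# Hypothetical illustration of header merging; the actual merge
# happens inside download_url and may differ in precedence.
default_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; scikitplot-corpus/1.0; "
                  "+https://github.com/scikit-plots/scikit-plots)",
}
extra = {"Authorization": "Bearer <token>", "Accept": "application/pdf"}
merged = {**default_headers, **extra}  # later dict wins on duplicate keys
```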

Context-manager (auto-cleanup of temp dir):

>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     data = result.output_path.read_bytes()
block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type:

None

download()

Download the URL to a local file and return a DownloadResult.

Returns:
DownloadResult

Populated result with output_path, suffix, source_url, content_type, and suggested_filename.

Raises:
ValueError

If SSRF check fails, or download exceeds max_bytes.

urllib.error.URLError

If all retry attempts fail due to network errors.

OSError

If the destination directory cannot be created or written.


Notes

The SSRF check is applied before connecting. After a redirect chain, the final URL is re-validated against private IP ranges (this guard lives inside download_url, on the requests code path). Extension inference order:

  1. URL path extension (cheapest).

  2. Content-Disposition filename (RFC 5987 + plain form).

  3. Content-Type MIME mapping.

  4. Magic-byte detection on the downloaded file.

  5. .bin fallback.
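The five-step inference order above can be sketched with stdlib tools. This is a simplified illustration; the real logic in download_url also handles RFC 5987 encoded filenames and a larger magic-byte table:

```python
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Tiny illustrative magic-byte table (the real one is larger).
MAGIC = {b"%PDF": ".pdf", b"PK\x03\x04": ".zip", b"\x1f\x8b": ".gz"}


def infer_suffix(url, content_disposition=None, content_type=None, head=b""):
    # 1. URL path extension (cheapest).
    suffix = PurePosixPath(urlparse(url).path).suffix
    if suffix:
        return suffix
    # 2. Content-Disposition filename (plain form only in this sketch).
    if content_disposition and "filename=" in content_disposition:
        name = content_disposition.split("filename=")[-1].strip('"; ')
        if PurePosixPath(name).suffix:
            return PurePosixPath(name).suffix
    # 3. Content-Type MIME mapping.
    if content_type:
        guess = mimetypes.guess_extension(content_type.split(";")[0].strip())
        if guess:
            return guess
    # 4. Magic-byte detection on the first bytes of the downloaded file.
    for magic, ext in MAGIC.items():
        if head.startswith(magic):
            return ext
    # 5. Fallback.
    return ".bin"
```

Each step only runs if every cheaper step before it failed to produce an extension.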

headers: dict | None = None
input_url: str
max_bytes: int = 104857600
max_redirects: int = 5
max_retries: int = 3
output_path: Path | None = None
retry_backoff: float = 1.0
timeout: float = 30.0
user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'
verify_ssl: bool = True