WebDownloader
class scikitplot.corpus.WebDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', max_retries=3, retry_backoff=1.0, headers=None)
Generic HTTP/HTTPS file downloader.
Delegates all network I/O, SSRF prevention, retry logic, and extension inference to download_url. The extra parameters on this class expose the full download_url surface area as explicit, named, introspectable attributes.

Parameters:
- input_url : str
    HTTP/HTTPS URL to download.
- output_path : pathlib.Path or None, optional
    Directory for the downloaded file. An owned temporary directory is used when None. Default: None.
- timeout : float, optional
    HTTP timeout in seconds. Default: 30.0.
- max_bytes : int, optional
    Download size cap in bytes. Default: 104857600 (100 MB).
- verify_ssl : bool, optional
    Verify TLS certificates. Default: True.
- block_private_ips : bool, optional
    SSRF prevention: block private/reserved IPs. Default: True.
- max_redirects : int, optional
    Maximum HTTP redirects. Default: 5.
- user_agent : str, optional
    User-Agent header value. Default: scikitplot UA string.
- max_retries : int, optional
    Maximum retry attempts for transient HTTP errors (429, 500, 502, 503, 504). Set to 0 to disable retries. Default: 3.
- retry_backoff : float, optional
    Base delay in seconds for exponential back-off between retries. The wait before retry attempt n (0-indexed) is retry_backoff * 2**n. Default: 1.0.
- headers : dict or None, optional
    Additional HTTP request headers to merge with the default User-Agent. Useful for Authorization, Accept, etc. Default: None.
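The retry schedule implied by max_retries and retry_backoff can be sketched in a few lines (a minimal illustration of the documented formula; the library's internal retry loop may differ in detail):

```python
def backoff_delays(max_retries: int, retry_backoff: float) -> list[float]:
    """Delay (seconds) before retry attempt n (0-indexed): retry_backoff * 2**n."""
    return [retry_backoff * 2**n for n in range(max_retries)]

# With the defaults (max_retries=3, retry_backoff=1.0) the waits are
# 1.0 s, 2.0 s, and 4.0 s before the three retry attempts.
print(backoff_delays(3, 1.0))  # → [1.0, 2.0, 4.0]
```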
Notes

When to use this vs AnyDownloader:

- Use WebDownloader when you know the URL is a plain HTTP/HTTPS file (not a GitHub blob, not Google Drive, not YouTube) and want to control all parameters explicitly.
- Use AnyDownloader when you receive an arbitrary URL and want automatic routing to the correct specialist.
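The private/reserved-IP blocking behind block_private_ips can be approximated with the standard-library ipaddress module. This is a hypothetical helper for illustration only, not the library's actual implementation (which also re-validates the final URL after redirects):

```python
import ipaddress

def is_blocked(ip_text: str) -> bool:
    # Reject private, loopback, link-local, reserved, and multicast
    # addresses -- the classic SSRF targets (127.x, 10.x, 169.254.x, ...).
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
    )

print(is_blocked("127.0.0.1"))        # loopback: blocked
print(is_blocked("169.254.169.254"))  # cloud metadata endpoint: blocked
```

Note that a real SSRF guard must resolve hostnames first and check every address they resolve to, not just literal IPs.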
SSL verification: Setting verify_ssl=False disables certificate validation entirely, silently exposing the connection to MITM attacks. Only disable it in controlled, trusted environments (e.g. local test servers with self-signed certificates).

Examples
Simple download:
>>> dl = WebDownloader("https://example.com/paper.pdf")
>>> result = dl.download()
>>> result.suffix
'.pdf'
With custom timeout and size cap:
>>> dl = WebDownloader(
...     "https://example.com/bigfile.zip",
...     timeout=120.0,
...     max_bytes=500 * 1024 * 1024,
...     max_retries=5,
... )
Context-manager (auto-cleanup of temp dir):
>>> with WebDownloader("https://example.com/doc.pdf") as dl:
...     result = dl.download()
...     data = result.output_path.read_bytes()
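The headers parameter is merged with the default User-Agent. A plain-dict sketch of that merge (hypothetical names; whether caller headers may override the default User-Agent is an assumption here, not documented behavior):

```python
DEFAULT_UA = (
    "Mozilla/5.0 (compatible; scikitplot-corpus/1.0; "
    "+https://github.com/scikit-plots/scikit-plots)"
)

def merged_headers(headers=None, user_agent=DEFAULT_UA):
    # Start from the default User-Agent, then let caller-supplied
    # headers add to (or, in this sketch, override) it.
    base = {"User-Agent": user_agent}
    if headers:
        base.update(headers)
    return base

print(merged_headers({"Authorization": "Bearer <token>"}))
```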
cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type: None
download()

Download the URL to a local file and return a DownloadResult.

Returns:
- DownloadResult
    Populated result with output_path, suffix, source_url, content_type, and suggested_filename.

Raises:
- ValueError
    If the SSRF check fails, or the download exceeds max_bytes.
- urllib.error.URLError
    If all retry attempts fail due to network errors.
- OSError
    If the destination directory cannot be created or written.

Return type: DownloadResult
Notes

The SSRF check is applied before connecting. After a redirect chain, the final URL is re-validated against private IP ranges (guarded inside download_url via the requests path).

Extension inference order:

1. URL path extension (cheapest).
2. Content-Disposition filename (RFC 5987 and plain form).
3. Content-Type MIME mapping.
4. Magic-byte detection on the downloaded file.
5. .bin fallback.
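The inference order above can be sketched with standard-library pieces. This is a simplified illustration under stated assumptions: the function name is hypothetical, only the plain Content-Disposition form is parsed (RFC 5987 is elided), and magic-byte detection is omitted.

```python
import mimetypes
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

def infer_suffix(url, content_disposition=None, content_type=None):
    # 1. URL path extension (cheapest).
    suffix = PurePosixPath(urlparse(url).path).suffix
    if suffix:
        return suffix
    # 2. Content-Disposition filename (plain form only here).
    if content_disposition:
        m = re.search(r'filename="?([^";]+)"?', content_disposition)
        if m:
            suffix = PurePosixPath(m.group(1)).suffix
            if suffix:
                return suffix
    # 3. Content-Type MIME mapping.
    if content_type:
        guess = mimetypes.guess_extension(content_type.split(";")[0].strip())
        if guess:
            return guess
    # 4. (Magic-byte detection on the downloaded file would go here.)
    # 5. Fallback.
    return ".bin"

print(infer_suffix("https://example.com/paper.pdf"))  # → '.pdf'
```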