AnyDownloader

class scikitplot.corpus.AnyDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', youtube_mode='transcript', youtube_language='en', youtube_include_auto=True, github_token=None, headers=None, max_retries=3, retry_backoff=1.0)

Auto-dispatching downloader with multi-URL and per-parameter list support.

Accepts one URL or a list of URLs. Every parameter except input_url and output_path accepts T | list[T] | None:

None → use the parameter’s built-in default for every URL.
Scalar → broadcast to every URL.
list → applied element-wise; must be the same length as input_url.
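
The three rules above can be sketched as a small helper. This is only an illustration of the described behaviour, not the library's implementation; the name broadcast_param and its signature are hypothetical.

```python
from typing import Any, List


def broadcast_param(value: Any, n_urls: int, default: Any, name: str = "param") -> List[Any]:
    """Expand one parameter into a per-URL list (hypothetical helper)."""
    if value is None:
        # None: use the built-in default for every URL.
        return [default] * n_urls
    if isinstance(value, list):
        # list: applied element-wise, so the length must match the URL count.
        if len(value) != n_urls:
            raise ValueError(
                f"{name} has {len(value)} items but there are {n_urls} URLs"
            )
        return list(value)
    # Scalar: broadcast the same value to every URL.
    return [value] * n_urls


urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
timeouts = broadcast_param([120.0, 30.0], len(urls), default=30.0, name="timeout")
max_bytes = broadcast_param(None, len(urls), default=104857600, name="max_bytes")
```

In this sketch a list may still contain None entries (as the github_token and headers parameters allow per URL); they are passed through unchanged.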

Parameters:

input_url (str or list[str]): One URL or a list of URLs. When a list is supplied, download returns list[DownloadResult].
output_path (pathlib.Path or None, optional): Directory shared across all URLs. Default: None (temp dir).
timeout (float or list[float] or None, optional): HTTP timeout in seconds. Default: 30.0.
max_bytes (int or list[int] or None, optional): Download size cap in bytes. Default: 104857600 (100 MiB).
verify_ssl (bool or list[bool] or None, optional): Verify TLS certificates. Default: True.
block_private_ips (bool or list[bool] or None, optional): Block private IP addresses (SSRF prevention). Default: True.
max_redirects (int or list[int] or None, optional): Maximum HTTP redirects. Default: 5.
user_agent (str or list[str] or None, optional): User-Agent header value. Default: scikitplot UA string.
youtube_mode (str or list[str] or None, optional): Mode for YouTubeDownloader: "transcript", "audio", or "video". Default: "transcript".
youtube_language (str or list[str] or None, optional): BCP-47 language code for transcript fetching. Default: "en".
youtube_include_auto (bool or list[bool] or None, optional): Include auto-generated captions as fallback. Default: True.
github_token (str or list[str or None] or None, optional): Personal access token (PAT) for GitHubDownloader (private repos). Per-URL None allowed. Never logged. Default: None.
headers (dict or list[dict or None] or None, optional): Extra HTTP headers for WebDownloader. Per-URL None allowed. Default: None.
max_retries (int or list[int] or None, optional): Retry attempts for WebDownloader. Default: 3.
retry_backoff (float or list[float] or None, optional): Exponential back-off base for WebDownloader. Default: 1.0.

Parameters:

input_url (str)
output_path (Path | None)
timeout (float)
max_bytes (int)
verify_ssl (bool)
block_private_ips (bool)
max_redirects (int)
user_agent (str)
youtube_mode (object)
youtube_language (object)
youtube_include_auto (object)
github_token (object)
headers (object)
max_retries (object)
retry_backoff (object)

Notes

Single vs batch:

# Single URL — returns DownloadResult
result = AnyDownloader("https://example.com/paper.pdf").download()

# Batch — returns list[DownloadResult]
results = AnyDownloader(
    [
        "https://example.com/paper.pdf",
        "https://github.com/org/repo/blob/main/data.csv",
        "https://www.youtube.com/watch?v=abc123",
    ]
).download()

Per-URL parameters:

dl = AnyDownloader(
    input_url=[
        "https://github.com/org/priv/blob/main/secret.csv",
        "https://example.com/public.pdf",
    ],
    github_token=["ghp_token", None],  # None = public, no token needed
    timeout=[120.0, 30.0],  # per-URL timeouts
    max_bytes=200 * 1024 * 1024,  # broadcast to all
)
results = dl.download()

Examples

Single URL:

>>> dl = AnyDownloader("https://example.com/report.pdf")
>>> isinstance(dl.download(), DownloadResult)
True

Batch:

>>> dl = AnyDownloader(
...     ["https://example.com/a.pdf", "https://example.com/b.pdf"],
...     timeout=60.0,
... )
>>> len(dl.download())
2

block_private_ips: bool = True
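
As a rough illustration of what SSRF prevention of this kind usually involves, the sketch below classifies an IP literal with the standard ipaddress module. Which ranges the library actually blocks is not documented here, so the function name is_blocked_address and the set of checks are assumptions.

```python
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """Return True if the IP literal falls in a non-public range (hypothetical check)."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16
        or ip.is_loopback    # e.g. 127.0.0.1, ::1
        or ip.is_link_local  # e.g. 169.254.0.0/16 (cloud metadata endpoints)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

A real guard would also resolve the hostname and apply the check to every resolved address, and repeat it after each redirect. That part is omitted here.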

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.
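
The ownership rule described above can be sketched with a stand-in class. TempDirOwner is hypothetical and only mirrors the documented behaviour (remove an instance-owned temp dir, leave a caller-supplied directory alone). When the real class creates its temporary directory is an assumption.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempDirOwner:
    """Hypothetical stand-in showing caller-owned vs instance-owned directories."""

    def __init__(self, output_path: Optional[Path] = None) -> None:
        # The instance owns the directory only when the caller did not supply one.
        self._owns_dir = output_path is None
        if self._owns_dir:
            self.output_path = Path(tempfile.mkdtemp(prefix="dl-sketch-"))
        else:
            self.output_path = Path(output_path)

    def cleanup(self) -> None:
        # No-op for caller-owned directories; ignore_errors makes repeat calls safe.
        if self._owns_dir:
            shutil.rmtree(self.output_path, ignore_errors=True)


owner = TempDirOwner()
owned_dir = owner.output_path
try:
    pass  # downloads would be written under owner.output_path here
finally:
    owner.cleanup()
    owner.cleanup()  # second call does nothing
```

Wrapping the work in try/finally, as above, is one way to make sure the temporary directory is removed even if a download raises.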

Return type: None

download()

Download one URL or all URLs and return the result(s).

Returns:

DownloadResult: When input_url was a single str.
list[DownloadResult]: When input_url was a list[str]. Preserves input order.

Notes

Batch downloads are sequential. For parallel execution, call download_single per URL in your own thread/process pool.
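
A minimal sketch of the thread-pool approach mentioned above. Because the signature of download_single is not shown here, the helper takes any per-URL callable. Both download_parallel and fake_fetch are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

R = TypeVar("R")


def download_parallel(
    urls: List[str],
    fetch_one: Callable[[str], R],
    max_workers: int = 4,
) -> List[R]:
    """Run fetch_one for each URL in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which download finishes first.
        return list(pool.map(fetch_one, urls))


def fake_fetch(url: str) -> str:
    # Stand-in for a real per-URL download; returns a label instead of a result.
    return f"fetched:{url}"


urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
results = download_parallel(urls, fake_fetch)
```

With the real class, fetch_one could be something like `lambda url: AnyDownloader(url).download()`, which returns a single DownloadResult per the single-URL behaviour documented above. If fetch_one raises, the exception is re-raised when the results are collected.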

download_all()

Download all URLs and always return list[DownloadResult].

Normalises the return type so callers never need to branch on isinstance(result, list).

Returns:

list[DownloadResult]: One DownloadResult per URL, in input order.

Examples

>>> dl = AnyDownloader("https://example.com/doc.pdf")
>>> results = dl.download_all()
>>> len(results)
1
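
The normalisation that download_all provides amounts to the following pattern, shown as a standalone sketch. The helper name as_list is hypothetical and is not part of the library.

```python
from typing import List, TypeVar, Union

T = TypeVar("T")


def as_list(result: Union[T, List[T]]) -> List[T]:
    """Wrap a single result in a list; return a list result unchanged."""
    return result if isinstance(result, list) else [result]


single = as_list("one-result")
batch = as_list(["first", "second"])
```

Calling download_all directly avoids writing this branch at every call site.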

github_token: object = None

headers: object = None

input_url: str

max_bytes: int = 104857600

max_redirects: int = 5

max_retries: object = 3

output_path: Path | None = None

retry_backoff: object = 1.0
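
To give a feel for how an exponential back-off base interacts with max_retries, here is a sketch of one common schedule, base * 2**attempt. The exact formula WebDownloader uses is not documented here, so both the formula and the name backoff_delays are assumptions.

```python
from typing import List


def backoff_delays(max_retries: int = 3, retry_backoff: float = 1.0) -> List[float]:
    """Seconds to wait before each retry under an assumed base * 2**attempt schedule."""
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]


default_delays = backoff_delays()        # [1.0, 2.0, 4.0]
short_delays = backoff_delays(2, 0.5)    # [0.5, 1.0]
```

Under this assumed schedule the defaults would wait at most 7 seconds in total across the three retries.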

timeout: float = 30.0

user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'

verify_ssl: bool = True

youtube_include_auto: object = True

youtube_language: object = 'en'

youtube_mode: object = 'transcript'