AnyDownloader

class scikitplot.corpus.AnyDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', youtube_mode='transcript', youtube_language='en', youtube_include_auto=True, github_token=None, headers=None, max_retries=3, retry_backoff=1.0)

Auto-dispatching downloader with multi-URL and per-parameter list support.

Accepts one URL or a list of URLs. Except for input_url and output_path, every parameter accepts T | list[T] | None:

  • None → use the parameter’s built-in default for every URL.

  • Scalar → broadcast to every URL.

  • list → applied element-wise; must be the same length as input_url.
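The three rules above can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the library's implementation: broadcast_param and its arguments are hypothetical names introduced here.

```python
def broadcast_param(value, n_urls, default, name="param"):
    """Expand one parameter into a list with one entry per URL.

    Hypothetical helper illustrating the None / scalar / list rules.
    """
    if value is None:
        # None: fall back to the built-in default for every URL.
        return [default] * n_urls
    if isinstance(value, list):
        # list: applied element-wise, so the lengths must match.
        if len(value) != n_urls:
            raise ValueError(
                f"{name} has {len(value)} entries but there are {n_urls} URLs"
            )
        return list(value)
    # Scalar: the same value is reused for every URL.
    return [value] * n_urls


timeouts = broadcast_param([120.0, 30.0], n_urls=2, default=30.0, name="timeout")
sizes = broadcast_param(200 * 1024 * 1024, n_urls=2, default=104857600)
```

List entries are passed through unchanged, so a per-URL None (as allowed for github_token and headers) stays None in this sketch.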

Parameters:
input_url : str or list[str]

One URL or a list of URLs. When a list is supplied, download returns list[DownloadResult].

output_path : pathlib.Path or None, optional

Output directory, shared across all URLs. Default: None (a temporary directory is used).

timeout : float or list[float] or None, optional

HTTP timeout in seconds. Default: 30.0.

max_bytes : int or list[int] or None, optional

Download size cap in bytes. Default: 104857600 (100 MiB).

verify_ssl : bool or list[bool] or None, optional

Verify TLS certificates. Default: True.

block_private_ips : bool or list[bool] or None, optional

Reject private IP addresses to prevent server-side request forgery (SSRF). Default: True.

max_redirects : int or list[int] or None, optional

Maximum HTTP redirects. Default: 5.

user_agent : str or list[str] or None, optional

User-Agent header value. Default: the scikitplot-corpus UA string shown in the signature.

youtube_mode : str or list[str] or None, optional

Mode for YouTubeDownloader: "transcript", "audio", or "video". Default: "transcript".

youtube_language : str or list[str] or None, optional

BCP-47 language code for transcript fetching. Default: "en".

youtube_include_auto : bool or list[bool] or None, optional

Include auto-generated captions as fallback. Default: True.

github_token : str or list[str or None] or None, optional

Personal access token (PAT) for GitHubDownloader, needed for private repos. Per-URL None allowed. Never logged. Default: None.

headers : dict or list[dict or None] or None, optional

Extra HTTP headers for WebDownloader. Per-URL None allowed. Default: None.

max_retries : int or list[int] or None, optional

Retry attempts for WebDownloader. Default: 3.

retry_backoff : float or list[float] or None, optional

Exponential back-off base for WebDownloader. Default: 1.0.
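To make the interplay of max_retries and retry_backoff concrete, here is a minimal sketch of a retry schedule. The formula delay = retry_backoff * 2 ** attempt is an assumption: the text only says that retry_backoff is the exponential back-off base, and the actual schedule used by WebDownloader may differ. backoff_delays is a hypothetical helper.

```python
def backoff_delays(max_retries=3, retry_backoff=1.0):
    """Return the wait (in seconds) before each retry attempt.

    ASSUMPTION: the delay doubles on every attempt, starting at
    retry_backoff. This is a common convention, not a confirmed
    description of WebDownloader.
    """
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays()  # defaults: 3 retries, base 1.0
print(delays)  # [1.0, 2.0, 4.0]
```

Under this assumption the defaults add at most 7 seconds of waiting on top of the per-request timeout.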

Notes

Single vs batch:

# Single URL — returns DownloadResult
result = AnyDownloader("https://example.com/paper.pdf").download()

# Batch — returns list[DownloadResult]
results = AnyDownloader(
    [
        "https://example.com/paper.pdf",
        "https://github.com/org/repo/blob/main/data.csv",
        "https://www.youtube.com/watch?v=abc123",
    ]
).download()

Per-URL parameters:

dl = AnyDownloader(
    input_url=[
        "https://github.com/org/priv/blob/main/secret.csv",
        "https://example.com/public.pdf",
    ],
    github_token=["ghp_token", None],  # None = public, no token needed
    timeout=[120.0, 30.0],  # per-URL timeouts
    max_bytes=200 * 1024 * 1024,  # broadcast to all
)
results = dl.download()

Examples

Single URL:

>>> dl = AnyDownloader("https://example.com/report.pdf")
>>> isinstance(dl.download(), DownloadResult)
True

Batch:

>>> dl = AnyDownloader(
...     ["https://example.com/a.pdf", "https://example.com/b.pdf"],
...     timeout=60.0,
... )
>>> len(dl.download())
2

block_private_ips: bool = True

cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.

Return type:

None
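The ownership rule described for cleanup() can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the AnyDownloader implementation: TempDirOwner, work_dir, and _owns_dir are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


class TempDirOwner:
    """Hypothetical stand-in showing caller-owned vs instance-owned directories."""

    def __init__(self, output_path=None):
        # The instance owns the directory only when it created it itself.
        self._owns_dir = output_path is None
        if output_path is None:
            self.work_dir = Path(tempfile.mkdtemp())
        else:
            self.work_dir = Path(output_path)

    def cleanup(self):
        # No-op for caller-owned directories; idempotent for owned ones.
        if self._owns_dir and self.work_dir.exists():
            shutil.rmtree(self.work_dir)


owner = TempDirOwner()
owned_dir = owner.work_dir
owner.cleanup()
owner.cleanup()  # second call is harmless
```

Wrapping the download in try/finally with cleanup() in the finally clause is one way to avoid leaving temporary directories behind after an error.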

download()

Download one URL or all URLs and return the result(s).

Returns:
DownloadResult

When input_url was a single str.

list[DownloadResult]

When input_url was a list[str]. Preserves input order.

Notes

Batch downloads are sequential. For parallel execution, call download_single per URL in your own thread/process pool.
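A thread-pool wrapper along the lines suggested above might look like the following sketch. The signature of download_single is not shown in this section, so the helper takes any per-URL callable; download_parallel, fetch_one, and fake_fetch are hypothetical names, and fake_fetch only stands in for a real download call.

```python
from concurrent.futures import ThreadPoolExecutor


def download_parallel(urls, fetch_one, max_workers=4):
    """Call fetch_one(url) for every URL in a thread pool.

    Executor.map yields results in the order of the inputs, so the
    returned list preserves input order like the sequential batch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url):
    # Stand-in for a real per-URL download; performs no network I/O.
    return {"url": url, "ok": True}


urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
results = download_parallel(urls, fake_fetch)
```

Threads suit this workload because downloads are I/O-bound; an exception raised by any fetch_one call propagates when its result is collected.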

download_all()

Download all URLs and always return list[DownloadResult].

Normalises the return type so callers never need to branch on isinstance(result, list).

Returns:
list[DownloadResult]

One DownloadResult per URL, in input order.

Examples

>>> dl = AnyDownloader("https://example.com/doc.pdf")
>>> results = dl.download_all()
>>> len(results)
1
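The normalisation that download_all() provides amounts to the following idea, shown as a minimal sketch. as_list is a hypothetical helper, not part of the library.

```python
def as_list(result):
    """Wrap a single result in a list; return lists unchanged."""
    return result if isinstance(result, list) else [result]


single = as_list("one-result")
batch = as_list(["first", "second"])
```

Calling download_all() directly removes the need for such a helper in caller code.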

github_token: object = None
headers: object = None
input_url: str
max_bytes: int = 104857600
max_redirects: int = 5
max_retries: object = 3
output_path: Path | None = None
retry_backoff: object = 1.0
timeout: float = 30.0
user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'
verify_ssl: bool = True
youtube_include_auto: object = True
youtube_language: object = 'en'
youtube_mode: object = 'transcript'