GitHubDownloader
- class scikitplot.corpus.GitHubDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', token=None)
GitHub URL downloader with automatic blob → raw normalisation.
Accepts both github.com/.../blob/... and raw.githubusercontent.com/... URLs. Blob URLs are silently rewritten to their raw equivalent before downloading.
- Parameters:
- input_url : str
GitHub blob or raw URL. Accepted forms:
  - https://github.com/OWNER/REPO/blob/REF/path/to/file
  - https://raw.githubusercontent.com/OWNER/REPO/REF/path/to/file
  - https://raw.githubusercontent.com/OWNER/REPO/refs/heads/BRANCH/path
- token : str or None, optional
GitHub personal access token (PAT) or fine-grained token. When provided, it is sent as Authorization: Bearer <token> so that private repositories can be accessed. Never logged or included in repr. Default: None (anonymous access, public repos only).
- output_path : pathlib.Path or None, optional
Directory for the downloaded file. Default: None (temp dir).
- timeout : float, optional
HTTP timeout in seconds. Default: 30.0.
- max_bytes : int, optional
Download size cap in bytes. Default: 104857600 (100 MiB).
- verify_ssl : bool, optional
Verify TLS certificates. Default: True.
- block_private_ips : bool, optional
Block private IP addresses as SSRF prevention. Default: True.
- max_redirects : int, optional
Maximum HTTP redirects. Default: 5.
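The max_bytes and block_private_ips guards can be pictured with a short sketch. This is an illustration under stated assumptions, not the library's implementation: the helper names is_private_address and read_capped are hypothetical, and a real SSRF check would also resolve hostnames and test the resulting addresses rather than only IP literals.

```python
import io
import ipaddress


def is_private_address(host: str) -> bool:
    """Return True if ``host`` is an IP literal in a non-public range.

    Hypothetical helper: hostnames (non-literals) return False here,
    whereas a real check would resolve DNS first.
    """
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def read_capped(stream, max_bytes: int, chunk_size: int = 8192) -> bytes:
    """Read ``stream`` to the end, raising ValueError past ``max_bytes``."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"download exceeds max_bytes={max_bytes}")
        chunks.append(chunk)
    return b"".join(chunks)


# Small in-memory demonstration; no network access is involved.
payload = read_capped(io.BytesIO(b"hello"), max_bytes=1024)
```

Reading in chunks lets the cap trigger before an oversized body is held in memory in full.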
- Raises:
- ValueError
If the URL is not a recognised GitHub blob or raw URL at construction time.
- ValueError
If the URL points to a directory tree (/tree/), which is not a downloadable file.
Notes
Blob → raw rewrite rule:
https://github.com/OWNER/REPO/blob/REF/path/to/file.md
↓
https://raw.githubusercontent.com/OWNER/REPO/REF/path/to/file.md
The refs/heads/ prefix used by the GitHub UI for branch refs is preserved when already present in raw URLs and is not added for blob URLs (blob URLs do not carry it).
Private repo access: Tokens are passed as HTTP headers, never as URL query parameters. Tokens are redacted from all log output.
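The blob → raw rewrite rule, together with the two ValueError cases listed under Raises, can be sketched with the standard library alone. The function name blob_to_raw is hypothetical and the parsing is an assumption about how such a rule might be written; the class may handle edge cases (query strings, fragments, other hosts) differently.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url: str) -> str:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com form.

    Hypothetical sketch: raw URLs are returned unchanged, /tree/ URLs and
    unrecognised URLs raise ValueError. Query strings and fragments are
    dropped, which is an assumption of this sketch.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "raw.githubusercontent.com":
        return url
    if host != "github.com":
        raise ValueError(f"not a GitHub blob or raw URL: {url}")

    # Expected path shape: OWNER/REPO/blob/REF/path/to/file
    segments = parts.path.strip("/").split("/")
    if len(segments) >= 3 and segments[2] == "tree":
        raise ValueError(f"directory tree URLs are not downloadable: {url}")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob or raw URL: {url}")

    owner, repo = segments[0], segments[1]
    rest = "/".join(segments[3:])  # REF plus file path, with "blob" removed
    raw_path = f"/{owner}/{repo}/{rest}"
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


raw = blob_to_raw("https://github.com/user/repo/blob/main/data.csv")
```

Because the rule only removes the blob segment and swaps the host, the sketch never needs to work out where REF ends and the file path begins.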
Examples
Public repo (blob URL):
>>> from scikitplot.corpus import GitHubDownloader
>>> dl = GitHubDownloader(
...     "https://github.com/scikit-plots/scikit-plots/blob/main/README.md"
... )
>>> result = dl.download()
>>> result.suffix
'.md'
Public repo (raw URL):
>>> dl = GitHubDownloader(
...     "https://raw.githubusercontent.com/scikit-plots/scikit-plots"
...     "/refs/heads/main/README.md"
... )
>>> result = dl.download()
Private repo with PAT:
>>> dl = GitHubDownloader(
...     "https://github.com/myorg/private-repo/blob/main/data.csv",
...     token="ghp_xxxxxxxxxxxxxxxxxxxx",
... )
>>> result = dl.download()
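For private repositories, the Notes state that the token travels in an HTTP header and never in the URL. A minimal sketch of that idea with urllib.request follows; build_request and its default user_agent value are hypothetical names chosen for illustration, not part of the documented API.

```python
import urllib.request


def build_request(raw_url, token=None, user_agent="example-agent/1.0"):
    """Build a GET request, attaching the token as a Bearer header if given.

    Hypothetical helper: the URL is passed through untouched, so the token
    never appears in the query string.
    """
    headers = {"User-Agent": user_agent}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(raw_url, headers=headers)


req = build_request(
    "https://raw.githubusercontent.com/myorg/private-repo/main/data.csv",
    token="ghp_xxxxxxxxxxxxxxxxxxxx",
)
```

Keeping the secret in a header means that code which logs only the URL does not expose it.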
- cleanup()
Remove the temporary directory owned by this instance, if any.
Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.
- Return type:
None
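Because cleanup() removes a temporary directory owned by the instance, the downloaded file is best used before cleanup runs. One way to arrange this is a small context manager, sketched here. The name downloaded is hypothetical; it relies only on the download() and cleanup() methods documented on this page and works with any object that provides them.

```python
from contextlib import contextmanager


@contextmanager
def downloaded(downloader):
    """Yield the download result, then always call cleanup().

    Hypothetical wrapper: cleanup() runs even if download() or the body of
    the ``with`` block raises.
    """
    try:
        yield downloader.download()
    finally:
        downloader.cleanup()
```

With a GitHubDownloader instance dl, usage would look like `with downloaded(dl) as result: ...`, processing the file inside the block while the temporary directory still exists.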
- download()
Download the GitHub file and return a DownloadResult.
The blob URL (if given) is normalised to a raw URL first, then downloaded via download_url with an optional Authorization header for private repos.
- Returns:
- DownloadResult
Populated result with local file path, extension, and source URL.
- Raises:
- ValueError
If the SSRF check fails or the size exceeds max_bytes.
- urllib.error.URLError
On network errors.
- Return type:
DownloadResult
Notes
The source_url in the returned DownloadResult is always the original URL passed at construction time, not the resolved raw URL. This preserves the provenance label shown to end users.
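The two exception types listed under Raises can be handled separately by a caller. The wrapper below is a sketch only: try_download is a hypothetical name, and the (result, error) return convention is an illustrative choice rather than anything the library prescribes.

```python
import urllib.error


def try_download(downloader):
    """Return ``(result, None)`` on success or ``(None, message)`` on failure.

    Hypothetical wrapper around the documented download() method.
    """
    try:
        return downloader.download(), None
    except ValueError as exc:
        # Documented for a failed SSRF check or a body larger than max_bytes.
        return None, f"rejected: {exc}"
    except urllib.error.URLError as exc:
        # Documented for network errors.
        return None, f"network error: {exc.reason}"
```

Any other exception propagates unchanged, so unexpected failures stay visible to the caller.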
- resolve_raw_url()
Normalise a GitHub blob URL to its raw.githubusercontent.com equivalent.
- Returns:
- str
Raw content URL. If the input is already a raw URL, it is returned unchanged.
- Return type:
str
Examples
>>> dl = GitHubDownloader("https://github.com/user/repo/blob/main/data.csv")
>>> dl.resolve_raw_url()
'https://raw.githubusercontent.com/user/repo/main/data.csv'
>>> dl2 = GitHubDownloader(
...     "https://raw.githubusercontent.com/user/repo/main/data.csv"
... )
>>> dl2.resolve_raw_url() == dl2.input_url
True