GitHubDownloader

class scikitplot.corpus.GitHubDownloader(input_url, output_path=None, timeout=30.0, max_bytes=104857600, verify_ssl=True, block_private_ips=True, max_redirects=5, user_agent='Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)', token=None)

GitHub URL downloader with automatic blob → raw normalisation.

Accepts both github.com/.../blob/... and raw.githubusercontent.com/... URLs. Blob URLs are silently rewritten to their raw equivalent before downloading.

Parameters:
input_url : str

GitHub blob or raw URL. Accepted forms:

  • https://github.com/OWNER/REPO/blob/REF/path/to/file

  • https://raw.githubusercontent.com/OWNER/REPO/REF/path/to/file

  • https://raw.githubusercontent.com/OWNER/REPO/refs/heads/BRANCH/path

token : str or None, optional

GitHub personal access token (classic or fine-grained). When provided, it is sent as Authorization: Bearer <token> so that private repositories can be accessed. The token is never logged or included in repr. Default: None (anonymous access, public repos only).

output_path : pathlib.Path or None, optional

Directory for the downloaded file. Default: None (temp dir).

timeout : float, optional

HTTP timeout in seconds. Default: 30.0.

max_bytes : int, optional

Download size cap in bytes. Default: 104857600 (100 MiB).

verify_ssl : bool, optional

Verify TLS certificates. Default: True.

block_private_ips : bool, optional

Block requests to private IP addresses as a server-side request forgery (SSRF) safeguard. Default: True.

max_redirects : int, optional

Maximum HTTP redirects. Default: 5.

Raises:
ValueError

If the URL is not a recognised GitHub blob or raw URL at construction time.

ValueError

If the URL points to a directory tree (/tree/), which is not a downloadable file.

Parameters:
  • input_url (str)

  • output_path (Path | None)

  • timeout (float)

  • max_bytes (int)

  • verify_ssl (bool)

  • block_private_ips (bool)

  • max_redirects (int)

  • user_agent (str)

  • token (str | None)

Notes

Blob → raw rewrite rule:

https://github.com/OWNER/REPO/blob/REF/path/to/file.md
      ↓
https://raw.githubusercontent.com/OWNER/REPO/REF/path/to/file.md

The refs/heads/ prefix used by the GitHub UI for branch refs is preserved when already present in raw URLs and not added for blob URLs (blob URLs do not carry it).
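
The rewrite rule above can be sketched in a few lines of standard-library Python. This is an illustration only, not the library's implementation: blob_to_raw is a hypothetical helper name, and the error messages are assumptions. It drops the blob segment, swaps the host, returns raw URLs unchanged, and rejects /tree/ URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url: str) -> str:
    """Hypothetical helper: rewrite a GitHub blob URL to its raw equivalent."""
    parts = urlsplit(url)
    if parts.netloc == "raw.githubusercontent.com":
        # Already a raw URL: return it unchanged (any refs/heads/ prefix stays).
        return url
    if parts.netloc != "github.com":
        raise ValueError(f"not a GitHub blob or raw URL: {url!r}")

    # Expected path shape: OWNER / REPO / blob / REF / path...
    segments = parts.path.strip("/").split("/")
    if len(segments) >= 3 and segments[2] == "tree":
        raise ValueError(f"directory tree URLs are not downloadable files: {url!r}")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob or raw URL: {url!r}")

    owner, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


print(blob_to_raw("https://github.com/OWNER/REPO/blob/REF/path/to/file.md"))
```

Because only the blob segment is removed, a REF that itself contains slashes passes through without special handling.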

Private repo access: Tokens are passed as HTTP headers, never as URL query parameters. Tokens are redacted from all log output.
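
As a rough illustration of header-based authentication, the sketch below builds a urllib.request.Request with the token in the Authorization header and leaves the URL untouched. The names build_request and redact, the placeholder token, and the default user agent string are assumptions made for this example. They are not part of the scikitplot API, and no network request is made.

```python
import urllib.request


def build_request(raw_url, token=None, user_agent="example-agent/1.0"):
    """Hypothetical helper: attach the token as a header, never in the URL."""
    headers = {"User-Agent": user_agent}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(raw_url, headers=headers)


def redact(token):
    """Hypothetical helper: value that is safe to write to logs."""
    return "***" if token else None


req = build_request(
    "https://raw.githubusercontent.com/myorg/private-repo/main/data.csv",
    token="ghp_example_placeholder",  # placeholder, not a real token
)
print(req.full_url)                    # the URL carries no credentials
print(redact("ghp_example_placeholder"))
```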

Examples

Public repo — blob URL:

>>> dl = GitHubDownloader(
...     "https://github.com/scikit-plots/scikit-plots/blob/main/README.md"
... )
>>> result = dl.download()
>>> result.suffix
'.md'

Public repo — raw URL:

>>> dl = GitHubDownloader(
...     "https://raw.githubusercontent.com/scikit-plots/scikit-plots"
...     "/refs/heads/main/README.md"
... )
>>> result = dl.download()

Private repo with PAT:

>>> dl = GitHubDownloader(
...     "https://github.com/myorg/private-repo/blob/main/data.csv",
...     token="ghp_xxxxxxxxxxxxxxxxxxxx",
... )
>>> result = dl.download()
block_private_ips: bool = True
cleanup()

Remove the temporary directory owned by this instance, if any.

Safe to call multiple times. If output_path was supplied at construction time (caller-owned), this method is a no-op.
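
The ownership rule described here can be illustrated with a small stand-in class. TempDirOwner is hypothetical and only mirrors the documented behaviour: a directory the instance created is removed, a caller-supplied directory is left alone, and repeated calls are harmless. The real class may track ownership differently.

```python
import shutil
import tempfile
from pathlib import Path


class TempDirOwner:
    """Hypothetical stand-in for the temp-dir ownership rule."""

    def __init__(self, output_path=None):
        # The instance owns the directory only when it had to create one.
        self._owns_dir = output_path is None
        self.directory = (
            Path(tempfile.mkdtemp()) if self._owns_dir else Path(output_path)
        )

    def cleanup(self):
        # Caller-owned paths are never touched; ignore_errors makes
        # a second call on an already removed directory a no-op.
        if self._owns_dir:
            shutil.rmtree(self.directory, ignore_errors=True)


owner = TempDirOwner()
try:
    pass  # work with files under owner.directory here
finally:
    owner.cleanup()
```

Wrapping the work in try/finally, as above, is one way to make sure the temporary directory is removed even if the download raises.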

Return type:

None

download()

Download the GitHub file and return a DownloadResult.

The blob URL (if given) is normalised to a raw URL first, then downloaded via download_url with an optional Authorization header for private repos.

Returns:
DownloadResult

Populated result with local file path, extension, and source URL.

Raises:
ValueError

If the SSRF check fails or the download size exceeds max_bytes.

urllib.error.URLError

On network errors.

Return type:

DownloadResult

Notes

The source_url in the returned DownloadResult is always the original URL passed at construction time, not the resolved raw URL. This preserves the provenance label shown to end users.
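
The two ValueError conditions listed under Raises (SSRF check and size cap) can be sketched with the standard library. This is a minimal sketch under stated assumptions, not how download_url is implemented: is_blocked_address and read_capped are hypothetical names, and the exact set of address ranges the library blocks is not documented here.

```python
import io
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """Hypothetical check: True for private, loopback, or link-local IPs."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def read_capped(stream, max_bytes, chunk_size=65536):
    """Hypothetical reader: stop with ValueError once max_bytes is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"download exceeds max_bytes={max_bytes}")


print(is_blocked_address("10.0.0.5"))
print(len(read_capped(io.BytesIO(b"abc"), max_bytes=10)))
```

Reading in chunks means an oversized response is rejected shortly after it crosses the cap instead of after the whole body has been buffered.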

input_url: str
max_bytes: int = 104857600
max_redirects: int = 5
output_path: Path | None = None
resolve_raw_url()

Normalise a GitHub blob URL to its raw.githubusercontent.com equivalent.

Returns:
str

Raw content URL. If the input is already a raw URL, it is returned unchanged.

Return type:

str

Examples

>>> dl = GitHubDownloader("https://github.com/user/repo/blob/main/data.csv")
>>> dl.resolve_raw_url()
'https://raw.githubusercontent.com/user/repo/main/data.csv'
>>> dl2 = GitHubDownloader(
...     "https://raw.githubusercontent.com/user/repo/main/data.csv"
... )
>>> dl2.resolve_raw_url() == dl2.input_url
True
timeout: float = 30.0
token: str | None = None
user_agent: str = 'Mozilla/5.0 (compatible; scikitplot-corpus/1.0; +https://github.com/scikit-plots/scikit-plots)'
verify_ssl: bool = True