probe_url_kind

scikitplot.corpus.probe_url_kind(url, *, timeout=15, skip_ssrf_check=False)

Probe a URL with a HEAD request and classify it by its Content-Type header.

Use this when classify_url returns URLKind.WEB_PAGE but the URL has no file extension in the path (e.g. API endpoints like /content, /download, /bitstream). A HEAD request is sent first; if that fails, a GET with stream=True is attempted (some servers reject HEAD). The Content-Type response header is then read and mapped to the corresponding URLKind.
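The HEAD-then-GET flow and the Content-Type mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `Kind`, `fetch_content_type` and `kind_from_content_type` are hypothetical names, the rule "anything other than text/html is downloadable" is an assumption based on the return values documented on this page, and the SSRF check is omitted.

```python
import urllib.request
from enum import Enum


class Kind(Enum):
    """Hypothetical stand-in for URLKind (only the two members used here)."""
    WEB_PAGE = "web_page"
    DOWNLOADABLE = "downloadable"


def fetch_content_type(url, timeout=15):
    """Return the Content-Type header, trying HEAD and then GET; None on failure."""
    for method in ("HEAD", "GET"):
        request = urllib.request.Request(url, method=method)
        try:
            # The response body is not read unless .read() is called,
            # so the GET fallback only inspects the headers.
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.headers.get("Content-Type")
        except OSError:
            # URLError, HTTPError and socket timeouts derive from OSError.
            continue
    return None


def kind_from_content_type(content_type):
    """Map a raw Content-Type value to a Kind (assumed rule, see lead-in)."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if not media_type or media_type == "text/html":
        # A missing header is treated like a failed probe: WEB_PAGE.
        return Kind.WEB_PAGE
    return Kind.DOWNLOADABLE
```

Under these assumptions, `kind_from_content_type(fetch_content_type(url))` would approximate the documented behaviour, including the fall-back to WEB_PAGE when both requests fail.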

Parameters:

url (str): HTTP/HTTPS URL to probe.
timeout (int, optional): Connection + read timeout in seconds. Default: 15.
skip_ssrf_check (bool, optional): Skip SSRF prevention check. Only for trusted internal URLs. Default: False.

Returns:

URLKind

The inferred classification:

URLKind.DOWNLOADABLE — Content-Type indicates a non-HTML binary or structured file (PDF, image, audio, video, archive, CSV, JSON, plain text, etc.).
URLKind.WEB_PAGE — Content-Type is text/html or the probe failed (fail-safe: treat as web page so the caller can still attempt WebReader).

Raises:

ValueError: If url does not start with http:// or https://.

Notes

When to call this: Only when classify_url returns WEB_PAGE and the URL path has no recognisable file extension. For URLs that already have a known extension or are already classified as YOUTUBE / GOOGLE_DRIVE / GITHUB_*, call the faster classify_url directly.
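The decision rule above can be written as a small guard. This is a sketch under stated assumptions: `path_has_extension` and `should_probe` are hypothetical helpers, the classification is passed as a plain string name so the snippet runs without the library, and any dotted suffix on the last path segment counts as an extension (the library may recognise a narrower set).

```python
import posixpath
from urllib.parse import urlparse


def path_has_extension(url):
    """Return True if the last path segment of `url` ends in a dotted suffix."""
    path = urlparse(url).path
    # splitext only looks at the final path component, so "/v1.2/download"
    # is reported as having no extension.
    _, ext = posixpath.splitext(path)
    return bool(ext)


def should_probe(url, kind_name):
    """Probe only when the fast classification was WEB_PAGE and the path has no suffix."""
    return kind_name == "WEB_PAGE" and not path_has_extension(url)
```

Assuming URLKind is an Enum exposing `.name`, a caller could run classify_url first and call probe_url_kind only when `should_probe(url, kind.name)` is true, which avoids the network round trip for URLs that are already classified.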

Network cost: Normally one HEAD request (no body download); if HEAD fails, one additional GET with stream=True. Expect roughly 50-500 ms of added latency, depending on server and network.

Thread safety: This function is stateless and safe to call from multiple threads.
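Because the function is stateless, several URLs can be probed concurrently. The sketch below shows one way to do that with a thread pool. `probe_many` is a hypothetical helper, and `fake_probe` is an offline stand-in for probe_url_kind so the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_many(urls, probe, max_workers=8):
    """Apply `probe` to each URL concurrently and return {url: result}."""
    urls = list(urls)  # materialise so the sequence can be iterated twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        results = list(pool.map(probe, urls))
    return dict(zip(urls, results))


def fake_probe(url):
    # Offline stand-in for probe_url_kind (assumption for illustration only).
    return "DOWNLOADABLE" if url.endswith("/content") else "WEB_PAGE"
```

In real use, `probe` would be probe_url_kind itself. Each call blocks for up to the configured timeout, so the pool size mainly bounds how many requests are in flight at once.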

Developer note:

The function tries requests first (better redirect + timeout handling). It falls back to urllib.request if requests is not installed. The SSRF check is applied before connecting when skip_ssrf_check=False.
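For readers unfamiliar with SSRF checks, the following sketch shows the general idea of rejecting hosts that resolve to internal addresses before connecting. It is an assumption about what such a check typically looks like, not the check this library performs; `is_blocked_address` and `host_is_blocked` are hypothetical names.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_address(ip_text):
    """Return True for addresses a server-side fetch should not reach."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def host_is_blocked(url):
    """Resolve the URL's host and report whether any resolved address is blocked."""
    host = urlparse(url).hostname
    if host is None:
        return True
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # Unresolvable hosts are treated as blocked in this sketch.
        return True
    # Strip any IPv6 scope id ("fe80::1%eth0") before parsing the address.
    return any(is_blocked_address(info[4][0].split("%", 1)[0]) for info in infos)
```

A check of this shape would reject loopback, private-range and link-local targets such as cloud metadata endpoints. Passing skip_ssrf_check=True presumably bypasses the library's equivalent step, which is why the parameter is documented as being for trusted internal URLs only.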

Examples

>>> # An API endpoint returning a PDF with no extension in path
>>> kind = probe_url_kind(
...     "https://iris.who.int/server/api/core/bitstreams/abc/content"
... )
>>> kind == URLKind.DOWNLOADABLE  # Content-Type: application/pdf
True

>>> # A normal web page
>>> kind = probe_url_kind("https://www.example.com/about")
>>> kind == URLKind.WEB_PAGE  # Content-Type: text/html
True