probe_url_kind
- scikitplot.corpus.probe_url_kind(url, *, timeout=15, skip_ssrf_check=False)
Probe a URL with a HEAD request to classify by Content-Type.
Use this when classify_url returns URLKind.WEB_PAGE but the URL has no file extension in the path (e.g. API endpoints like /content, /download, /bitstream). A HEAD request is sent first; if that fails, a GET with stream=True is attempted (some servers reject HEAD). The Content-Type response header is read and mapped to the correct URLKind.
- Parameters:
- url : str
HTTP/HTTPS URL to probe.
- timeout : int, optional
Connection + read timeout in seconds. Default: 15.
- skip_ssrf_check : bool, optional
Skip the SSRF prevention check. Only for trusted internal URLs. Default: False.
- Returns:
- URLKind
The inferred classification:
- URLKind.DOWNLOADABLE: Content-Type indicates a non-HTML binary or structured file (PDF, image, audio, video, archive, CSV, JSON, plain text, etc.).
- URLKind.WEB_PAGE: Content-Type is text/html, or the probe failed (fail-safe: treat as a web page so the caller can still attempt WebReader).
- Raises:
- ValueError
If url does not start with http:// or https://.
Notes
When to call this: Only when classify_url returns WEB_PAGE and the URL path has no recognisable file extension. For URLs that already have a known extension or are already classified as YOUTUBE / GOOGLE_DRIVE / GITHUB_*, call the faster classify_url directly.
Network cost: One HEAD request (no body download), plus a streamed GET if the HEAD request fails. Adds roughly 50-500 ms of latency, depending on server and network.
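The "no recognisable file extension" condition from the note on when to call this function can be checked without any network access. The following is a minimal sketch using only the standard library; path_has_extension is a hypothetical helper name, not part of scikitplot, and it accepts any extension rather than a curated list of known ones.

```python
import posixpath
from urllib.parse import urlsplit


def path_has_extension(url: str) -> bool:
    """Return True if the last path segment of ``url`` has a file extension.

    Hypothetical helper for illustration only; not part of scikitplot.
    """
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path
    # Only the final segment matters: "/v1.2/download" has no extension.
    last_segment = posixpath.basename(path.rstrip("/"))
    _, ext = posixpath.splitext(last_segment)
    return bool(ext)
```

A caller could probe only when this returns False for a URL that classify_url has already labelled as a web page.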
Thread safety: This function is stateless and safe to call from multiple threads.
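Because the function is described as stateless, several URLs can in principle be probed concurrently. The sketch below shows one way to do that with a thread pool; probe_many and its max_workers default are assumptions made for illustration, and the probe callable is passed in so the pattern can be tried with a stub before pointing it at probe_url_kind.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

K = TypeVar("K")


def probe_many(
    urls: Iterable[str],
    probe: Callable[[str], K],
    max_workers: int = 8,
) -> Dict[str, K]:
    """Run ``probe`` on each URL in a thread pool and return {url: result}.

    Hypothetical helper for illustration only; not part of scikitplot.
    Any exception raised by ``probe`` propagates to the caller.
    """
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(url_list, pool.map(probe, url_list)))
```

With the real function this would be called as probe_many(urls, probe_url_kind); each call still costs its own network round trip.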
Developer note:
The function tries requests first (better redirect and timeout handling). It falls back to urllib.request if requests is not installed. The SSRF check is applied before connecting when skip_ssrf_check=False.
Examples
>>> # An API endpoint returning a PDF with no extension in path
>>> kind = probe_url_kind(
...     "https://iris.who.int/server/api/core/bitstreams/abc/content"
... )
>>> kind == URLKind.DOWNLOADABLE  # Content-Type: application/pdf
True

>>> # A normal web page
>>> kind = probe_url_kind("https://www.example.com/about")
>>> kind == URLKind.WEB_PAGE  # Content-Type: text/html
True
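The Content-Type mapping described on this page can also be illustrated offline. The sketch below is not the library's implementation: URLKindSketch is a local stand-in enum with only the two members named in the Returns section, kind_from_content_type is a hypothetical helper, and treating a missing header as a web page is an assumption modelled on the documented fail-safe behaviour.

```python
from enum import Enum
from typing import Optional


class URLKindSketch(Enum):
    """Local stand-in for URLKind; only the two members this page mentions."""

    WEB_PAGE = "web_page"
    DOWNLOADABLE = "downloadable"


def kind_from_content_type(content_type: Optional[str]) -> URLKindSketch:
    """Map a Content-Type header value to a kind (illustrative only)."""
    if not content_type:
        # Assumption: a missing header is handled like a failed probe.
        return URLKindSketch.WEB_PAGE
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("", "text/html"):
        return URLKindSketch.WEB_PAGE
    # Anything that is not HTML (PDF, image, CSV, JSON, ...) is downloadable.
    return URLKindSketch.DOWNLOADABLE
```

For example, kind_from_content_type("application/pdf") yields the downloadable member, while "text/html; charset=utf-8" yields the web-page member.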