WebReader

class scikitplot.corpus.WebReader(input_path, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, custom_extractor=None, custom_extractor_kwargs=<factory>, timeout=30, max_response_bytes=10485760, headers=None, extract_tags=None, allow_private_networks=False, max_content_bytes=50000000)

Fetch a web page and extract structured text via BeautifulSoup.

Each HTML element (title, headings, paragraphs, list items) is yielded as a separate raw chunk with its section type and the source URL as metadata. JavaScript-rendered content is not supported (use Playwright or Selenium for that).

Parameters:

input_path (pathlib.Path): Wrap the URL string as pathlib.Path(url) when constructing directly. Use from_url for the canonical construction path.
source_uri (str or None, optional): The original URL string. Set automatically by from_url(). If None, str(input_path) is used as the URL.
timeout (int, optional): HTTP request timeout in seconds. Default: 30.
max_response_bytes (int, optional): Maximum response body size in bytes. Responses larger than this raise ValueError. Default: 10485760 (10 MiB).
headers (dict or None, optional): Extra HTTP headers to include in the request. A sensible User-Agent is added automatically when not supplied. Default: None.
extract_tags (list of str or None, optional): HTML tags to extract text from. When None (default), uses the built-in set: ["title", "h1" through "h6", "p", "li", "blockquote", "pre", "td"]. Override to narrow or expand extraction.
chunker (ChunkerBase or None, optional): Inherited from DocumentReader.
filter_ (FilterBase or None, optional): Inherited from DocumentReader.
default_language (str or None, optional): Inherited from DocumentReader.
allow_private_networks (bool, optional): If False (default), reject private, loopback, and link-local IP addresses.
max_content_bytes (int, optional): Maximum size in bytes advertised by the Content-Length header, checked before the body is read. Default: 50000000 (50 MB).

Attributes:

file_type (str): Class variable. Registry key ":url".
file_types (list of str): Class variable. Registered extensions: [":url"].

Raises:

ImportError: If requests or beautifulsoup4 is not installed.
ValueError: If the response exceeds max_response_bytes.
RuntimeError: If the HTTP request returns a non-2xx status code.

Parameters:

input_path (Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_uri (str | None)
source_provenance (dict[str, Any])
custom_extractor (Any | None)
custom_extractor_kwargs (dict[str, Any])
timeout (int)
max_response_bytes (int)
headers (dict[str, str] | None)
extract_tags (list[str] | None)
allow_private_networks (bool)
max_content_bytes (int)

See also

scikitplot.corpus._readers.YouTubeReader: YouTube transcript extraction.

Notes

robots.txt: This reader does not enforce robots.txt. Callers are responsible for checking /robots.txt before scraping at scale.
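Because the check is left to the caller, one option is Python's standard-library urllib.robotparser. The following is a minimal sketch and not part of WebReader: the robots.txt body, the example-bot user agent, and the is_allowed helper are hypothetical, and the rules are parsed from a string so nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice, download <site>/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/articles/1")
blocked = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/1")
```

A caller could run such a check before handing each URL to the reader and skip the URLs that are disallowed.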

Rate limiting: For bulk URL ingestion, add delays between calls or use a polite scraping library (scrapy, httpx with backoff).
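A fixed delay between calls is the simplest form of the advice above. This sketch assumes a caller-supplied fetch callable; fetch_politely and fake_fetch are hypothetical names, and the stand-in fetcher keeps the example offline.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 1.0,
) -> Iterator[Tuple[str, object]]:
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        yield url, fetch(url)


# Stand-in fetcher so the sketch runs offline. A real one might wrap
# list(DocumentReader.from_url(url).get_documents()).
def fake_fetch(url: str) -> str:
    return f"fetched {url}"


results = list(
    fetch_politely(
        ["https://example.com/a", "https://example.com/b"],
        fake_fetch,
        delay_seconds=0.01,
    )
)
```

For production-scale crawling, a library with built-in backoff is likely a better fit than a fixed sleep.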

JavaScript-rendered pages: requests fetches only the initial HTML. SPAs (React, Vue, Angular) will yield little or no text. Use playwright or selenium to render and then pass the final HTML to BeautifulSoup manually.

Examples

Via factory (recommended):

>>> from scikitplot.corpus._base import DocumentReader
>>> import scikitplot.corpus._readers
>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> print(f"Extracted {len(docs)} text sections")

Direct construction:

>>> from pathlib import Path
>>> from scikitplot.corpus import WebReader
>>> url = "https://en.wikipedia.org/wiki/Python"
>>> reader = WebReader(input_path=Path(url), source_uri=url)

allow_private_networks: bool = False

If False, reject private/loopback/link-local IP addresses.
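The three address classes named here map directly onto properties of Python's standard ipaddress module. The sketch below only illustrates that classification; is_blocked_address is a hypothetical helper, and how WebReader resolves host names before checking is not described in this reference.

```python
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """Return True for private, loopback, or link-local IP addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


checks = {
    "127.0.0.1": is_blocked_address("127.0.0.1"),        # loopback
    "10.0.0.5": is_blocked_address("10.0.0.5"),          # private range
    "169.254.10.20": is_blocked_address("169.254.10.20"),  # link-local
    "8.8.8.8": is_blocked_address("8.8.8.8"),            # public address
}
```

A complete guard would first resolve the URL's host name and test every address it resolves to.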

chunker: ChunkerBase | None = None

Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classmethod create(*input_path, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)

Instantiate the appropriate reader for one or more sources.

Accepts any mix of file paths, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:

*input_path (str or pathlib.Path)

One or more source paths or URL strings. Each element is classified independently:

str matching ^https?:// (case-insensitive): treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; wrapping collapses the double slash (https:// → https:/) and breaks URL detection.
str not matching the URL pattern: treated as a local file path and converted to pathlib.Path internally.
pathlib.Path: always treated as a local file path and dispatched by extension via the reader registry.

Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.

chunker (ChunkerBase or None, optional)

Chunker injected into every reader. Default: None.

filter_ (FilterBase or None, optional)

Filter injected into every reader. Default: None (DefaultFilter).

filename_override (str or None, optional)

Override the input_path label. Only applied when input_path contains exactly one source. Default: None.

default_language (str or None, optional)

ISO 639-1 language code applied to all sources. Default: None.

source_type (SourceType, list[SourceType or None], or None, optional)

Semantic label for the source kind. When input_path has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.

source_title (str or None, optional)

Title propagated into every yielded document. Default: None.

source_author (str or None, optional)

Author propagated into every yielded document. Default: None.

source_date (str or None, optional)

ISO 8601 publication date. Default: None.

collection_id (str or None, optional)

Corpus collection identifier. Default: None.

doi (str or None, optional)

Digital Object Identifier (file sources only). Default: None.

isbn (str or None, optional)

ISBN (file sources only). Default: None.

**kwargs (Any)

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:

DocumentReader: A single reader when input_path has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when input_path has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:

ValueError: If input_path is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.
TypeError: If any element of input_path is not a str or pathlib.Path.

Parameters:

input_path (str | pathlib.Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | list[SourceType | None] | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)

Return type:

Self

Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.
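The detection rule, and the reason a URL must stay a plain str, can be reproduced with the standard library alone. This is an illustrative sketch: classify_source and URL_PATTERN are hypothetical names, not the library's internals, and PurePosixPath is used so the result does not depend on the operating system.

```python
import re
from pathlib import PurePath, PurePosixPath

# Same pattern as described above: http:// or https://, case-insensitive.
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def classify_source(source) -> str:
    """Label a source as 'url' or 'file' following the documented rule."""
    if isinstance(source, PurePath):
        return "file"  # path objects are never treated as URLs
    if isinstance(source, str):
        return "url" if URL_PATTERN.match(source) else "file"
    raise TypeError(f"expected str or pathlib.Path, got {type(source).__name__}")


url = "HTTPS://example.com/article"
wrapped = PurePosixPath(url)  # pathlib collapses the double slash

kind_of_url = classify_source(url)           # 'url'
kind_of_wrapped = classify_source(wrapped)   # 'file'
collapsed_text = str(wrapped)                # 'HTTPS:/example.com/article'
```

Because the collapsed string no longer matches the pattern, a URL wrapped in a path object cannot be recovered by pattern matching alone.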

Per-source source_type: When passing multiple input_path with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)
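The broadcast rule for source_type (a single value applies to every source, a list must match the number of sources) can be sketched as follows. broadcast_per_source is a hypothetical helper, plain strings stand in for SourceType members, and raising ValueError on a length mismatch is an assumption rather than documented behaviour.

```python
from typing import Any, List


def broadcast_per_source(value: Any, n_sources: int) -> List[Any]:
    """Expand one value to n_sources entries, or validate a per-source list."""
    if isinstance(value, (list, tuple)):
        if len(value) != n_sources:
            # Assumed behaviour: a mismatched list is rejected.
            raise ValueError(f"expected {n_sources} entries, got {len(value)}")
        return list(value)
    return [value] * n_sources


# None entries mean "infer from extension / URL" for that source.
per_source = broadcast_per_source(["podcast", None, "image"], 3)
shared = broadcast_per_source("research", 3)
```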

Reader-specific kwargs (forwarded via **kwargs):

transcribe=True, whisper_model="small" → AudioReader, VideoReader
backend="easyocr" → ImageReader
prefer_backend="pypdf" → PDFReader
classify=True, classifier=fn → AudioReader

Examples

Single file (backward-compatible):

>>> from pathlib import Path
>>> from scikitplot.corpus._base import DocumentReader
>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three

custom_extractor: Any | None = None

User-supplied extraction callable that replaces get_raw_chunks entirely for this reader instance.

When set, _iter_raw_chunks calls custom_extractor(self.input_path, **custom_extractor_kwargs) and normalises the return value through normalize_extractor_output. The built-in get_raw_chunks implementation is not called.

This hook is available on every reader class (ALTOReader, TextReader, PDFReader, ImageReader, etc.) without any subclassing — simply pass a callable at construction time.

Callable contract

def my_extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput

where ExtractorOutput is str, list[str], dict, or list[dict] — the same contract as CustomReader.
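To make the four accepted return shapes concrete, the sketch below turns each of them into a stream of chunk dictionaries. iter_chunks is a hypothetical stand-in written for illustration; the library's own normalize_extractor_output may differ in details such as key handling and validation.

```python
from typing import Any, Dict, Iterator


def iter_chunks(output: Any) -> Iterator[Dict[str, Any]]:
    """Yield one dict per chunk for str, list[str], dict, or list[dict] input."""
    if isinstance(output, str):
        yield {"text": output}
    elif isinstance(output, dict):
        yield dict(output)
    elif isinstance(output, list):
        for item in output:
            if isinstance(item, str):
                yield {"text": item}
            elif isinstance(item, dict):
                yield dict(item)
            else:
                raise TypeError(f"unsupported list item: {type(item).__name__}")
    else:
        raise TypeError(f"unsupported extractor output: {type(output).__name__}")


from_str = list(iter_chunks("whole document"))
from_list = list(iter_chunks(["page one", "page two"]))
from_dicts = list(iter_chunks([{"text": "page one", "page_number": 0}]))
```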

Examples

Override PDF extraction with pdfplumber for a single reader:

import pdfplumber
from pathlib import Path
from scikitplot.corpus._base import DocumentReader

def plumber_fn(path, **kw):
    with pdfplumber.open(path) as pdf:
        return [{"text": p.extract_text() or "", "page_number": i}
                for i, p in enumerate(pdf.pages)]

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=plumber_fn,
)
docs = list(reader.get_documents())

custom_extractor_kwargs: dict[str, Any]

Extra keyword arguments forwarded to custom_extractor on every invocation. Merged into the call as **custom_extractor_kwargs.

Examples

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=my_fn,
    custom_extractor_kwargs={"password": "s3cret", "pages": [0, 1, 2]},
)

default_language: str | None = None

ISO 639-1 language code to assign when the source has no language info.

extract_tags: list[str] | None = None

HTML tags to extract. None uses the built-in defaults.

property file_name: str

Return the URL as the effective file name.

file_type: ClassVar[str | None]

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None] = [':url']

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.

filename_override: str | None = None

Override for the input_path label in generated documents.

filter_: FilterBase | None = None

Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.

Parameters:

manifest_path (str or pathlib.Path)

Path to the manifest file. Supported formats:

.txt / .manifest: one source per line.
.json: a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).

chunker (ChunkerBase or None, optional)

Chunker applied to all sources. Default: None.

filter_ (FilterBase or None, optional)

Filter applied to all sources. Default: None.

default_language (str or None, optional)

ISO 639-1 language code. Default: None.

source_type (SourceType or None, optional)

Override source type for all sources. Default: None.

source_title (str or None, optional)

Override title for all sources. Default: None.

source_author (str or None, optional)

Override author for all sources. Default: None.

source_date (str or None, optional)

Override date for all sources. Default: None.

collection_id (str or None, optional)

Collection identifier. Default: None.

doi (str or None, optional)

DOI override. Default: None.

isbn (str or None, optional)

ISBN override. Default: None.

encoding (str, optional)

Text encoding for .txt manifests. Default: "utf-8".

**kwargs (Any)

Forwarded to each reader constructor.

Returns:

_MultiSourceReader: Multi-source reader chaining all manifest entries.

Raises:

ValueError: If manifest_path does not exist or is empty after filtering blank and comment lines.
ValueError: If the manifest format is not recognised.

Parameters:

manifest_path (str | Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
encoding (str)
kwargs (Any)

Return type:

_MultiSourceReader

Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024"
}

String-level source_type values are coerced via SourceType(value) and an invalid value raises ValueError.
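A JSON manifest mixing plain strings and objects can be reduced to a uniform list of entries before any reader is built. The sketch below shows one way to do that with the standard json module; parse_json_manifest is a hypothetical helper, and the SourceType coercion step is left out because the enum is not reproduced here.

```python
import json
from typing import Any, Dict, List


def parse_json_manifest(text: str) -> List[Dict[str, Any]]:
    """Normalise a JSON manifest into a list of dicts that each have 'source'."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("JSON manifest must be an array")
    entries: List[Dict[str, Any]] = []
    for item in data:
        if isinstance(item, str):
            entries.append({"source": item})  # bare string: source only
        elif isinstance(item, dict) and "source" in item:
            entries.append(dict(item))  # object: keep per-entry overrides
        else:
            raise ValueError(f"invalid manifest entry: {item!r}")
    return entries


manifest_text = """
[
    "scan.jpg",
    {"source": "https://example.com/report.pdf", "source_type": "research"}
]
"""
entries = parse_json_manifest(manifest_text)
```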

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
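The text-manifest rules (one source per line, blank lines and lines starting with # ignored, an empty result rejected) can be sketched in a few lines. parse_text_manifest is a hypothetical helper; stripping surrounding whitespace before testing for # is an assumption.

```python
from typing import List


def parse_text_manifest(text: str) -> List[str]:
    """Return the non-blank, non-comment lines of a text manifest."""
    sources: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()  # assumption: surrounding whitespace is ignored
        if not stripped or stripped.startswith("#"):
            continue
        sources.append(stripped)
    if not sources:
        raise ValueError("manifest is empty after filtering")
    return sources


manifest_text = """\
# WHO corpus
https://www.who.int/europe/news/item/...

https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg
"""
sources = parse_text_manifest(manifest_text)
```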

classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.
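The dispatch decision can be pictured as a host-name check on the parsed URL. The sketch below is illustrative only: pick_reader_name is a hypothetical helper that returns a class name as a string, and the set of hosts treated as YouTube is an assumption, since the exact detection rule is not documented here.

```python
from urllib.parse import urlparse

# Assumed host list; the library's real YouTube detection may differ.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def pick_reader_name(url: str) -> str:
    """Return the name of the reader class a URL would be routed to."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"URL must start with http:// or https://: {url!r}")
    host = (parsed.hostname or "").lower()
    return "YouTubeReader" if host in YOUTUBE_HOSTS else "WebReader"


web = pick_reader_name("https://en.wikipedia.org/wiki/Python")
video = pick_reader_name("https://www.youtube.com/watch?v=rwPISgZcYIk")
short = pick_reader_name("https://youtu.be/rwPISgZcYIk")
```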

Parameters:

url (str): Full URL string. Must start with http:// or https://.
chunker (ChunkerBase or None, optional): Chunker to inject. Default: None.
filter_ (FilterBase or None, optional): Filter to inject. Default: None (DefaultFilter).
filename_override (str or None, optional): Override for the input_path label. Default: None.
default_language (str or None, optional): ISO 639-1 language code. Default: None.
source_type (SourceType or None, optional): Semantic label for the source. Default: None.
source_title (str or None, optional): Title of the source work. Default: None.
source_author (str or None, optional): Primary author. Default: None.
source_date (str or None, optional): Publication date in ISO 8601 format. Default: None.
collection_id (str or None, optional): Corpus collection identifier. Default: None.
doi (str or None, optional): Digital Object Identifier. Default: None.
isbn (str or None, optional): International Standard Book Number. Default: None.
**kwargs (Any): Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:

DocumentReader: YouTubeReader or WebReader instance.

Raises:

ValueError: If url does not start with http:// or https://.
ImportError: If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).

Parameters:

url (str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)

Return type:

Self

Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url, so you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())

>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())

get_documents()

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

1. validate_input: fail fast if the file is missing.
2. get_raw_chunks: format-specific text extraction.
3. Chunker (if set): sub-segments each raw block.
4. CorpusDocument construction with validated schema.
5. Filter: discards noise documents.

Yields:

CorpusDocument: Validated documents that passed the filter.

Raises:

ValueError: If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (input_path, chunk_index) is a unique key within one reader run.

Omitted-document statistics are logged at INFO level after processing each file.
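The single counter shared by raw chunks and their sub-chunks can be illustrated with a small generator. This is a simplified sketch, not the library's code: run_pipeline is a hypothetical name, plain dicts stand in for CorpusDocument, and assigning the index before filtering is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


def run_pipeline(
    raw_chunks: Iterable[Dict[str, Any]],
    chunker: Optional[Callable[[str], List[str]]] = None,
    keep: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> Iterator[Dict[str, Any]]:
    """Chunk, index, and filter raw chunks with one global chunk_index."""
    chunk_index = 0
    for raw in raw_chunks:
        # Without a chunker, each raw chunk becomes exactly one document.
        pieces = chunker(raw["text"]) if chunker is not None else [raw["text"]]
        for piece in pieces:
            document = {**raw, "text": piece, "chunk_index": chunk_index}
            chunk_index += 1  # never reset between raw chunks
            if keep(piece):
                yield document


raw = [{"text": "First sentence. Second sentence."}, {"text": "Third."}]
docs = list(run_pipeline(raw, chunker=lambda text: text.split(". ")))
indices = [d["chunk_index"] for d in docs]
```

Because the counter is never reset, the index continues from the first raw chunk's sub-chunks into the second raw chunk.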

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True

get_raw_chunks()

Fetch the URL and yield one chunk per HTML text element.

Yields:

dict

Keys:

"text": Extracted text for this element.
"section_type": SectionType.TITLE, SectionType.HEADER, or SectionType.TEXT.
"source_type": Always SourceType.WEB; promoted to CorpusDocument.source_type.
"html_tag": Original HTML tag name (e.g. "p", "h2").
"url": Source URL; promoted to CorpusDocument.url.
"element_index": Zero-based position of this element in the extraction order.

Raises:

ImportError: If requests or beautifulsoup4 is not installed.
ValueError: If the response body exceeds max_response_bytes.
RuntimeError: If the HTTP response status is not 2xx.

Return type:

Generator[dict[str, Any], None, None]
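The per-element chunking can be approximated without any third-party package. The sketch below deliberately swaps BeautifulSoup for the standard-library html.parser so it runs anywhere; TagTextCollector is a hypothetical class, only a subset of the documented keys is produced, and nested extractable tags are handled more crudely than a real HTML tree would allow.

```python
from html.parser import HTMLParser

DEFAULT_TAGS = ("title", "h1", "h2", "h3", "h4", "h5", "h6",
                "p", "li", "blockquote", "pre", "td")


class TagTextCollector(HTMLParser):
    """Collect one chunk per closed element whose tag is in `tags`."""

    def __init__(self, url, tags=DEFAULT_TAGS):
        super().__init__()
        self.url = url
        self.tags = set(tags)
        self.chunks = []
        self._open = []  # stack of (tag, list of text parts)

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open.append((tag, []))

    def handle_data(self, data):
        if self._open:  # text outside extractable tags is ignored
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, parts = self._open.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.chunks.append({
                    "text": text,
                    "html_tag": tag,
                    "url": self.url,
                    "element_index": len(self.chunks),
                })


html = ("<html><head><title>Demo</title></head><body><h2>Intro</h2>"
        "<p>First <b>bold</b> paragraph.</p><div>skipped</div></body></html>")
collector = TagTextCollector(url="https://example.com/demo")
collector.feed(html)
collector.close()
chunks = collector.chunks
```

Passing a narrower tags tuple mirrors the effect of overriding extract_tags.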

headers: dict[str, str] | None = None

Extra HTTP request headers.
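The parameter description states that a User-Agent is added only when the caller has not supplied one. One way such a merge could work is sketched below; build_headers and the DEFAULT_USER_AGENT value are hypothetical, and the case-insensitive comparison is an assumption based on how HTTP header names behave.

```python
from typing import Dict, Optional

DEFAULT_USER_AGENT = "example-reader/0.1"  # hypothetical value


def build_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy caller headers and add a default User-Agent if none was given."""
    headers = dict(extra or {})
    # HTTP header names are case-insensitive, so compare in lowercase.
    if not any(name.lower() == "user-agent" for name in headers):
        headers["User-Agent"] = DEFAULT_USER_AGENT
    return headers


defaulted = build_headers({"Accept": "text/html"})
custom = build_headers({"user-agent": "my-crawler/2.0"})
```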

input_path: Path

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

max_content_bytes: int = 50000000

Maximum content size in bytes.

max_response_bytes: int = 10485760

Maximum streamed response body size in bytes.

WebReader enforces two independent byte limits:

max_content_bytes: checked against the Content-Length HTTP header before the body is read. Aborts early, without downloading anything, when the server advertises an oversized response. The server may omit the header, in which case this check cannot fire.
max_response_bytes: checked against the actual bytes read during streaming. Aborts mid-stream if the body exceeds this value, even when no Content-Length header was sent.

Both limits must be > 0 (validated in __post_init__). Default for max_response_bytes: 10485760 (10 MiB).
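The interaction of the two limits is easiest to see on a simulated response. In the sketch below an iterable of byte chunks stands in for the streamed body; read_limited is a hypothetical helper, and raising ValueError for the header check is an assumption (the reference only documents ValueError for max_response_bytes).

```python
from typing import Iterable, Optional


def read_limited(
    chunks: Iterable[bytes],
    content_length: Optional[int],
    max_content_bytes: int = 50_000_000,
    max_response_bytes: int = 10_485_760,
) -> bytes:
    """Read a streamed body while enforcing both byte limits."""
    if max_content_bytes <= 0 or max_response_bytes <= 0:
        raise ValueError("both limits must be > 0")
    # Check 1: advertised size, before any body bytes are consumed.
    if content_length is not None and content_length > max_content_bytes:
        raise ValueError("Content-Length exceeds max_content_bytes")
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        # Check 2: actual size, works even without a Content-Length header.
        if len(body) > max_response_bytes:
            raise ValueError("response body exceeds max_response_bytes")
    return bytes(body)


small = read_limited([b"abc", b"def"], content_length=6,
                     max_content_bytes=100, max_response_bytes=10)
```

A response with no Content-Length header skips the first check entirely and is bounded only by the streaming check.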

source_provenance: dict[str, Any]

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_path is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_path=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )

classmethod subclass_by_type()

Return a copy of the extension → reader class registry.

Returns:

dict: Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True

classmethod supported_types()

Return a sorted list of file extensions supported by registered readers.

Returns:

list of str: Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']

timeout: int = 30

HTTP request timeout in seconds.

validate_input()

Validate the URL format instead of checking for a local file.

Raises:

ValueError: If the URL does not start with http:// or https://.

Return type:

None