CustomReader#

class scikitplot.corpus.CustomReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, custom_extractor=None, custom_extractor_kwargs=<factory>, extractor=None, extensions=None, reader_kwargs=<factory>, default_source_type=SourceType.UNKNOWN, default_section_type=SectionType.TEXT, validate_file=True)[source]#

Fully user-customizable reader for any file extension and resource type.

CustomReader accepts any file extension and a caller-supplied extractor callable as its text-extraction engine. This lets users integrate arbitrary third-party or proprietary extraction libraries — pdfplumber, surya, docling, proprietary ASR/OCR APIs, in-memory streams — without writing a full DocumentReader subclass.

Two usage modes are supported:

Direct use (bypass the extension registry):

reader = CustomReader(
    input_file=Path("report.xyz"),
    extractor=my_extractor_fn,
)
docs = list(reader.get_documents())

Registered use (wire into DocumentReader.create()):

CustomReader.register(
    name="XYZReader",
    extensions=[".xyz"],
    extractor=my_extractor_fn,
)
# DocumentReader.create(Path("report.xyz")) now works automatically.
Parameters:
input_file : pathlib.Path

Path to the source file (or a synthetic path for non-filesystem resources — set validate_file=False in that case).

extractor : callable or None, optional

User-supplied extraction function. Signature:

def extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput

where ExtractorOutput is one of:

  • str — full-file text as one chunk.

  • list[str] — one string per logical segment.

  • dict — single chunk with "text" key and optional metadata.

  • list[dict] — multiple chunks, each with a "text" key.

None is accepted so that register-produced subclasses can be instantiated without explicitly passing an extractor (the bound extractor is injected by __post_init__ in the subclass). Raises ValueError at extraction time if still None. Default: None.
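A minimal extractor covering the list[dict] shape might look like this (an illustrative sketch; paragraph_extractor and its splitting rule are hypothetical, not part of the API):

```python
from pathlib import Path


def paragraph_extractor(path: Path, **kwargs) -> list[dict]:
    # Hypothetical extractor: one chunk per blank-line-separated paragraph.
    text = path.read_text(encoding=kwargs.get("encoding", "utf-8"))
    return [
        {"text": para.strip(), "paragraph_number": i}
        for i, para in enumerate(text.split("\n\n"))
        if para.strip()  # whitespace-only segments are dropped
    ]
```

Returning a plain str, or a single dict with a "text" key, would be handled equivalently by normalize_extractor_output.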

extensions : list of str or None, optional

File extensions this instance handles (e.g. [".abc"]). Used only by register to label the generated subclass; has no effect in single-instance usage. Default: None.

reader_kwargs : dict, optional

Extra keyword arguments forwarded to extractor on every call. Default: {} (empty).

default_source_type : SourceType, optional

Fallback source type for chunks where the extractor does not set "source_type". Default: UNKNOWN.

default_section_type : SectionType, optional

Fallback section type for chunks where the extractor does not set "section_type". Default: TEXT.

validate_file : bool, optional

When True (default), validate_input checks that input_file exists and is a regular file before extraction. Set to False for non-filesystem sources (network streams, in-memory paths) where input_file is a synthetic path. Default: True.

chunker : ChunkerBase or None, optional

Inherited from DocumentReader.

filter_ : FilterBase or None, optional

Inherited from DocumentReader.

filename_override : str or None, optional

Inherited from DocumentReader.

default_language : str or None, optional

Inherited from DocumentReader.

Attributes:
file_type : ClassVar[None]

Always None. CustomReader does not auto-register for any extension. Use register to create a registered subclass.

Raises:
TypeError

If extractor is not callable (and not None).

ValueError

If any element of extensions does not start with '.' or ':'.

ValueError

If extractor is None when get_raw_chunks is called.


See also

CustomReader.register

Dynamically register a named subclass.

normalize_extractor_output

Coerce extractor return values.

scikitplot.corpus._readers.PDFReader

Built-in PDF reader with prefer_backend="custom" option.

scikitplot.corpus._readers.ImageReader

Built-in image reader with backend="custom" option.

scikitplot.corpus._base.DocumentReader

Abstract base class.

Notes

Extractor kwargs: reader_kwargs is forwarded as **reader_kwargs to the extractor. Use it to pass library-specific options (e.g. {"password": "hunter2"} for an encrypted PDF extractor, or {"language": "en"} for an ASR extractor).

Thread safety: CustomReader instances are not thread-safe. Create one instance per thread when parallelising.

Empty chunks: the downstream DefaultFilter discards whitespace-only chunks, consistent with all other readers. Empty strings returned by the extractor are silently skipped.
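The whitespace rule can be sketched as follows (drop_empty_chunks is a hypothetical stand-in for DefaultFilter, shown only to make the behaviour concrete):

```python
def drop_empty_chunks(chunks: list) -> list:
    # Keep only chunks whose "text" contains non-whitespace characters.
    return [c for c in chunks if c.get("text", "").strip()]
```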

Examples

Plug in pdfplumber as a custom PDF backend:

>>> import pdfplumber
>>> from pathlib import Path
>>> from scikitplot.corpus._readers._custom import CustomReader
>>>
>>> def pdfplumber_extract(path, **kw):
...     with pdfplumber.open(path) as pdf:
...         return [
...             {"text": page.extract_text() or "", "page_number": i}
...             for i, page in enumerate(pdf.pages)
...         ]
>>>
>>> reader = CustomReader(
...     input_file=Path("report.pdf"),
...     extractor=pdfplumber_extract,
... )
>>> docs = list(reader.get_documents())

Register globally and use via factory:

>>> from scikitplot.corpus._base import DocumentReader
>>> from scikitplot.corpus._schema import SourceType
>>> CustomReader.register(
...     name="PdfPlumberReader",
...     extensions=[".pdf"],
...     extractor=pdfplumber_extract,
...     default_source_type=SourceType.RESEARCH,
... )
>>> reader = DocumentReader.create(Path("report.pdf"))
>>> docs = list(reader.get_documents())

Custom audio transcription (e.g. a proprietary ASR API):

>>> def my_asr(path, language="en", **kw):
...     result = my_asr_client.transcribe(path, lang=language)
...     return [
...         {"text": seg.text, "timecode_start": seg.start, "timecode_end": seg.end}
...         for seg in result.segments
...     ]
>>>
>>> CustomReader.register(
...     name="MyASRReader",
...     extensions=[".mp3", ".wav", ".flac"],
...     extractor=my_asr,
...     reader_kwargs={"language": "de"},
...     default_source_type=SourceType.PODCAST,
... )

Non-filesystem source (validate_file=False):

>>> def stream_extractor(path, **kw):
...     # path is a synthetic Path wrapping a stream identifier
...     data = fetch_from_stream(str(path))
...     return data.decode("utf-8")
>>>
>>> reader = CustomReader(
...     input_file=Path("stream://channel/42"),
...     extractor=stream_extractor,
...     validate_file=False,
... )
chunker: ChunkerBase | None = None#

Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for one or more sources.

Accepts any mix of file paths, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:
*inputs : pathlib.Path or str

One or more source paths or URL strings. Each element is classified independently:

  • str matching ^https?:// (case-insensitive) — treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; wrapping collapses the double slash (https:// becomes https:/) and breaks URL detection.

  • str not matching the URL pattern — treated as a local file path and converted to pathlib.Path internally.

  • pathlib.Path — always treated as a local file path and dispatched by extension via the reader registry.

Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.

chunker : ChunkerBase or None, optional

Chunker injected into every reader. Default: None.

filter_ : FilterBase or None, optional

Filter injected into every reader. Default: None (DefaultFilter).

filename_override : str or None, optional

Override the source_file label. Only applied when inputs contains exactly one source. Default: None.

default_language : str or None, optional

ISO 639-1 language code applied to all sources. Default: None.

source_type : SourceType, list[SourceType or None], or None, optional

Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.

source_title : str or None, optional

Title propagated into every yielded document. Default: None.

source_author : str or None, optional

Author propagated into every yielded document. Default: None.

source_date : str or None, optional

ISO 8601 publication date. Default: None.

collection_id : str or None, optional

Corpus collection identifier. Default: None.

doi : str or None, optional

Digital Object Identifier (file sources only). Default: None.

isbn : str or None, optional

ISBN (file sources only). Default: None.

**kwargs : Any

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:
DocumentReader

A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:
ValueError

If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.

TypeError

If any element of inputs is not a str or pathlib.Path.


Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.
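The classification rule can be reproduced in a few lines (a sketch of the documented pattern; classify is hypothetical, not the library's internal helper):

```python
import re
from pathlib import Path

URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def classify(source):
    # Only plain strings can match the URL pattern; Path objects are
    # always treated as local files.
    if isinstance(source, str) and URL_RE.match(source):
        return "url"
    return "file"
```

Note that classify(Path("https://example.com")) yields "file": Path construction collapses the double slash, which is exactly why URLs must be passed as plain strings.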

Per-source source_type: When passing multiple inputs with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)

Reader-specific kwargs (forwarded via **kwargs): e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader.

Examples

Single file (backward-compatible):

>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
custom_extractor: Any | None = None#

User-supplied extraction callable that replaces get_raw_chunks entirely for this reader instance.

When set, _iter_raw_chunks calls custom_extractor(self.input_file, **custom_extractor_kwargs) and normalises the return value through normalize_extractor_output. The built-in get_raw_chunks implementation is not called.

This hook is available on every reader class (ALTOReader, TextReader, PDFReader, ImageReader, etc.) without any subclassing — simply pass a callable at construction time.

Examples

Override PDF extraction with pdfplumber for a single reader:

import pdfplumber
from pathlib import Path
from scikitplot.corpus._base import DocumentReader

def plumber_fn(path, **kw):
    with pdfplumber.open(path) as pdf:
        return [{"text": p.extract_text() or "", "page_number": i}
                for i, p in enumerate(pdf.pages)]

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=plumber_fn,
)
docs = list(reader.get_documents())
custom_extractor_kwargs: dict[str, Any][source]#

Extra keyword arguments forwarded to custom_extractor on every invocation. Merged into the call as **custom_extractor_kwargs.

Examples

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=my_fn,
    custom_extractor_kwargs={"password": "s3cret", "pages": [0, 1, 2]},
)
default_language: str | None = None#

ISO 639-1 language code to assign when the source has no language info.

default_section_type: SectionType = 'text'[source]#

Fallback SectionType for chunks where the extractor does not set "section_type".

default_source_type: SourceType = 'unknown'[source]#

Fallback SourceType for chunks where the extractor does not set "source_type".

extensions: list[str] | None = None#

Extensions this instance handles. Informational only for single-instance usage; meaningful for register where it controls which extensions are wired into the DocumentReader registry.

extractor: Callable[[...], Any] | None = None#

User-supplied extraction callable. Accepts pathlib.Path plus any **reader_kwargs and must return a value normalizable by normalize_extractor_output. None is allowed here so that register-generated subclasses can be instantiated through the create factory without explicitly passing an extractor. Raises ValueError at extraction time if still None.

property file_name: str#

Effective filename used in document labels.

Returns filename_override when set; otherwise returns input_file.name.

Returns:
str

File name string (not a full path).

Examples

>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
file_type: ClassVar[str | None] = None#

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None][source]#

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
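The precedence rule amounts to the following (an illustrative helper; effective_extensions is hypothetical, not the registry's actual code):

```python
def effective_extensions(reader_cls) -> list:
    # file_types (plural) wins when both are defined; fall back to file_type.
    file_types = getattr(reader_cls, "file_types", None)
    if file_types:
        return list(file_types)
    file_type = getattr(reader_cls, "file_type", None)
    return [file_type] if file_type else []
```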

filename_override: str | None = None#

Override for the source_file label in generated documents.

filter_: FilterBase | None = None#

Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)[source]#

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.
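The parsing rules can be sketched as follows (read_manifest is a hypothetical stand-in, not the actual implementation):

```python
import json
from pathlib import Path


def read_manifest(path: Path) -> list:
    # JSON manifests are a list of strings or objects; text manifests are
    # one source per line, skipping blank lines and '#' comments.
    if path.suffix == ".json":
        return json.loads(path.read_text(encoding="utf-8"))
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]
```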

Parameters:
manifest_path : pathlib.Path or str

Path to the manifest file. Supported formats:

  • .txt / .manifest — one source per line.

  • .json — a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).

chunker : ChunkerBase or None, optional

Chunker applied to all sources. Default: None.

filter_ : FilterBase or None, optional

Filter applied to all sources. Default: None.

default_language : str or None, optional

ISO 639-1 language code. Default: None.

source_type : SourceType or None, optional

Override source type for all sources. Default: None.

source_title : str or None, optional

Override title for all sources. Default: None.

source_author : str or None, optional

Override author for all sources. Default: None.

source_date : str or None, optional

Override date for all sources. Default: None.

collection_id : str or None, optional

Collection identifier. Default: None.

doi : str or None, optional

DOI override. Default: None.

isbn : str or None, optional

ISBN override. Default: None.

encoding : str, optional

Text encoding for .txt manifests. Default: "utf-8".

**kwargs : Any

Forwarded to each reader constructor.

Returns:
_MultiSourceReader

Multi-source reader chaining all manifest entries.

Raises:
ValueError

If manifest_path does not exist or is empty after filtering blank and comment lines.

ValueError

If the manifest format is not recognised.


Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024"
}

String-level source_type values are coerced via SourceType(value) and an invalid value raises ValueError.
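The coercion follows standard Enum call semantics; with a reduced stand-in enum (the real SourceType lives in scikitplot.corpus._schema and has more members):

```python
from enum import Enum


class DemoSourceType(Enum):
    # Reduced stand-in for scikitplot.corpus._schema.SourceType.
    RESEARCH = "research"
    PODCAST = "podcast"
    UNKNOWN = "unknown"


def coerce_source_type(value):
    # Strings are coerced via the enum constructor; invalid values raise
    # ValueError, matching the documented manifest behaviour.
    if isinstance(value, DemoSourceType):
        return value
    return DemoSourceType(value)
```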

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.

Parameters:
url : str

Full URL string. Must start with http:// or https://.

chunker : ChunkerBase or None, optional

Chunker to inject. Default: None.

filter_ : FilterBase or None, optional

Filter to inject. Default: None (DefaultFilter).

filename_override : str or None, optional

Override for the source_file label. Default: None.

default_language : str or None, optional

ISO 639-1 language code. Default: None.

source_type : SourceType or None, optional

Semantic label for the source. Default: None.

source_title : str or None, optional

Title of the source work. Default: None.

source_author : str or None, optional

Primary author. Default: None.

source_date : str or None, optional

Publication date in ISO 8601 format. Default: None.

collection_id : str or None, optional

Corpus collection identifier. Default: None.

doi : str or None, optional

Digital Object Identifier. Default: None.

isbn : str or None, optional

International Standard Book Number. Default: None.

**kwargs : Any

Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:
DocumentReader

YouTubeReader or WebReader instance.

Raises:
ValueError

If url does not start with http:// or https://.

ImportError

If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).


Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url — you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())
get_documents()[source]#

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

  1. validate_input — fail fast if file is missing.

  2. get_raw_chunks — format-specific text extraction.

  3. Chunker (if set) — sub-segments each raw block.

  4. CorpusDocument construction with validated schema.

  5. Filter — discards noise documents.

Yields:
CorpusDocument

Validated documents that passed the filter.

Raises:
ValueError

If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.
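The numbering scheme can be sketched as follows (number_chunks is hypothetical; the real pipeline also builds CorpusDocument instances and applies the filter):

```python
def number_chunks(raw_blocks, chunker=None):
    # One file-global counter across raw blocks AND their sub-chunks, so
    # (source_file, chunk_index) stays unique within a single reader run.
    index = 0
    for block in raw_blocks:
        for text in (chunker(block) if chunker else [block]):
            yield {"chunk_index": index, "text": text}
            index += 1
```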

Omitted-document statistics are logged at INFO level after processing each file.

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
get_raw_chunks()[source]#

Call the user-supplied extractor and yield normalised raw chunk dicts.

Calls self.extractor(self.input_file, **self.reader_kwargs) and normalises the return value with normalize_extractor_output.

Yields:
dict

Each dict has at least {"text": str}, with "section_type" and "source_type" defaults filled in, plus any metadata returned by the extractor.

Raises:
ValueError

If extractor is None at call time.

TypeError

If the extractor returns an unsupported type.

ValueError

If any dict returned by the extractor lacks a "text" key.

RuntimeError

If the extractor raises an unexpected exception. The original exception is chained via from.

Return type:

Generator[dict[str, Any], None, None]

Notes

Logging at INFO level records the extractor name, file name, and chunk count. DEBUG records the kwargs forwarded.

input_file: Path[source]#

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

reader_kwargs: dict[str, Any][source]#

Extra keyword arguments forwarded to extractor on every call.

classmethod register(*, name, extensions, extractor, reader_kwargs=None, default_source_type=SourceType.UNKNOWN, default_section_type=SectionType.TEXT, validate_file=True)[source]#

Create a named CustomReader subclass and register it by extension.

After calling register, DocumentReader.create automatically dispatches files with any of the given extensions to extractor.

Parameters:
name : str

Class name for the generated subclass (e.g. "PdfPlumberReader"). Must be a valid Python identifier.

extensions : list of str

File extensions to register (e.g. [".pdf"]). Each must start with '.' (file extension) or ':' (URL-scheme key). Existing registrations for these extensions emit a warning and are replaced, consistent with the base-class registry behaviour.

extractor : callable

Extraction callable. Signature:

def extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput

reader_kwargs : dict or None, optional

Default keyword arguments forwarded to extractor. Instance-level reader_kwargs (passed directly to the constructor) are merged on top: instance kwargs override registered defaults. Default: {} (empty).

default_source_type : SourceType, optional

Source type applied to chunks that do not set "source_type". Default: UNKNOWN.

default_section_type : SectionType, optional

Section type applied to chunks that do not set "section_type". Default: TEXT.

validate_file : bool, optional

When False, skip the filesystem existence check. Default: True.

Returns:
type[CustomReader]

The newly created and registered subclass. The caller can keep a reference to it for type-checking or documentation, but it is not required — the subclass is also stored in _registry.

Raises:
ValueError

If name is not a valid Python identifier.

ValueError

If extensions is empty or any element has an invalid prefix.

TypeError

If extractor is not callable.


Notes

Subclass lifetime — each call to register creates a new class object. Calling register again with the same name produces a distinct class object. The last registration for a given extension wins (matching the general registry policy).

reader_kwargs merging — instance-level kwargs (passed when constructing the reader) are merged on top of the registered defaults:

# Registered defaults: {"language": "en"}
# Instance override: {"language": "de"}
reader = DocumentReader.create(
    Path("file.mp3"),
    reader_kwargs={"language": "de"},  # forwarded via **kwargs
)
# extractor receives language="de"
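The merge itself is plain dict-unpacking semantics with instance values winning (a sketch of the documented policy; merge_reader_kwargs is hypothetical, not the registry code):

```python
def merge_reader_kwargs(registered, instance=None):
    # Instance-level kwargs override registered defaults key by key.
    return {**registered, **(instance or {})}
```

For example, merging registered defaults {"language": "en", "model": "base"} with instance kwargs {"language": "de"} gives {"language": "de", "model": "base"}.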

Type annotation — the returned class is typed as type[CustomReader]. If you need the precise subclass type, assign it directly:

MyReader = CustomReader.register(name="MyReader", ...)

Examples

Register a pdfplumber-based PDF reader:

>>> import pdfplumber
>>> from pathlib import Path
>>> from scikitplot.corpus._readers._custom import CustomReader
>>> from scikitplot.corpus._schema import SourceType
>>> from scikitplot.corpus._base import DocumentReader
>>>
>>> def pdfplumber_extract(path, **kw):
...     with pdfplumber.open(path) as pdf:
...         return [
...             {"text": p.extract_text() or "", "page_number": i}
...             for i, p in enumerate(pdf.pages)
...         ]
>>>
>>> PdfPlumberReader = CustomReader.register(
...     name="PdfPlumberReader",
...     extensions=[".pdf"],
...     extractor=pdfplumber_extract,
...     default_source_type=SourceType.RESEARCH,
... )
>>> docs = list(DocumentReader.create(Path("paper.pdf")).get_documents())

Register a multi-extension audio reader using a proprietary API:

>>> MyASRReader = CustomReader.register(
...     name="MyASRReader",
...     extensions=[".mp3", ".wav", ".flac"],
...     extractor=my_asr_fn,
...     reader_kwargs={"model": "large-v3", "language": "en"},
...     default_source_type=SourceType.PODCAST,
... )
source_provenance: dict[str, Any][source]#

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None#

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
classmethod subclass_by_type()[source]#

Return a copy of the extension → reader class registry.

Returns:
dict

Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
classmethod supported_types()[source]#

Return a sorted list of file extensions supported by registered readers.

Returns:
list of str

Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
validate_file: bool = True#

When False, skip the filesystem existence check in validate_input. Use for non-filesystem resources where input_file is a synthetic path.

validate_input()[source]#

Check source accessibility.

Delegates to the parent implementation when validate_file is True; skips the filesystem check entirely when it is False (for non-filesystem sources).

Raises:
ValueError

If validate_file is True and the file does not exist or is not a regular file.

Return type:

None