CustomReader
- class scikitplot.corpus.CustomReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, custom_extractor=None, custom_extractor_kwargs=<factory>, extractor=None, extensions=None, reader_kwargs=<factory>, default_source_type=SourceType.UNKNOWN, default_section_type=SectionType.TEXT, validate_file=True)
Fully user-customizable reader for any file extension and resource type.
CustomReader accepts any file extension and a caller-supplied extractor callable as its text-extraction engine. This lets users integrate arbitrary third-party or proprietary extraction libraries (pdfplumber, surya, docling, proprietary ASR/OCR APIs, in-memory streams) without writing a full DocumentReader subclass.
Two usage modes are supported:
Direct use (bypass the extension registry):
    reader = CustomReader(
        input_file=Path("report.xyz"),
        extractor=my_extractor_fn,
    )
    docs = list(reader.get_documents())
Registered use (wire into DocumentReader.create()):
    CustomReader.register(
        name="XYZReader",
        extensions=[".xyz"],
        extractor=my_extractor_fn,
    )
    # DocumentReader.create(Path("report.xyz")) now works automatically.
- Parameters:
- input_file : pathlib.Path
Path to the source file (or a synthetic path for non-filesystem resources; set validate_file=False in that case).
- extractor : callable or None, optional
User-supplied extraction function. Signature:
    def extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput
where ExtractorOutput is one of:
  - str: full-file text as one chunk.
  - list[str]: one string per logical segment.
  - dict: single chunk with a "text" key and optional metadata.
  - list[dict]: multiple chunks, each with a "text" key.
None is accepted so that register-produced subclasses can be instantiated without explicitly passing an extractor (the bound extractor is injected by __post_init__ in the subclass). Raises ValueError at extraction time if still None. Default: None.
- extensions : list of str or None, optional
File extensions this instance handles (e.g. [".abc"]). Used only by register to label the generated subclass; has no effect in single-instance usage. Default: None.
- reader_kwargs : dict, optional
Extra keyword arguments forwarded to extractor on every call. Default: {} (empty).
- default_source_type : SourceType, optional
Fallback source type for chunks where the extractor does not set "source_type". Default: UNKNOWN.
- default_section_type : SectionType, optional
Fallback section type for chunks where the extractor does not set "section_type". Default: TEXT.
- validate_file : bool, optional
When True (default), validate_input checks that input_file exists and is a regular file before extraction. Set to False for non-filesystem sources (network streams, in-memory paths) where input_file is a synthetic path. Default: True.
- chunker : ChunkerBase or None, optional
Inherited from DocumentReader.
- filter_ : FilterBase or None, optional
Inherited from DocumentReader.
- filename_override : str or None, optional
Inherited from DocumentReader.
- default_language : str or None, optional
Inherited from DocumentReader.
- Attributes:
- file_type : ClassVar[None]
Always None. CustomReader does not auto-register for any extension. Use register to create a registered subclass.
- Raises:
- TypeError
If extractor is not callable (and not None).
- ValueError
If any element of extensions does not start with '.' or ':'.
- ValueError
If extractor is None when get_raw_chunks is called.
- Parameters:
input_file (Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_uri (str | None)
custom_extractor (Any | None)
default_source_type (SourceType)
default_section_type (SectionType)
validate_file (bool)
See also
- CustomReader.register
Dynamically register a named subclass.
- normalize_extractor_output
Coerce extractor return values.
- scikitplot.corpus._readers.PDFReader
Built-in PDF reader with prefer_backend="custom" option.
- scikitplot.corpus._readers.ImageReader
Built-in image reader with backend="custom" option.
- scikitplot.corpus._base.DocumentReader
Abstract base class.
Notes
Extractor kwargs: reader_kwargs is forwarded as **reader_kwargs to the extractor. Use it to pass library-specific options (e.g. {"password": "hunter2"} for an encrypted PDF extractor, or {"language": "en"} for an ASR extractor).
Thread safety: CustomReader instances are not thread-safe. Create one instance per thread when parallelising.
Empty chunks: the downstream DefaultFilter discards whitespace-only chunks, consistent with all other readers. Empty strings returned by the extractor are silently skipped.
Examples
Plug in pdfplumber as a custom PDF backend:
>>> import pdfplumber
>>> from pathlib import Path
>>> from scikitplot.corpus._readers._custom import CustomReader
>>>
>>> def pdfplumber_extract(path, **kw):
...     with pdfplumber.open(path) as pdf:
...         return [
...             {"text": page.extract_text() or "", "page_number": i}
...             for i, page in enumerate(pdf.pages)
...         ]
>>>
>>> reader = CustomReader(
...     input_file=Path("report.pdf"),
...     extractor=pdfplumber_extract,
... )
>>> docs = list(reader.get_documents())
Register globally and use via factory:
>>> from scikitplot.corpus._base import DocumentReader
>>> from scikitplot.corpus._schema import SourceType
>>>
>>> CustomReader.register(
...     name="PdfPlumberReader",
...     extensions=[".pdf"],
...     extractor=pdfplumber_extract,
...     default_source_type=SourceType.RESEARCH,
... )
>>> reader = DocumentReader.create(Path("report.pdf"))
>>> docs = list(reader.get_documents())
Custom audio transcription (e.g. a proprietary ASR API):
>>> def my_asr(path, language="en", **kw):
...     # my_asr_client is a placeholder for your own ASR client object
...     result = my_asr_client.transcribe(path, lang=language)
...     return [
...         {"text": seg.text, "timecode_start": seg.start, "timecode_end": seg.end}
...         for seg in result.segments
...     ]
>>>
>>> CustomReader.register(
...     name="MyASRReader",
...     extensions=[".mp3", ".wav", ".flac"],
...     extractor=my_asr,
...     reader_kwargs={"language": "de"},
...     default_source_type=SourceType.PODCAST,
... )
Non-filesystem source (validate_file=False):
>>> def stream_extractor(path, **kw):
...     # path is a synthetic Path wrapping a stream identifier;
...     # Path normalises the '//' away, so str(path) is 'stream:/channel/42' on POSIX
...     data = fetch_from_stream(str(path))
...     return data.decode("utf-8")
>>>
>>> reader = CustomReader(
...     input_file=Path("stream://channel/42"),
...     extractor=stream_extractor,
...     validate_file=False,
... )
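The four accepted extractor return shapes (str, list[str], dict, list[dict]) can be pictured with a small stand-alone sketch. The helper name coerce_extractor_output is hypothetical and this is not the library's normalize_extractor_output; it only illustrates, under the rules stated in the Parameters and Raises sections, how each shape could collapse to a list of chunk dicts with a "text" key.

```python
from __future__ import annotations

from typing import Any


def coerce_extractor_output(output: Any) -> list[dict[str, Any]]:
    """Hypothetical stand-in for normalize_extractor_output (not library code)."""
    # Wrap the single-chunk shapes so every case becomes a list of items.
    if isinstance(output, (str, dict)):
        items = [output]
    elif isinstance(output, list):
        items = output
    else:
        raise TypeError(f"unsupported extractor output: {type(output).__name__}")

    chunks: list[dict[str, Any]] = []
    for item in items:
        if isinstance(item, str):
            chunks.append({"text": item})
        elif isinstance(item, dict):
            if "text" not in item:
                raise ValueError('chunk dict is missing the "text" key')
            chunks.append(dict(item))  # copy so the caller's dict is not mutated
        else:
            raise TypeError(f"unsupported chunk type: {type(item).__name__}")
    return chunks


print(coerce_extractor_output("whole file"))
print(coerce_extractor_output(["segment one", "segment two"]))
print(coerce_extractor_output({"text": "page text", "page_number": 0}))
```

Filling in the "section_type" and "source_type" defaults is left out here; the real reader applies those after normalisation.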
- chunker: ChunkerBase | None = None
Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).
- classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)
Instantiate the appropriate reader for one or more sources.
Accepts any mix of file paths, URL strings, and pathlib.Path objects, in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.
- Parameters:
- *inputs : pathlib.Path or str
One or more source paths or URL strings. Each element is classified independently:
  - str matching ^https?:// (case-insensitive): treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; wrapping collapses the double slash (https:// → https:/) and breaks URL detection.
  - str not matching the URL pattern: treated as a local file path and converted to pathlib.Path internally.
  - pathlib.Path: always treated as a local file path and dispatched by extension via the reader registry.
Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.
- chunker : ChunkerBase or None, optional
Chunker injected into every reader. Default: None.
- filter_ : FilterBase or None, optional
Filter injected into every reader. Default: None (DefaultFilter).
- filename_override : str or None, optional
Override the source_file label. Only applied when inputs contains exactly one source. Default: None.
- default_language : str or None, optional
ISO 639-1 language code applied to all sources. Default: None.
- source_type : SourceType, list[SourceType or None], or None, optional
Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean "infer from extension / URL". A single value is broadcast to all sources. Default: None.
- source_title : str or None, optional
Title propagated into every yielded document. Default: None.
- source_author : str or None, optional
Author propagated into every yielded document. Default: None.
- source_date : str or None, optional
ISO 8601 publication date. Default: None.
- collection_id : str or None, optional
Corpus collection identifier. Default: None.
- doi : str or None, optional
Digital Object Identifier (file sources only). Default: None.
- isbn : str or None, optional
ISBN (file sources only). Default: None.
- **kwargs : Any
Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).
- Returns:
- DocumentReader
A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element; it implements the same get_documents() interface and chains documents from all sub-readers in order.
- Raises:
- ValueError
If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.
- TypeError
If any element of inputs is not a str or pathlib.Path.
- Parameters:
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | list[SourceType | None] | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
Notes
URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly; just pass the URL string to create.
Per-source source_type: When passing multiple inputs with different media types, supply a list:
    DocumentReader.create(
        Path("podcast.mp3"),
        "report.pdf",
        "https://iris.who.int/.../content",  # returns image/jpeg
        source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
    )
Reader-specific kwargs (forwarded via **kwargs):
  - transcribe=True, whisper_model="small" → AudioReader, VideoReader
  - backend="easyocr" → ImageReader
  - prefer_backend="pypdf" → PDFReader
  - classify=True, classifier=fn → AudioReader
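The broadcast rule for source_type (one value for all inputs, or a list with one entry per input) can be sketched as follows. The helper broadcast_source_type is hypothetical, plain strings stand in for SourceType members, and raising ValueError on a length mismatch is an assumption; the text above only says the list should have the same length as inputs.

```python
from __future__ import annotations

from typing import Any


def broadcast_source_type(source_type: Any, n_inputs: int) -> list[Any]:
    """Hypothetical helper: expand source_type to one entry per input."""
    if isinstance(source_type, list):
        # Assumption: a list of the wrong length is rejected outright.
        if len(source_type) != n_inputs:
            raise ValueError(
                f"source_type has {len(source_type)} entries for {n_inputs} inputs"
            )
        return list(source_type)
    # A single value (including None) is repeated for every input.
    return [source_type] * n_inputs


print(broadcast_source_type("research", 3))
print(broadcast_source_type(["podcast", None, "image"], 3))
```

A None entry is kept as None so that a later step can infer the type from the extension or URL.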
Examples
Single file (backward-compatible):
>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())
URL string auto-detected, no from_url() call required:
>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )
Mixed multi-source batch:
>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
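The URL detection rule and the warning about wrapping URLs in pathlib.Path can be checked with the standard library alone. The sketch below uses a hypothetical classify_input helper (not part of the library) and PurePosixPath so the result does not depend on the host platform.

```python
import re
from pathlib import PurePath, PurePosixPath

# Same pattern the documentation describes: http:// or https://, case-insensitive.
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def classify_input(source) -> str:
    """Return "url" or "path" for one input (hypothetical helper)."""
    if isinstance(source, PurePath):
        return "path"  # Path objects are never treated as URLs
    if isinstance(source, str):
        return "url" if URL_PATTERN.match(source) else "path"
    raise TypeError(f"expected str or Path, got {type(source).__name__}")


url = "https://example.com/report.pdf"
wrapped = PurePosixPath(url)

print(classify_input(url))           # url
print(str(wrapped))                  # https:/example.com/report.pdf
print(classify_input(str(wrapped)))  # path
```

The second line of output shows the collapsed double slash, which is why a wrapped URL no longer matches the pattern.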
- custom_extractor: Any | None = None
User-supplied extraction callable that replaces get_raw_chunks entirely for this reader instance.
When set, _iter_raw_chunks calls custom_extractor(self.input_file, **custom_extractor_kwargs) and normalises the return value through normalize_extractor_output. The built-in get_raw_chunks implementation is not called.
This hook is available on every reader class (ALTOReader, TextReader, PDFReader, ImageReader, etc.) without any subclassing: simply pass a callable at construction time.
Examples
Override PDF extraction with pdfplumber for a single reader:
    import pdfplumber
    from pathlib import Path
    from scikitplot.corpus._base import DocumentReader

    def plumber_fn(path, **kw):
        with pdfplumber.open(path) as pdf:
            return [{"text": p.extract_text() or "", "page_number": i}
                    for i, p in enumerate(pdf.pages)]

    reader = DocumentReader.create(
        Path("report.pdf"),
        custom_extractor=plumber_fn,
    )
    docs = list(reader.get_documents())
- custom_extractor_kwargs: dict[str, Any]
Extra keyword arguments forwarded to custom_extractor on every invocation. Merged into the call as **custom_extractor_kwargs.
Examples
    reader = DocumentReader.create(
        Path("report.pdf"),
        custom_extractor=my_fn,
        custom_extractor_kwargs={"password": "s3cret", "pages": [0, 1, 2]},
    )
- default_language: str | None = None
ISO 639-1 language code to assign when the source has no language info.
- default_section_type: SectionType = 'text'
Fallback SectionType for chunks where the extractor does not set "section_type".
- default_source_type: SourceType = 'unknown'
Fallback SourceType for chunks where the extractor does not set "source_type".
- extensions: list[str] | None = None
Extensions this instance handles. Informational only for single-instance usage; meaningful for register, where it controls which extensions are wired into the DocumentReader registry.
- extractor: Callable[[...], Any] | None = None
User-supplied extraction callable. Accepts pathlib.Path plus any **reader_kwargs and must return a value normalizable by normalize_extractor_output. None is allowed here so that register-generated subclasses can be instantiated through the create factory without explicitly passing an extractor. Raises ValueError at extraction time if still None.
- property file_name: str
Effective filename used in document labels.
Returns filename_override when set; otherwise returns input_file.name.
- Returns:
- str
File name string (not a full path).
Examples
>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
- file_type: ClassVar[str | None] = None
Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".
For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.
- file_types: ClassVar[list[str] | None]
List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions, e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].
When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
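The precedence between file_type and file_types can be illustrated with a short sketch. The helper extensions_for and the three toy classes are hypothetical; the actual registration hook in DocumentReader may work differently.

```python
from __future__ import annotations


def extensions_for(reader_cls: type) -> list[str]:
    """Hypothetical helper: which extensions a reader class would claim."""
    file_types = getattr(reader_cls, "file_types", None)
    if file_types:
        # file_types wins whenever it is defined; file_type is then ignored.
        return [ext.lower() for ext in file_types]
    file_type = getattr(reader_cls, "file_type", None)
    return [file_type.lower()] if file_type else []


class TxtReader:
    file_type = ".txt"


class ImgReader:
    file_type = ".bmp"              # ignored because file_types is set
    file_types = [".png", ".jpg"]


class Unregistered:
    file_type = None                # like CustomReader: claims nothing


for cls in (TxtReader, ImgReader, Unregistered):
    print(cls.__name__, extensions_for(cls))
```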
- filter_: FilterBase | None = None
Filter applied after chunking. None triggers the DefaultFilter.
- classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)
Build a _MultiSourceReader from a manifest file.
The manifest is a text file with one source per line, either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.
- Parameters:
- manifest_path : pathlib.Path or str
Path to the manifest file. Supported formats:
  - .txt / .manifest: one source per line.
  - .json: a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).
- chunker : ChunkerBase or None, optional
Chunker applied to all sources. Default: None.
- filter_ : FilterBase or None, optional
Filter applied to all sources. Default: None.
- default_language : str or None, optional
ISO 639-1 language code. Default: None.
- source_type : SourceType or None, optional
Override source type for all sources. Default: None.
- source_title : str or None, optional
Override title for all sources. Default: None.
- source_author : str or None, optional
Override author for all sources. Default: None.
- source_date : str or None, optional
Override date for all sources. Default: None.
- collection_id : str or None, optional
Collection identifier. Default: None.
- doi : str or None, optional
DOI override. Default: None.
- isbn : str or None, optional
ISBN override. Default: None.
- encoding : str, optional
Text encoding for .txt manifests. Default: "utf-8".
- **kwargs : Any
Forwarded to each reader constructor.
- Returns:
- _MultiSourceReader
Multi-source reader chaining all manifest entries.
- Raises:
- ValueError
If manifest_path does not exist or is empty after filtering blank and comment lines.
- ValueError
If the manifest format is not recognised.
Notes
Per-entry overrides in JSON manifests: each entry may be an object with:
    {
        "source": "https://example.com/report.pdf",
        "source_type": "research",
        "source_title": "Annual Report 2024"
    }
String-level source_type values are coerced via SourceType(value) and an invalid value raises ValueError.
Examples
Text manifest sources.txt:
    # WHO corpus
    https://www.who.int/europe/news/item/...
    https://youtu.be/rwPISgZcYIk
    WHO-EURO-2025.pdf
    scan.jpg
Usage:
    reader = DocumentReader.from_manifest(
        Path("sources.txt"),
        collection_id="who-corpus",
    )
    docs = list(reader.get_documents())
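The manifest rules described for from_manifest (skip blank and # lines in text manifests; accept strings or objects with a "source" key in JSON manifests) can be sketched with the standard library. Both helper names are hypothetical and this is not the library's parser; in particular, stripping whitespace before testing for # is an assumption.

```python
from __future__ import annotations

import json


def parse_text_manifest(text: str) -> list[str]:
    """Hypothetical parser for a one-source-per-line manifest."""
    sources = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: indentation before '#' still counts as a comment line.
        if not stripped or stripped.startswith("#"):
            continue
        sources.append(stripped)
    if not sources:
        raise ValueError("manifest is empty after filtering blanks and comments")
    return sources


def parse_json_manifest(text: str) -> list[dict]:
    """Hypothetical parser: normalise every JSON entry to a dict with "source"."""
    entries = []
    for item in json.loads(text):
        if isinstance(item, str):
            entries.append({"source": item})
        elif isinstance(item, dict) and "source" in item:
            entries.append(dict(item))
        else:
            raise ValueError(f"unrecognised manifest entry: {item!r}")
    return entries


print(parse_text_manifest("# WHO corpus\n\nWHO-EURO-2025.pdf\nscan.jpg\n"))
print(parse_json_manifest('["scan.jpg", {"source": "a.pdf", "source_type": "research"}]'))
```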
- classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)
Instantiate the appropriate reader for a URL source.
Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.
- Parameters:
- url : str
Full URL string. Must start with http:// or https://.
- chunker : ChunkerBase or None, optional
Chunker to inject. Default: None.
- filter_ : FilterBase or None, optional
Filter to inject. Default: None (DefaultFilter).
- filename_override : str or None, optional
Override for the source_file label. Default: None.
- default_language : str or None, optional
ISO 639-1 language code. Default: None.
- source_type : SourceType or None, optional
Semantic label for the source. Default: None.
- source_title : str or None, optional
Title of the source work. Default: None.
- source_author : str or None, optional
Primary author. Default: None.
- source_date : str or None, optional
Publication date in ISO 8601 format. Default: None.
- collection_id : str or None, optional
Corpus collection identifier. Default: None.
- doi : str or None, optional
Digital Object Identifier. Default: None.
- isbn : str or None, optional
International Standard Book Number. Default: None.
- **kwargs : Any
Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).
- Returns:
- DocumentReader
YouTubeReader or WebReader instance.
- Raises:
- ValueError
If url does not start with http:// or https://.
- ImportError
If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).
- Parameters:
url (str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
Notes
Prefer create for new code. Passing a URL string to create automatically calls from_url, so you rarely need to call from_url directly.
Examples
>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())
- get_documents()
Yield validated CorpusDocument instances for the input file.
Orchestrates the full per-file pipeline:
1. validate_input: fail fast if file is missing.
2. get_raw_chunks: format-specific text extraction.
3. Chunker (if set): sub-segments each raw block.
4. CorpusDocument construction with validated schema.
5. Filter: discards noise documents.
- Yields:
- CorpusDocument
Validated documents that passed the filter.
- Raises:
- ValueError
If the input file is missing or the format is invalid.
- Return type:
Generator[CorpusDocument, None, None]
Notes
The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.
Omitted-document statistics are logged at INFO level after processing each file.
Examples
>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
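The chunk, index, and filter stages of the pipeline, and the way a single counter keeps (source_file, chunk_index) unique, can be sketched as below. The function iter_documents and its keep predicate are hypothetical, plain dicts stand in for CorpusDocument, and advancing the counter even for filtered-out pieces is an assumption that the text above does not confirm.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator


def iter_documents(
    source_file: str,
    raw_chunks: Iterable[dict],
    chunker: Callable[[str], list[str]] | None = None,
    keep: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> Iterator[dict]:
    """Hypothetical sketch of the chunk -> index -> filter stages."""
    chunk_index = 0  # one counter per file, shared by raw chunks and sub-chunks
    for raw in raw_chunks:
        pieces = chunker(raw["text"]) if chunker else [raw["text"]]
        for piece in pieces:
            doc = {**raw, "text": piece,
                   "source_file": source_file, "chunk_index": chunk_index}
            chunk_index += 1  # assumption: advances even if the filter drops it
            if keep(doc["text"]):
                yield doc


raw = [{"text": "one. two."}, {"text": "   "}, {"text": "three."}]
docs = list(iter_documents("corpus.txt", raw, chunker=lambda t: t.split(". ")))
print([d["chunk_index"] for d in docs])  # [0, 1, 3]
```

Index 2 belongs to the whitespace-only chunk that the filter discarded, so the surviving keys stay unique and ordered.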
- get_raw_chunks()
Call the user-supplied extractor and yield normalised raw chunk dicts.
Calls self.extractor(self.input_file, **self.reader_kwargs) and normalises the return value with normalize_extractor_output.
- Yields:
- dict
Each dict has at least {"text": str}, with "section_type" and "source_type" defaults filled in, plus any metadata returned by the extractor.
- Raises:
- ValueError
If extractor is None at call time.
- TypeError
If the extractor returns an unsupported type.
- ValueError
If any dict returned by the extractor lacks a "text" key.
- RuntimeError
If the extractor raises an unexpected exception. The original exception is chained via from.
Notes
Logging at INFO level records the extractor name, file name, and chunk count. DEBUG records the kwargs forwarded.
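The error contract of get_raw_chunks (ValueError when no extractor is set, RuntimeError chained from the original exception when the extractor fails) can be pictured with this sketch. The wrapper call_extractor is hypothetical and only covers the extractor call itself, not the normalisation step.

```python
def call_extractor(extractor, path, **reader_kwargs):
    """Hypothetical wrapper showing the documented error contract."""
    if extractor is None:
        raise ValueError("no extractor configured for this reader")
    try:
        return extractor(path, **reader_kwargs)
    except Exception as exc:
        # "from exc" keeps the original error reachable as __cause__.
        raise RuntimeError(f"extractor failed for {path}") from exc


def broken_extractor(path, **kw):
    raise KeyError("missing page table")


try:
    call_extractor(broken_extractor, "report.xyz")
except RuntimeError as err:
    print(type(err.__cause__).__name__)  # KeyError
```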
- input_file: Path
Path to the source file.
For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.
- reader_kwargs: dict[str, Any]
Extra keyword arguments forwarded to extractor on every call.
- classmethod register(*, name, extensions, extractor, reader_kwargs=None, default_source_type=SourceType.UNKNOWN, default_section_type=SectionType.TEXT, validate_file=True)
Create a named CustomReader subclass and register it by extension.
After calling register, DocumentReader.create automatically dispatches files with any of the given extensions to extractor.
- Parameters:
- name : str
Class name for the generated subclass (e.g. "PdfPlumberReader"). Must be a valid Python identifier.
- extensions : list of str
File extensions to register (e.g. [".pdf"]). Each must start with '.' (file extension) or ':' (URL-scheme key). Existing registrations for these extensions emit a warning and are replaced, consistent with the base-class registry behaviour.
- extractor : callable
Extraction callable. Signature:
    def extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput
- reader_kwargs : dict or None, optional
Default keyword arguments forwarded to extractor. Instance-level reader_kwargs (passed directly to the constructor) are merged on top: instance kwargs override registered defaults. Default: None, equivalent to {} (empty).
- default_source_type : SourceType, optional
Source type applied to chunks that do not set "source_type". Default: UNKNOWN.
- default_section_type : SectionType, optional
Section type applied to chunks that do not set "section_type". Default: TEXT.
- validate_file : bool, optional
When False, skip the filesystem existence check. Default: True.
- Returns:
- type[CustomReader]
The newly created and registered subclass. The caller can keep a reference to it for type-checking or documentation, but it is not required; the subclass is also stored in _registry.
- Raises:
- ValueError
If name is not a valid Python identifier.
- ValueError
If extensions is empty or any element has an invalid prefix.
- TypeError
If extractor is not callable.
Notes
Subclass lifetime: each call to register creates a new class object. Calling register again with the same name produces a distinct class object. The last registration for a given extension wins (matching the general registry policy).
reader_kwargs merging: instance-level kwargs (passed when constructing the reader) are merged on top of the registered defaults:
    # Registered defaults: {"language": "en"}
    # Instance override:   {"language": "de"}
    reader = DocumentReader.create(
        Path("file.mp3"),
        reader_kwargs={"language": "de"},  # forwarded via **kwargs
    )
    # extractor receives language="de"
Type annotation: the returned class is typed as type[CustomReader]. If you need the precise subclass type, assign it directly:
    MyReader = CustomReader.register(name="MyReader", ...)
Examples
Register a pdfplumber-based PDF reader:
>>> import pdfplumber
>>> from pathlib import Path
>>> from scikitplot.corpus._base import DocumentReader
>>> from scikitplot.corpus._readers._custom import CustomReader
>>> from scikitplot.corpus._schema import SourceType
>>>
>>> def pdfplumber_extract(path, **kw):
...     with pdfplumber.open(path) as pdf:
...         return [
...             {"text": p.extract_text() or "", "page_number": i}
...             for i, p in enumerate(pdf.pages)
...         ]
>>>
>>> PdfPlumberReader = CustomReader.register(
...     name="PdfPlumberReader",
...     extensions=[".pdf"],
...     extractor=pdfplumber_extract,
...     default_source_type=SourceType.RESEARCH,
... )
>>> docs = list(DocumentReader.create(Path("paper.pdf")).get_documents())
Register a multi-extension audio reader using a proprietary API:
>>> MyASRReader = CustomReader.register(
...     name="MyASRReader",
...     extensions=[".mp3", ".wav", ".flac"],
...     extractor=my_asr_fn,
...     reader_kwargs={"model": "large-v3", "language": "en"},
...     default_source_type=SourceType.PODCAST,
... )
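The reader_kwargs merging rule from the Notes (instance-level kwargs override registered defaults) amounts to a dict merge in which the instance values are applied last. The helper merge_reader_kwargs below is a hypothetical illustration, not the library's code.

```python
def merge_reader_kwargs(registered, instance):
    """Hypothetical helper: instance kwargs override registered defaults."""
    # Later unpacking wins, so instance values replace same-named defaults.
    return {**(registered or {}), **(instance or {})}


defaults = {"model": "large-v3", "language": "en"}
merged = merge_reader_kwargs(defaults, {"language": "de"})
print(merged)    # {'model': 'large-v3', 'language': 'de'}
print(defaults)  # {'model': 'large-v3', 'language': 'en'}
```

Building a new dict leaves the registered defaults untouched, so one reader's override cannot leak into the next.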
- source_provenance: dict[str, Any]
Provenance overrides propagated into every yielded CorpusDocument.
Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.
- source_uri: str | None = None
Original URI for URL-based readers (web pages, YouTube videos).
Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.
Examples
>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
- classmethod subclass_by_type()
Return a copy of the extension → reader class registry.
- Returns:
- dict
Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.
Examples
>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
- classmethod supported_types()
Return a sorted list of file extensions supported by registered readers.
- Returns:
- list of str
Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].
Examples
>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
- validate_file: bool = True
When False, skip the filesystem existence check in validate_input. Use for non-filesystem resources where input_file is a synthetic path.
- validate_input()
Check source accessibility.
Delegates to the parent implementation when validate_file is True; skips the filesystem check entirely when it is False (for non-filesystem sources).
- Raises:
- ValueError
If validate_file is True and the file does not exist or is not a regular file.
- Return type:
None
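The behaviour of validate_input under both validate_file settings can be sketched as a free function. This is a hypothetical stand-in for the method, assuming the parent check reduces to "exists and is a regular file"; the error message is illustrative.

```python
import tempfile
from pathlib import Path


def validate_input(input_file: Path, validate_file: bool = True) -> None:
    """Hypothetical stand-in for CustomReader.validate_input."""
    if not validate_file:
        return  # synthetic path: skip the filesystem check entirely
    # is_file() is False for missing paths and for directories alike.
    if not input_file.is_file():
        raise ValueError(f"not an existing regular file: {input_file}")


# A real temporary file passes the check silently.
with tempfile.NamedTemporaryFile(suffix=".xyz") as handle:
    validate_input(Path(handle.name))

# A synthetic identifier is accepted only when the check is disabled.
validate_input(Path("stream-channel-42"), validate_file=False)

try:
    validate_input(Path("stream-channel-42"))
except ValueError as err:
    print("rejected:", err)
```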