ZipReader#
- class scikitplot.corpus.ZipReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, max_files=10000, max_total_bytes=2147483648, skip_unsupported=True, infer_source_type=True, reader_kwargs=<factory>)[source]#
Generic ZIP archive reader — dispatches each member to its natural reader.
Extracts all supported members from a
.ziparchive into a temporary directory, then callsDocumentReader.createon each file. Documents from all members are yielded in a single stream.This reader intentionally overrides
ALTOReader's".zip"registration. To useALTOReaderdirectly, instantiate it explicitly.- Parameters:
- input_filepathlib.Path
Path to the
.ziparchive.- max_filesint, optional
Maximum number of files allowed in the archive. Archives exceeding this limit raise
ValueErrorbefore extraction begins. Default: 10,000.- max_total_bytesint, optional
Maximum cumulative uncompressed size. Extraction halts if this limit is exceeded (zip-bomb prevention). Default: 2 GB.
- skip_unsupportedbool, optional
When
True(default), members whose extension is not registered inDocumentReaderare silently skipped. WhenFalse, an unregistered extension raisesValueError.
- Parameters:
input_file (Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_uri (str | None)
max_files (int)
max_total_bytes (int)
skip_unsupported (bool)
infer_source_type (bool)
Notes
ALTOReader coexistence:
ALTOReaderregisters on".zip"and is used whenZipReaderis not imported. Importingscikitplot.corpus._readerstriggers both imports;ZipReaderis imported afterALTOReader, so its".zip"registration wins. To useALTOReaderon a known ALTO archive, instantiate it directly:reader = ALTOReader(input_file=Path("alto_corpus.zip"))
Temporary directory: Extracted files land in
tempfile.mkdtemp(). The directory is deleted automatically whenget_documents()completes (including on exception). If you iterate manually without exhausting the generator, callreader.close()to clean up.Reader kwargs forwarding: All constructor kwargs beyond the explicit parameters are forwarded to each sub-reader constructed for the members. This means you can pass
transcribe=Trueand it will reach anyAudioReaderorVideoReaderinstances created for audio/video members.Examples
Any ZIP containing a mix of supported files:
>>> from pathlib import Path >>> from scikitplot.corpus._base import DocumentReader >>> import scikitplot.corpus._readers >>> reader = DocumentReader.create(Path("corpus.zip")) >>> type(reader).__name__ \'ZipReader\' >>> docs = list(reader.get_documents())
Explicit instantiation with custom limits:
>>> reader = ZipReader( ... input_file=Path("large_corpus.zip"), ... max_files=500, ... max_total_bytes=500 * 1024 * 1024, ... )
- chunker: ChunkerBase | None = None#
Chunker to apply to each raw text block.
Nonemeans each raw chunk is used as-is (one CorpusDocument per raw chunk).
- classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#
Instantiate the appropriate reader for one or more sources.
Accepts any mix of file paths, URL strings, and
pathlib.Pathobjects — in any order. URL strings (those starting withhttp://orhttps://) are automatically detected and routed tofrom_url; everything else is treated as a local file path and dispatched by extension via the registry.- Parameters:
- *inputspathlib.Path or str
One or more source paths or URL strings. Each element is classified independently:
strmatching^https?://(case-insensitive) — treated as a URL and routed tofrom_url. Must be passed as a plain ``str``, not wrapped inpathlib.Path; wrapping collapses the double-slash (https://→https:/) and breaks URL detection.strnot matching the URL pattern — treated as a local file path and converted topathlib.Pathinternally.pathlib.Path— always treated as a local file path and dispatched by extension via the reader registry.
Pass a single value for the common case; pass multiple values to get a
_MultiSourceReaderthat chains all their documents in order.- chunkerChunkerBase or None, optional
Chunker injected into every reader. Default:
None.- filter_FilterBase or None, optional
Filter injected into every reader. Default:
None(DefaultFilter).- filename_overridestr or None, optional
Override the
source_filelabel. Only applied when inputs contains exactly one source. Default:None.- default_languagestr or None, optional
ISO 639-1 language code applied to all sources. Default:
None.- source_typeSourceType, list[SourceType or None], or None, optional
Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source;
Noneentries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default:None.- source_titlestr or None, optional
Title propagated into every yielded document. Default:
None.- source_authorstr or None, optional
Author propagated into every yielded document. Default:
None.- source_datestr or None, optional
ISO 8601 publication date. Default:
None.- collection_idstr or None, optional
Corpus collection identifier. Default:
None.- doistr or None, optional
Digital Object Identifier (file sources only). Default:
None.- isbnstr or None, optional
ISBN (file sources only). Default:
None.- **kwargsAny
Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g.
transcribe=TrueforAudioReader,backend="easyocr"forImageReader).
- Returns:
- DocumentReader
A single reader when inputs has exactly one element (backward compatible with every existing call site). A
_MultiSourceReaderwhen inputs has more than one element — it implements the sameget_documents()interface and chains documents from all sub-readers in order.
- Raises:
- ValueError
If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.
- TypeError
If any element of inputs is not a
strorpathlib.Path.
- Parameters:
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | list[SourceType | None] | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
- Return type:
Notes
URL auto-detection: A
strelement is treated as a URL when it matches^https?://(case-insensitive). All other strings and allpathlib.Pathobjects are treated as local file paths. This means you no longer need to callfrom_urlexplicitly — just pass the URL string tocreate.Per-source source_type: When passing multiple inputs with different media types, supply a list:
DocumentReader.create( Path("podcast.mp3"), "report.pdf", "https://iris.who.int/.../content", # returns image/jpeg source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE], )
Reader-specific kwargs (forwarded via
**kwargs):transcribe=True,whisper_model="small"→AudioReader,VideoReaderbackend="easyocr"→ImageReaderprefer_backend="pypdf"→PDFReaderclassify=True,classifier=fn→AudioReader
Examples
Single file (backward-compatible):
>>> reader = DocumentReader.create(Path("hamlet.txt")) >>> docs = list(reader.get_documents())
URL string auto-detected — no from_url() call required:
>>> reader = DocumentReader.create( ... "https://en.wikipedia.org/wiki/Python_(programming_language)" ... )
Mixed multi-source batch:
>>> reader = DocumentReader.create( ... Path("podcast.mp3"), ... "report.pdf", ... "https://iris.who.int/api/bitstreams/abc/content", ... source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE], ... ) >>> docs = list(reader.get_documents()) # chained stream from all three
- default_language: str | None = None#
ISO 639-1 language code to assign when the source has no language info.
- property file_name: str#
Effective filename used in document labels.
Returns
filename_overridewhen set; otherwise returnsinput_file.name.- Returns:
- str
File name string (not a full path).
Examples
>>> from pathlib import Path >>> reader = TextReader(input_file=Path("/data/corpus.txt")) >>> reader.file_name 'corpus.txt'
- file_type: ClassVar[str] = '.zip'#
Single file extension this reader handles (lowercase, including leading dot). E.g.
".txt",".xml",".zip".For readers that handle multiple extensions, define
file_types(plural) instead. Exactly one offile_typeorfile_typesmust be defined on every concrete subclass.
- file_types: ClassVar[list[str] | None] = ['.zip']#
List of file extensions this reader handles (lowercase, leading dot). Use instead of
file_typewhen a single reader class should be registered for several extensions — e.g. an image reader for[".png", ".jpg", ".jpeg", ".gif", ".webp"].When both
file_typeandfile_typesare defined on the same class,file_typestakes precedence andfile_typeis ignored.
- filter_: FilterBase | None = None#
Filter applied after chunking.
Nonetriggers theDefaultFilter.
- classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)[source]#
Build a
_MultiSourceReaderfrom a manifest file.The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with
#are ignored. JSON manifests (a list of strings or objects) are also supported.- Parameters:
- manifest_pathpathlib.Path or str
Path to the manifest file. Supported formats:
.txt/.manifest— one source per line..json— a JSON array of strings (sources) or objects with at least a"source"key (and optional"source_type","source_title"per-entry overrides).
- chunkerChunkerBase or None, optional
Chunker applied to all sources. Default:
None.- filter_FilterBase or None, optional
Filter applied to all sources. Default:
None.- default_languagestr or None, optional
ISO 639-1 language code. Default:
None.- source_typeSourceType or None, optional
Override source type for all sources. Default:
None.- source_titlestr or None, optional
Override title for all sources. Default:
None.- source_authorstr or None, optional
Override author for all sources. Default:
None.- source_datestr or None, optional
Override date for all sources. Default:
None.- collection_idstr or None, optional
Collection identifier. Default:
None.- doistr or None, optional
DOI override. Default:
None.- isbnstr or None, optional
ISBN override. Default:
None.- encodingstr, optional
Text encoding for
.txtmanifests. Default:"utf-8".- **kwargsAny
Forwarded to each reader constructor.
- Returns:
- _MultiSourceReader
Multi-source reader chaining all manifest entries.
- Raises:
- ValueError
If manifest_path does not exist or is empty after filtering blank and comment lines.
- ValueError
If the manifest format is not recognised.
- Parameters:
- Return type:
Notes
Per-entry overrides in JSON manifests: each entry may be an object with:
{ "source": "https://example.com/report.pdf", "source_type": "research", "source_title": "Annual Report 2024", }
String-level
source_typevalues are coerced viaSourceType(value)and an invalid value raisesValueError.Examples
Text manifest
sources.txt:# WHO corpus https://www.who.int/europe/news/item/... https://youtu.be/rwPISgZcYIk WHO-EURO-2025.pdf scan.jpg
Usage:
reader = DocumentReader.from_manifest( Path("sources.txt"), collection_id="who-corpus", ) docs = list(reader.get_documents())
- classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#
Instantiate the appropriate reader for a URL source.
Dispatches to
YouTubeReaderfor YouTube URLs and toWebReaderfor all otherhttp:///https://URLs.- Parameters:
- urlstr
Full URL string. Must start with
http://orhttps://.- chunkerChunkerBase or None, optional
Chunker to inject. Default:
None.- filter_FilterBase or None, optional
Filter to inject. Default:
None(DefaultFilter).- filename_overridestr or None, optional
Override for the
source_filelabel. Default:None.- default_languagestr or None, optional
ISO 639-1 language code. Default:
None.- source_typeSourceType or None, optional
Semantic label for the source. Default:
None.- source_titlestr or None, optional
Title of the source work. Default:
None.- source_authorstr or None, optional
Primary author. Default:
None.- source_datestr or None, optional
Publication date in ISO 8601 format. Default:
None.- collection_idstr or None, optional
Corpus collection identifier. Default:
None.- doistr or None, optional
Digital Object Identifier. Default:
None.- isbnstr or None, optional
International Standard Book Number. Default:
None.- **kwargsAny
Additional kwargs forwarded to the reader constructor (e.g.
include_auto_generated=FalseforYouTubeReader).
- Returns:
- DocumentReader
YouTubeReaderorWebReaderinstance.
- Raises:
- ValueError
If
urldoes not start withhttp://orhttps://.- ImportError
If the required reader class is not registered (i.e.
scikitplot.corpus._readershas not been imported yet).
- Parameters:
url (str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
- Return type:
Notes
Prefer :meth:`create` for new code. Passing a URL string to
createautomatically callsfrom_url— you rarely need to callfrom_urldirectly.Examples
>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python") >>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk") >>> docs = list(yt.get_documents())
- get_documents()[source]#
Yield validated
CorpusDocumentinstances for the input file.Orchestrates the full per-file pipeline:
validate_input— fail fast if file is missing.get_raw_chunks— format-specific text extraction.Chunker (if set) — sub-segments each raw block.
CorpusDocumentconstruction with validated schema.Filter — discards noise documents.
- Yields:
- CorpusDocument
Validated documents that passed the filter.
- Raises:
- ValueError
If the input file is missing or the format is invalid.
- Return type:
Generator[CorpusDocument, None, None]
Notes
The global
chunk_indexcounter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that(source_file, chunk_index)is a unique key within one reader run.Omitted-document statistics are logged at INFO level after processing each file.
Examples
>>> from pathlib import Path >>> reader = DocumentReader.create(Path("corpus.txt")) >>> docs = list(reader.get_documents()) >>> all(isinstance(d, CorpusDocument) for d in docs) True
- get_raw_chunks()[source]#
Extract ZIP and yield raw chunks from all supported members.
Each member is dispatched to
DocumentReader.create, which selects the appropriate reader by file extension. The member’s raw chunks are yielded inline, as if the member files had been passed directly.- Yields:
- dict[str, Any]
Raw chunk dicts from each member reader’s
get_raw_chunks()call. Thesource_filekey is set to"<archive_name>/<member_name>"for provenance.
- Raises:
- ValueError
If the archive contains more than
max_filesmembers, or if cumulative extracted size exceedsmax_total_bytes, or if a member has a path-traversal component (ZipSlip).- OSError
If the archive cannot be opened or a member cannot be read.
- Return type:
Notes
Temporary extraction happens inside a
tempfile.mkdtemp()directory that is removed on exit (even on exception) via atry/finallyblock. The extracted files are read and their chunks forwarded; the files themselves are not streamed — each member is fully extracted before its reader is called.
- infer_source_type: bool = True#
Auto-infer
source_typefor each member viaSourceType.inferwhen the caller did not supplysource_typeinsource_provenance. Default:True.
- input_file: Path[source]#
Path to the source file.
For URL-based readers (
WebReader,YouTubeReader), passpathlib.Path(url_string)here and setsource_urito the original URL string.validate_input()is overridden in those subclasses to skip the file-existence check.
- reader_kwargs: dict[str, dict[str, Any]][source]#
Per-extension keyword arguments forwarded to sub-reader constructors.
Enables reader-specific options for individual file types inside the archive. Keys are lower-case file extensions (with leading dot), values are dicts of kwargs forwarded to the corresponding reader.
Example — transcribe MP3 files but not others:
ZipReader( input_file=Path("corpus.zip"), reader_kwargs={ ".mp3": {"transcribe": True, "whisper_model": "small"}, ".mp4": {"transcribe": True}, ".jpg": {"backend": "easyocr"}, }, )
Global kwargs passed to the
ZipReaderconstructor (**kwargs) are merged under per-extension overrides, so per-extension values always win.
- source_provenance: dict[str, Any][source]#
Provenance overrides propagated into every yielded
CorpusDocument.Keys may include
"source_type","source_title","source_author", and"collection_id". Populated bycreate/from_urlfrom their keyword arguments.
- source_uri: str | None = None#
Original URI for URL-based readers (web pages, YouTube videos).
Set this to the full URL string when
input_fileis a syntheticpathlib.Pathwrapping a URL. File-based readers leave thisNone.Examples
>>> reader = WebReader( ... input_file=Path("https://example.com/article"), ... source_uri="https://example.com/article", ... )
- classmethod subclass_by_type()[source]#
Return a copy of the extension → reader class registry.
- Returns:
- dict
Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.
- Return type:
Examples
>>> registry = DocumentReader.subclass_by_type() >>> ".txt" in registry True
- classmethod supported_types()[source]#
Return a sorted list of file extensions supported by registered readers.
- Returns:
- list of str
Lowercase file extensions, each including the leading dot. E.g.
['.pdf', '.txt', '.xml', '.zip'].
- Return type:
Examples
>>> DocumentReader.supported_types() ['.pdf', '.txt', '.xml', '.zip']
- validate_input()[source]#
Assert that the input file exists and is readable.
- Raises:
- ValueError
If
input_filedoes not exist or is not a regular file.
- Return type:
None
Notes
Called automatically by
get_documentsbefore iterating. Can also be called eagerly after construction to fail fast.Examples
>>> reader = DocumentReader.create(Path("missing.txt")) >>> reader.validate_input() Traceback (most recent call last): ... ValueError: Input file does not exist: missing.txt
Gallery examples#
corpus WHO European Region local .zip with examples