ZipReader#

class scikitplot.corpus.ZipReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, max_files=10000, max_total_bytes=2147483648, skip_unsupported=True, infer_source_type=True, reader_kwargs=<factory>)[source]#

Generic ZIP archive reader — dispatches each member to its natural reader.

Extracts all supported members from a .zip archive into a temporary directory, then calls DocumentReader.create on each file. Documents from all members are yielded in a single stream.

This reader intentionally overrides ALTOReader's ".zip" registration. To use ALTOReader directly, instantiate it explicitly.

Parameters:
input_filepathlib.Path

Path to the .zip archive.

max_filesint, optional

Maximum number of files allowed in the archive. Archives exceeding this limit raise ValueError before extraction begins. Default: 10,000.

max_total_bytesint, optional

Maximum cumulative uncompressed size. Extraction halts if this limit is exceeded (zip-bomb prevention). Default: 2 GB.

skip_unsupportedbool, optional

When True (default), members whose extension is not registered in DocumentReader are silently skipped. When False, an unregistered extension raises ValueError.

Notes

ALTOReader coexistence: ALTOReader registers on ".zip" and is used when ZipReader is not imported. Importing scikitplot.corpus._readers triggers both imports; ZipReader is imported after ALTOReader, so its ".zip" registration wins. To use ALTOReader on a known ALTO archive, instantiate it directly:

reader = ALTOReader(input_file=Path("alto_corpus.zip"))
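The "last registration wins" behavior can be pictured as a plain dict keyed by extension; this sketch uses stand-in strings rather than the real reader classes:

```python
# Illustrative only: the registry maps extensions to reader classes,
# and a later registration for the same key overwrites the earlier one.
registry: dict[str, str] = {}
registry[".zip"] = "ALTOReader"  # registered when ALTOReader is imported
registry[".zip"] = "ZipReader"   # imported afterwards, so it wins
assert registry[".zip"] == "ZipReader"
```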

Temporary directory: Extracted files land in tempfile.mkdtemp(). The directory is deleted automatically when get_documents() completes (including on exception). If you iterate manually without exhausting the generator, call reader.close() to clean up.
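The cleanup contract can be sketched with a plain generator and a temporary directory (stand-in names; the real reader yields CorpusDocument instances):

```python
import os
import shutil
import tempfile

def extract_and_yield():
    """Sketch of the documented cleanup: the temp dir is removed when the
    generator finishes, raises, or is closed explicitly."""
    tmpdir = tempfile.mkdtemp()
    try:
        yield tmpdir       # stand-in for the first document
        yield "second-doc"
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)

gen = extract_and_yield()
tmpdir = next(gen)             # partial iteration: directory still exists
assert os.path.isdir(tmpdir)
gen.close()                    # analogous to reader.close(): runs the finally block
assert not os.path.isdir(tmpdir)
```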

Reader kwargs forwarding: All constructor kwargs beyond the explicit parameters are forwarded to each sub-reader constructed for the members. This means you can pass transcribe=True and it will reach any AudioReader or VideoReader instances created for audio/video members.

Examples

Any ZIP containing a mix of supported files:

>>> from pathlib import Path
>>> from scikitplot.corpus._base import DocumentReader
>>> import scikitplot.corpus._readers
>>> reader = DocumentReader.create(Path("corpus.zip"))
>>> type(reader).__name__
'ZipReader'
>>> docs = list(reader.get_documents())

Explicit instantiation with custom limits:

>>> reader = ZipReader(
...     input_file=Path("large_corpus.zip"),
...     max_files=500,
...     max_total_bytes=500 * 1024 * 1024,
... )
chunker: ChunkerBase | None = None#

Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for one or more sources.

Accepts any mix of file paths, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:
*inputspathlib.Path or str

One or more source paths or URL strings. Each element is classified independently:

  • str matching ^https?:// (case-insensitive) — treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; Path normalization collapses the double slash (https:// becomes https:/) and breaks URL detection.

  • str not matching the URL pattern — treated as a local file path and converted to pathlib.Path internally.

  • pathlib.Path — always treated as a local file path and dispatched by extension via the reader registry.

Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.

chunkerChunkerBase or None, optional

Chunker injected into every reader. Default: None.

filter_FilterBase or None, optional

Filter injected into every reader. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override the source_file label. Only applied when inputs contains exactly one source. Default: None.

default_languagestr or None, optional

ISO 639-1 language code applied to all sources. Default: None.

source_typeSourceType, list[SourceType or None], or None, optional

Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.

source_titlestr or None, optional

Title propagated into every yielded document. Default: None.

source_authorstr or None, optional

Author propagated into every yielded document. Default: None.

source_datestr or None, optional

ISO 8601 publication date. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier (file sources only). Default: None.

isbnstr or None, optional

ISBN (file sources only). Default: None.

**kwargsAny

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:
DocumentReader

A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:
ValueError

If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.

TypeError

If any element of inputs is not a str or pathlib.Path.

Parameters:
  • inputs (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | list[SourceType | None] | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.
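The classification rule can be sketched as follows (the regex name and classify function are illustrative, not the library's internals):

```python
import re
from pathlib import Path

_URL_RE = re.compile(r"^https?://", re.IGNORECASE)

def classify(source):
    """Return 'url' or 'file' following the documented rule."""
    if isinstance(source, str) and _URL_RE.match(source):
        return "url"
    return "file"

assert classify("https://example.com/page") == "url"
assert classify("HTTPS://EXAMPLE.COM") == "url"        # case-insensitive
assert classify("report.pdf") == "file"
# A Path is always treated as a local file path; note that wrapping a URL
# in Path would also have collapsed its '//' during normalization:
assert classify(Path("https://example.com/page")) == "file"
```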

Per-source source_type: When passing multiple inputs with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)

Reader-specific kwargs: extra keyword arguments are forwarded verbatim via **kwargs to each concrete reader constructor.

Examples

Single file (backward-compatible):

>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
default_language: str | None = None#

ISO 639-1 language code to assign when the source has no language info.

property file_name: str#

Effective filename used in document labels.

Returns filename_override when set; otherwise returns input_file.name.

Returns:
str

File name string (not a full path).

Examples

>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
file_type: ClassVar[str] = '.zip'#

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None] = ['.zip']#

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
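The documented precedence can be sketched as a small resolver over class attributes (illustrative stub classes; not the registry's real code):

```python
def extensions_for(reader_cls):
    """file_types wins over file_type when both are defined, as documented."""
    exts = getattr(reader_cls, "file_types", None)
    if exts:
        return [e.lower() for e in exts]
    return [reader_cls.file_type.lower()]

class ImageReaderStub:
    file_types = [".png", ".jpg", ".jpeg"]

class ZipReaderStub:
    file_type = ".zip"
    file_types = [".zip"]

assert extensions_for(ImageReaderStub) == [".png", ".jpg", ".jpeg"]
assert extensions_for(ZipReaderStub) == [".zip"]   # file_types takes precedence
```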

filename_override: str | None = None#

Override for the source_file label in generated documents.

filter_: FilterBase | None = None#

Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)[source]#

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.

Parameters:
manifest_pathpathlib.Path or str

Path to the manifest file. Supported formats:

  • .txt / .manifest — one source per line.

  • .json — a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).

chunkerChunkerBase or None, optional

Chunker applied to all sources. Default: None.

filter_FilterBase or None, optional

Filter applied to all sources. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Override source type for all sources. Default: None.

source_titlestr or None, optional

Override title for all sources. Default: None.

source_authorstr or None, optional

Override author for all sources. Default: None.

source_datestr or None, optional

Override date for all sources. Default: None.

collection_idstr or None, optional

Collection identifier. Default: None.

doistr or None, optional

DOI override. Default: None.

isbnstr or None, optional

ISBN override. Default: None.

encodingstr, optional

Text encoding for .txt manifests. Default: "utf-8".

**kwargsAny

Forwarded to each reader constructor.

Returns:
_MultiSourceReader

Multi-source reader chaining all manifest entries.

Raises:
ValueError

If manifest_path does not exist or is empty after filtering blank and comment lines.

ValueError

If the manifest format is not recognised.

Parameters:
  • manifest_path (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • encoding (str)

  • kwargs (Any)

Return type:

_MultiSourceReader

Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024",
}

String source_type values are coerced via SourceType(value); an invalid value raises ValueError.
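The coercion behaves like a standard str-valued Enum lookup; this stand-in mirrors the documented behavior (member names and values here are assumptions, not the real enum):

```python
from enum import Enum

class SourceType(str, Enum):   # stand-in, not the real class
    RESEARCH = "research"
    PODCAST = "podcast"

assert SourceType("research") is SourceType.RESEARCH
try:
    SourceType("blog-post-typo")
except ValueError as exc:
    print(exc)  # an invalid value raises ValueError, as documented
```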

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
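A parser matching the documented manifest formats might look like this (a sketch under stated assumptions; parse_manifest is not part of the public API):

```python
import json
from pathlib import Path

def parse_manifest(path, encoding="utf-8"):
    """Return the list of sources from a .txt/.manifest or .json manifest."""
    p = Path(path)
    if p.suffix == ".json":
        entries = json.loads(p.read_text(encoding=encoding))
        # entries are plain strings or objects with at least a "source" key
        return [e["source"] if isinstance(e, dict) else e for e in entries]
    sources = []
    for line in p.read_text(encoding=encoding).splitlines():
        line = line.strip()
        if line and not line.startswith("#"):   # skip blanks and comments
            sources.append(line)
    return sources
```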
classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.

Parameters:
urlstr

Full URL string. Must start with http:// or https://.

chunkerChunkerBase or None, optional

Chunker to inject. Default: None.

filter_FilterBase or None, optional

Filter to inject. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override for the source_file label. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Semantic label for the source. Default: None.

source_titlestr or None, optional

Title of the source work. Default: None.

source_authorstr or None, optional

Primary author. Default: None.

source_datestr or None, optional

Publication date in ISO 8601 format. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier. Default: None.

isbnstr or None, optional

International Standard Book Number. Default: None.

**kwargsAny

Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:
DocumentReader

YouTubeReader or WebReader instance.

Raises:
ValueError

If url does not start with http:// or https://.

ImportError

If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).

Parameters:
  • url (str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url — you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())
get_documents()[source]#

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

  1. validate_input — fail fast if file is missing.

  2. get_raw_chunks — format-specific text extraction.

  3. Chunker (if set) — sub-segments each raw block.

  4. CorpusDocument construction with validated schema.

  5. Filter — discards noise documents.
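Under stated assumptions (stand-in helpers, simplified document dicts), steps 2-5 can be sketched as:

```python
def pipeline(raw_chunks, chunker=None, keep=lambda doc: True):
    """Illustrative sketch of the documented pipeline; not the real code."""
    chunk_index = 0
    for raw in raw_chunks:                            # step 2: raw extraction
        pieces = chunker(raw) if chunker else [raw]   # step 3: optional chunker
        for piece in pieces:
            doc = {"text": piece, "chunk_index": chunk_index}  # step 4
            chunk_index += 1              # monotone across all sub-chunks
            if keep(doc):                 # step 5: filter discards noise
                yield doc

docs = list(pipeline(["alpha beta", "gamma"], chunker=str.split))
assert [d["chunk_index"] for d in docs] == [0, 1, 2]
```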

Yields:
CorpusDocument

Validated documents that passed the filter.

Raises:
ValueError

If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.

Omitted-document statistics are logged at INFO level after processing each file.

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
get_raw_chunks()[source]#

Extract ZIP and yield raw chunks from all supported members.

Each member is dispatched to DocumentReader.create, which selects the appropriate reader by file extension. The member’s raw chunks are yielded inline, as if the member files had been passed directly.

Yields:
dict[str, Any]

Raw chunk dicts from each member reader’s get_raw_chunks() call. The source_file key is set to "<archive_name>/<member_name>" for provenance.

Raises:
ValueError

If the archive contains more than max_files members, or if cumulative extracted size exceeds max_total_bytes, or if a member has a path-traversal component (ZipSlip).

OSError

If the archive cannot be opened or a member cannot be read.

Return type:

Generator[dict[str, Any], None, None]

Notes

Temporary extraction happens inside a tempfile.mkdtemp() directory that is removed on exit (even on exception) via a try/finally block. The extracted files are read and their chunks forwarded; the files themselves are not streamed — each member is fully extracted before its reader is called.
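The documented safety checks (member count, cumulative uncompressed size, ZipSlip path traversal) can be sketched like this; the function name and error messages are illustrative:

```python
import io
import zipfile
from pathlib import PurePosixPath

def check_archive(zf, max_files=10_000, max_total_bytes=2**31):
    """Validate an open ZipFile before extraction (sketch of documented checks)."""
    infos = zf.infolist()
    if len(infos) > max_files:
        raise ValueError(f"archive has {len(infos)} members (limit {max_files})")
    total = sum(info.file_size for info in infos)   # uncompressed sizes
    if total > max_total_bytes:
        raise ValueError(f"expands to {total} bytes (limit {max_total_bytes})")
    for info in infos:
        parts = PurePosixPath(info.filename).parts
        if info.filename.startswith("/") or ".." in parts:
            raise ValueError(f"unsafe member path: {info.filename}")

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docs/report.txt", "hello")
with zipfile.ZipFile(buf) as zf:
    check_archive(zf)            # a well-behaved archive passes
```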

infer_source_type: bool = True#

Auto-infer source_type for each member via SourceType.infer when the caller did not supply source_type in source_provenance. Default: True.

input_file: Path[source]#

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

max_files: int = 10000#

Maximum file count inside the archive. Default: 10,000.

max_total_bytes: int = 2147483648#

Maximum cumulative extracted size. Default: 2 GB.

reader_kwargs: dict[str, dict[str, Any]][source]#

Per-extension keyword arguments forwarded to sub-reader constructors.

Enables reader-specific options for individual file types inside the archive. Keys are lower-case file extensions (with leading dot), values are dicts of kwargs forwarded to the corresponding reader.

Example — transcribe MP3 files but not others:

ZipReader(
    input_file=Path("corpus.zip"),
    reader_kwargs={
        ".mp3": {"transcribe": True, "whisper_model": "small"},
        ".mp4": {"transcribe": True},
        ".jpg": {"backend": "easyocr"},
    },
)

Global kwargs passed to the ZipReader constructor (**kwargs) are merged under per-extension overrides, so per-extension values always win.
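The documented merge order amounts to a dict update in which per-extension entries overwrite globals (a minimal sketch; merge_kwargs is not a real API):

```python
def merge_kwargs(global_kwargs, reader_kwargs, ext):
    """Per-extension values win over globals, as documented."""
    return {**global_kwargs, **reader_kwargs.get(ext, {})}

global_kwargs = {"transcribe": False, "default_language": "en"}
reader_kwargs = {".mp3": {"transcribe": True, "whisper_model": "small"}}

assert merge_kwargs(global_kwargs, reader_kwargs, ".mp3") == {
    "transcribe": True,            # per-extension override wins
    "default_language": "en",      # global survives
    "whisper_model": "small",
}
assert merge_kwargs(global_kwargs, reader_kwargs, ".jpg") == global_kwargs
```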

skip_unsupported: bool = True#

Skip members with unregistered extensions instead of raising.

source_provenance: dict[str, Any][source]#

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None#

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
classmethod subclass_by_type()[source]#

Return a copy of the extension → reader class registry.

Returns:
dict

Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
classmethod supported_types()[source]#

Return a sorted list of file extensions supported by registered readers.

Returns:
list of str

Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
validate_input()[source]#

Assert that the input file exists and is readable.

Raises:
ValueError

If input_file does not exist or is not a regular file.

Return type:

None

Notes

Called automatically by get_documents before iterating. Can also be called eagerly after construction to fail fast.

Examples

>>> reader = DocumentReader.create(Path("missing.txt"))
>>> reader.validate_input()
Traceback (most recent call last):
    ...
ValueError: Input file does not exist: missing.txt