ZipReader#

class scikitplot.corpus.ZipReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, max_files=10000, max_total_bytes=2147483648, skip_unsupported=True, infer_source_type=True, reader_kwargs=<factory>)[source]#

Generic ZIP archive reader — dispatches each member to its natural reader.

Extracts all supported members from a .zip archive into a temporary directory, then calls DocumentReader.create on each file. Documents from all members are yielded in a single stream.

This reader intentionally overrides ALTOReader's ".zip" registration. To use ALTOReader directly, instantiate it explicitly.

Parameters:
input_filepathlib.Path

Path to the .zip archive.

max_filesint, optional

Maximum number of files allowed in the archive. Archives exceeding this limit raise ValueError before extraction begins. Default: 10,000.

max_total_bytesint, optional

Maximum cumulative uncompressed size. Extraction halts if this limit is exceeded (zip-bomb prevention). Default: 2 GB.

skip_unsupportedbool, optional

When True (default), members whose extension is not registered in DocumentReader are silently skipped. When False, an unregistered extension raises ValueError.

Notes

ALTOReader coexistence: ALTOReader registers on ".zip" and is used when ZipReader is not imported. Importing scikitplot.corpus._readers triggers both imports; ZipReader is imported after ALTOReader, so its ".zip" registration wins. To use ALTOReader on a known ALTO archive, instantiate it directly:

reader = ALTOReader(input_file=Path("alto_corpus.zip"))
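The "last registration wins" behavior can be pictured as a plain dict keyed by extension; this sketch uses stand-in strings rather than the real reader classes:

```python
# Illustrative only: the registry maps extensions to reader classes,
# and a later registration for the same key overwrites the earlier one.
registry: dict[str, str] = {}
registry[".zip"] = "ALTOReader"  # registered when ALTOReader is imported
registry[".zip"] = "ZipReader"   # imported afterwards, so it wins
assert registry[".zip"] == "ZipReader"
```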

Temporary directory: Extracted files land in tempfile.mkdtemp(). The directory is deleted automatically when get_documents() completes (including on exception). If you iterate manually without exhausting the generator, call reader.close() to clean up.
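The cleanup contract can be sketched with a plain generator and a temporary directory (stand-in names; the real reader yields CorpusDocument instances):

```python
import os
import shutil
import tempfile

def extract_and_yield():
    """Sketch of the documented cleanup: the temp dir is removed when the
    generator finishes, raises, or is closed explicitly."""
    tmpdir = tempfile.mkdtemp()
    try:
        yield tmpdir       # stand-in for the first document
        yield "second-doc"
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)

gen = extract_and_yield()
tmpdir = next(gen)             # partial iteration: directory still exists
assert os.path.isdir(tmpdir)
gen.close()                    # analogous to reader.close(): runs the finally block
assert not os.path.isdir(tmpdir)
```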

Reader kwargs forwarding: All constructor kwargs beyond the explicit parameters are forwarded to each sub-reader constructed for the members. This means you can pass transcribe=True and it will reach any AudioReader or VideoReader instances created for audio/video members.

Examples

Any ZIP containing a mix of supported files:

>>> from pathlib import Path
>>> from scikitplot.corpus._base import DocumentReader
>>> import scikitplot.corpus._readers
>>> reader = DocumentReader.create(Path("corpus.zip"))
>>> type(reader).__name__
'ZipReader'
>>> docs = list(reader.get_documents())

Explicit instantiation with custom limits:

>>> reader = ZipReader(
...     input_file=Path("large_corpus.zip"),
...     max_files=500,
...     max_total_bytes=500 * 1024 * 1024,
... )
chunker: ChunkerBase | None = None#

Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for one or more sources.

Accepts any mix of file paths, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:
*inputspathlib.Path or str

One or more source paths or URL strings. Each element is classified independently:

  • str matching ^https?:// (case-insensitive) — treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; Path normalization collapses the double slash (https:// becomes https:/) and breaks URL detection.

  • str not matching the URL pattern — treated as a local file path and converted to pathlib.Path internally.

  • pathlib.Path — always treated as a local file path and dispatched by extension via the reader registry.

Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.

chunkerChunkerBase or None, optional

Chunker injected into every reader. Default: None.

filter_FilterBase or None, optional

Filter injected into every reader. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override the source_file label. Only applied when inputs contains exactly one source. Default: None.

default_languagestr or None, optional

ISO 639-1 language code applied to all sources. Default: None.

source_typeSourceType, list[SourceType or None], or None, optional

Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.

source_titlestr or None, optional

Title propagated into every yielded document. Default: None.

source_authorstr or None, optional

Author propagated into every yielded document. Default: None.

source_datestr or None, optional

ISO 8601 publication date. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier (file sources only). Default: None.

isbnstr or None, optional

ISBN (file sources only). Default: None.

**kwargsAny

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:
DocumentReader

A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:
ValueError

If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.

TypeError

If any element of inputs is not a str or pathlib.Path.

Parameters:
  • inputs (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | list[SourceType | None] | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.
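The classification rule can be sketched as follows (the regex name and classify function are illustrative, not the library's internals):

```python
import re
from pathlib import Path

_URL_RE = re.compile(r"^https?://", re.IGNORECASE)

def classify(source):
    """Return 'url' or 'file' following the documented rule."""
    if isinstance(source, str) and _URL_RE.match(source):
        return "url"
    return "file"

assert classify("https://example.com/page") == "url"
assert classify("HTTPS://EXAMPLE.COM") == "url"        # case-insensitive
assert classify("report.pdf") == "file"
# A Path is always treated as a local file path; note that wrapping a URL
# in Path would also have collapsed its '//' during normalization:
assert classify(Path("https://example.com/page")) == "file"
```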

Per-source source_type: When passing multiple inputs with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)

Reader-specific kwargs: extra keyword arguments are forwarded verbatim via **kwargs to each concrete reader constructor.

Examples

Single file (backward-compatible):

>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
default_language: str | None = None#

ISO 639-1 language code to assign when the source has no language info.

property file_name: str#

Effective filename used in document labels.

Returns filename_override when set; otherwise returns input_file.name.

Returns:
str

File name string (not a full path).

Examples

>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
file_type: ClassVar[str] = '.zip'#

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None] = ['.zip']#

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
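The documented precedence can be sketched as a small resolver over class attributes (illustrative stub classes; not the registry's real code):

```python
def extensions_for(reader_cls):
    """file_types wins over file_type when both are defined, as documented."""
    exts = getattr(reader_cls, "file_types", None)
    if exts:
        return [e.lower() for e in exts]
    return [reader_cls.file_type.lower()]

class ImageReaderStub:
    file_types = [".png", ".jpg", ".jpeg"]

class ZipReaderStub:
    file_type = ".zip"
    file_types = [".zip"]

assert extensions_for(ImageReaderStub) == [".png", ".jpg", ".jpeg"]
assert extensions_for(ZipReaderStub) == [".zip"]   # file_types takes precedence
```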

filename_override: str | None = None#

Override for the source_file label in generated documents.

filter_: FilterBase | None = None#

Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)[source]#

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.

Parameters:
manifest_pathpathlib.Path or str

Path to the manifest file. Supported formats:

  • .txt / .manifest — one source per line.

  • .json — a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).

chunkerChunkerBase or None, optional

Chunker applied to all sources. Default: None.

filter_FilterBase or None, optional

Filter applied to all sources. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Override source type for all sources. Default: None.

source_titlestr or None, optional

Override title for all sources. Default: None.

source_authorstr or None, optional

Override author for all sources. Default: None.

source_datestr or None, optional

Override date for all sources. Default: None.

collection_idstr or None, optional

Collection identifier. Default: None.

doistr or None, optional

DOI override. Default: None.

isbnstr or None, optional

ISBN override. Default: None.

encodingstr, optional

Text encoding for .txt manifests. Default: "utf-8".

**kwargsAny

Forwarded to each reader constructor.

Returns:
_MultiSourceReader

Multi-source reader chaining all manifest entries.

Raises:
ValueError

If manifest_path does not exist or is empty after filtering blank and comment lines.

ValueError

If the manifest format is not recognised.

Parameters:
  • manifest_path (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • encoding (str)

  • kwargs (Any)

Return type:

_MultiSourceReader

Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024",
}

String source_type values are coerced via SourceType(value); an invalid value raises ValueError.
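The coercion behaves like a standard str-valued Enum lookup; this stand-in mirrors the documented behavior (member names and values here are assumptions, not the real enum):

```python
from enum import Enum

class SourceType(str, Enum):   # stand-in, not the real class
    RESEARCH = "research"
    PODCAST = "podcast"

assert SourceType("research") is SourceType.RESEARCH
try:
    SourceType("blog-post-typo")
except ValueError as exc:
    print(exc)  # an invalid value raises ValueError, as documented
```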

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
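A parser matching the documented manifest formats might look like this (a sketch under stated assumptions; parse_manifest is not part of the public API):

```python
import json
from pathlib import Path

def parse_manifest(path, encoding="utf-8"):
    """Return the list of sources from a .txt/.manifest or .json manifest."""
    p = Path(path)
    if p.suffix == ".json":
        entries = json.loads(p.read_text(encoding=encoding))
        # entries are plain strings or objects with at least a "source" key
        return [e["source"] if isinstance(e, dict) else e for e in entries]
    sources = []
    for line in p.read_text(encoding=encoding).splitlines():
        line = line.strip()
        if line and not line.startswith("#"):   # skip blanks and comments
            sources.append(line)
    return sources
```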
classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.

Parameters:
urlstr

Full URL string. Must start with http:// or https://.

chunkerChunkerBase or None, optional

Chunker to inject. Default: None.

filter_FilterBase or None, optional

Filter to inject. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override for the source_file label. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Semantic label for the source. Default: None.

source_titlestr or None, optional

Title of the source work. Default: None.

source_authorstr or None, optional

Primary author. Default: None.

source_datestr or None, optional

Publication date in ISO 8601 format. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier. Default: None.

isbnstr or None, optional

International Standard Book Number. Default: None.

**kwargsAny

Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:
DocumentReader

YouTubeReader or WebReader instance.

Raises:
ValueError

If url does not start with http:// or https://.

ImportError

If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).

Parameters:
  • url (str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url — you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())
get_documents()[source]#

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

  1. validate_input — fail fast if file is missing.

  2. get_raw_chunks — format-specific text extraction.

  3. Chunker (if set) — sub-segments each raw block.

  4. CorpusDocument construction with validated schema.

  5. Filter — discards noise documents.
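Under stated assumptions (stand-in helpers, simplified document dicts), steps 2-5 can be sketched as:

```python
def pipeline(raw_chunks, chunker=None, keep=lambda doc: True):
    """Illustrative sketch of the documented pipeline; not the real code."""
    chunk_index = 0
    for raw in raw_chunks:                            # step 2: raw extraction
        pieces = chunker(raw) if chunker else [raw]   # step 3: optional chunker
        for piece in pieces:
            doc = {"text": piece, "chunk_index": chunk_index}  # step 4
            chunk_index += 1              # monotone across all sub-chunks
            if keep(doc):                 # step 5: filter discards noise
                yield doc

docs = list(pipeline(["alpha beta", "gamma"], chunker=str.split))
assert [d["chunk_index"] for d in docs] == [0, 1, 2]
```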

Yields:
CorpusDocument

Validated documents that passed the filter.

Raises:
ValueError

If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.

Omitted-document statistics are logged at INFO level after processing each file.

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
get_raw_chunks()[source]#

Extract ZIP and yield raw chunks from all supported members.

Each member is dispatched to DocumentReader.create, which selects the appropriate reader by file extension. The member’s raw chunks are yielded inline, as if the member files had been passed directly.

Yields:
dict[str, Any]

Raw chunk dicts from each member reader’s get_raw_chunks() call. The source_file key is set to "<archive_name>/<member_name>" for provenance.

Raises:
ValueError

If the archive contains more than max_files members, or if cumulative extracted size exceeds max_total_bytes, or if a member has a path-traversal component (ZipSlip).

OSError

If the archive cannot be opened or a member cannot be read.

Return type:

Generator[dict[str, Any], None, None]

Notes

Temporary extraction happens inside a tempfile.mkdtemp() directory that is removed on exit (even on exception) via a try/finally block. The extracted files are read and their chunks forwarded; the files themselves are not streamed — each member is fully extracted before its reader is called.
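The documented safety checks (member count, cumulative uncompressed size, ZipSlip path traversal) can be sketched like this; the function name and error messages are illustrative:

```python
import io
import zipfile
from pathlib import PurePosixPath

def check_archive(zf, max_files=10_000, max_total_bytes=2**31):
    """Validate an open ZipFile before extraction (sketch of documented checks)."""
    infos = zf.infolist()
    if len(infos) > max_files:
        raise ValueError(f"archive has {len(infos)} members (limit {max_files})")
    total = sum(info.file_size for info in infos)   # uncompressed sizes
    if total > max_total_bytes:
        raise ValueError(f"expands to {total} bytes (limit {max_total_bytes})")
    for info in infos:
        parts = PurePosixPath(info.filename).parts
        if info.filename.startswith("/") or ".." in parts:
            raise ValueError(f"unsafe member path: {info.filename}")

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docs/report.txt", "hello")
with zipfile.ZipFile(buf) as zf:
    check_archive(zf)            # a well-behaved archive passes
```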

infer_source_type: bool = True#

Auto-infer source_type for each member via SourceType.infer when the caller did not supply source_type in source_provenance. Default: True.

input_file: Path[source]#

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

max_files: int = 10000#

Maximum file count inside the archive. Default: 10,000.

max_total_bytes: int = 2147483648#

Maximum cumulative extracted size. Default: 2 GB.

reader_kwargs: dict[str, dict[str, Any]][source]#

Per-extension keyword arguments forwarded to sub-reader constructors.

Enables reader-specific options for individual file types inside the archive. Keys are lower-case file extensions (with leading dot), values are dicts of kwargs forwarded to the corresponding reader.

Example — transcribe MP3 files but not others:

ZipReader(
    input_file=Path("corpus.zip"),
    reader_kwargs={
        ".mp3": {"transcribe": True, "whisper_model": "small"},
        ".mp4": {"transcribe": True},
        ".jpg": {"backend": "easyocr"},
    },
)

Global kwargs passed to the ZipReader constructor (**kwargs) are merged under per-extension overrides, so per-extension values always win.
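The documented merge order amounts to a dict update in which per-extension entries overwrite globals (a minimal sketch; merge_kwargs is not a real API):

```python
def merge_kwargs(global_kwargs, reader_kwargs, ext):
    """Per-extension values win over globals, as documented."""
    return {**global_kwargs, **reader_kwargs.get(ext, {})}

global_kwargs = {"transcribe": False, "default_language": "en"}
reader_kwargs = {".mp3": {"transcribe": True, "whisper_model": "small"}}

assert merge_kwargs(global_kwargs, reader_kwargs, ".mp3") == {
    "transcribe": True,            # per-extension override wins
    "default_language": "en",      # global survives
    "whisper_model": "small",
}
assert merge_kwargs(global_kwargs, reader_kwargs, ".jpg") == global_kwargs
```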

skip_unsupported: bool = True#

Skip members with unregistered extensions instead of raising.

source_provenance: dict[str, Any][source]#

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None#

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
classmethod subclass_by_type()[source]#

Return a copy of the extension → reader class registry.

Returns:
dict

Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
classmethod supported_types()[source]#

Return a sorted list of file extensions supported by registered readers.

Returns:
list of str

Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
validate_input()[source]#

Assert that the input file exists and is readable.

Raises:
ValueError

If input_file does not exist or is not a regular file.

Return type:

None

Notes

Called automatically by get_documents before iterating. Can also be called eagerly after construction to fail fast.

Examples

>>> reader = DocumentReader.create(Path("missing.txt"))
>>> reader.validate_input()
Traceback (most recent call last):
    ...
ValueError: Input file does not exist: missing.txt