ImageReader#

class scikitplot.corpus.ImageReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, backend='tesseract', ocr_lang=None, min_confidence=None, max_file_bytes=104857600, preprocess_grayscale=False, yield_raw=False, yield_raw_bytes=False)[source]#

OCR-based text extraction from raster image files.

Iterates over all frames in the image (for multi-frame GIF and TIFF), runs OCR on each, and yields one raw chunk dict per frame. Single-frame images yield exactly one chunk.

Parameters:
input_filepathlib.Path

Path to the image file.

backendstr, optional

OCR backend to use. One of "tesseract" (default) or "easyocr". "tesseract" requires the Tesseract binary on PATH plus pip install pytesseract. "easyocr" requires pip install easyocr and downloads model weights on first use.

ocr_langstr or None, optional

Language hint for the OCR engine.

  • For Tesseract: standard Tesseract language string, e.g. "eng", "deu", "eng+deu".

  • For easyocr: ISO 639-1 language code, e.g. "en", "de".

None uses each backend’s default (usually English).

min_confidencefloat, optional

Minimum mean OCR confidence in [0.0, 1.0] (both backends). Chunks with lower confidence are still yielded but logged at DEBUG with a warning flag. Set to None to disable threshold-based logging. Default: None.
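The threshold is advisory rather than a filter — low-confidence chunks are flagged, never dropped. A minimal sketch of that behaviour (the `flag_low_confidence` helper and the `"low_confidence"` key are illustrative, not part of the API):

```python
def flag_low_confidence(chunks, min_confidence=None):
    """Yield every chunk; mark those below the threshold instead of dropping them."""
    for chunk in chunks:
        if min_confidence is not None and chunk["confidence"] < min_confidence:
            chunk = {**chunk, "low_confidence": True}  # flagged for DEBUG logging
        yield chunk
```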

max_file_bytesint, optional

Maximum file size in bytes. Files larger than this limit raise ValueError before any bytes are read. Default: 100 MB.
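The check can be pictured as a stat-only guard (`check_file_size` is an illustrative helper, not the actual implementation):

```python
from pathlib import Path

def check_file_size(path: Path, max_file_bytes: int = 104857600) -> None:
    """Raise ValueError when the file exceeds the limit, before any bytes are read."""
    size = path.stat().st_size  # metadata lookup only; the file is never opened
    if size > max_file_bytes:
        raise ValueError(
            f"{path.name}: {size} bytes exceeds max_file_bytes={max_file_bytes}"
        )
```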

preprocess_grayscalebool, optional

When True, convert each frame to grayscale ("L" mode) before OCR. This often improves Tesseract accuracy on coloured backgrounds. Default: False.

chunkerChunkerBase or None, optional

Inherited from DocumentReader.

filter_FilterBase or None, optional

Inherited from DocumentReader.

filename_overridestr or None, optional

Inherited from DocumentReader.

default_languagestr or None, optional

Inherited from DocumentReader.

Attributes:
file_typeslist of str

Class variable. Registered extensions: [".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff", ".tif", ".bmp"].

Raises:
ValueError

If backend is not one of the supported values.

ValueError

If the file exceeds max_file_bytes.

ImportError

If the required OCR library (or Pillow) is not installed.

Parameters:
  • input_file (Path)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_uri (str | None)

  • source_provenance (dict[str, Any])

  • backend (str)

  • ocr_lang (str | None)

  • min_confidence (float | None)

  • max_file_bytes (int)

  • preprocess_grayscale (bool)

  • yield_raw (bool)

  • yield_raw_bytes (bool)

See also

scikitplot.corpus._readers.VideoReader

Video transcription reader.

scikitplot.corpus._readers.PDFReader

PDF text extraction reader.

Notes

Tesseract accuracy tips:

  • Use preprocess_grayscale=True for images with coloured text backgrounds.

  • Install additional Tesseract language packs for non-English corpora: apt-get install tesseract-ocr-deu (German), etc.

  • Very low-resolution images (< 150 DPI) tend to produce poor results. Consider upscaling with Pillow before passing to the reader.

easyocr note: Model weights (~100 MB per language) are downloaded automatically on first use. In CI/Docker pipelines, pre-cache the weights or use Tesseract instead to avoid this network side effect.

Examples

Default Tesseract backend:

>>> from pathlib import Path
>>> reader = ImageReader(input_file=Path("scan.png"), ocr_lang="eng")
>>> docs = list(reader.get_documents())
>>> print(docs[0].text[:100])

Multi-page TIFF:

>>> reader = ImageReader(input_file=Path("document.tiff"))
>>> docs = list(reader.get_documents())
>>> print(f"Extracted {len(docs)} pages")
backend: str = 'tesseract'#

OCR backend. One of "tesseract" (default) or "easyocr".

chunker: ChunkerBase | None = None#

Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for one or more sources.

Accepts any mix of local path strings, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:
*inputspathlib.Path or str

One or more source paths or URL strings. Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents.

chunkerChunkerBase or None, optional

Chunker injected into every reader. Default: None.

filter_FilterBase or None, optional

Filter injected into every reader. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override the source_file label. Only applied when inputs contains exactly one source. Default: None.

default_languagestr or None, optional

ISO 639-1 language code applied to all sources. Default: None.

source_typeSourceType, list[SourceType or None], or None, optional

Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.
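The broadcast rule described above can be sketched as follows (`broadcast_source_type` is an illustrative helper; the length check raising ValueError on a mismatched list is an assumption, not documented behaviour):

```python
def broadcast_source_type(source_type, n_inputs):
    """Normalise source_type to one entry per input source."""
    if isinstance(source_type, list):
        if len(source_type) != n_inputs:
            raise ValueError("source_type list length must match the number of inputs")
        return source_type               # per-source assignment; None means "infer"
    return [source_type] * n_inputs      # a single value is broadcast to all sources
```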

source_titlestr or None, optional

Title propagated into every yielded document. Default: None.

source_authorstr or None, optional

Author propagated into every yielded document. Default: None.

source_datestr or None, optional

ISO 8601 publication date. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier (file sources only). Default: None.

isbnstr or None, optional

ISBN (file sources only). Default: None.

**kwargsAny

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:
DocumentReader

A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:
ValueError

If inputs is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.

TypeError

If any element of inputs is not a str or pathlib.Path.

Parameters:
  • inputs (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | list[SourceType | None] | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.
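The detection rule can be sketched directly from the pattern above (`is_url` is an illustrative helper, not part of the API):

```python
import re
from pathlib import Path

_URL_RE = re.compile(r"^https?://", re.IGNORECASE)

def is_url(source) -> bool:
    """True only for str inputs matching ^https?:// (case-insensitive)."""
    return isinstance(source, str) and bool(_URL_RE.match(source))
```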

Per-source source_type: When passing multiple inputs with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)

Reader-specific kwargs (forwarded via **kwargs):

Examples

Single file (backward-compatible):

>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
default_language: str | None = None#

ISO 639-1 language code to assign when the source has no language info.

property file_name: str#

Effective filename used in document labels.

Returns filename_override when set; otherwise returns input_file.name.

Returns:
str

File name string (not a full path).

Examples

>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
file_type: ClassVar[str | None] = None#

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None] = ['.png', '.jpg', '.jpeg', '.gif', '.webp', '.tiff', '.tif', '.bmp']#

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
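The precedence rule can be sketched as (`registered_extensions` is an illustrative helper, not how the registry actually resolves extensions):

```python
def registered_extensions(reader_cls):
    """Extensions a reader registers; file_types takes precedence over file_type."""
    if getattr(reader_cls, "file_types", None):
        return list(reader_cls.file_types)   # plural wins when both are defined
    if getattr(reader_cls, "file_type", None):
        return [reader_cls.file_type]        # single-extension fallback
    raise TypeError("a concrete reader must define file_type or file_types")
```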

filename_override: str | None = None#

Override for the source_file label in generated documents.

filter_: FilterBase | None = None#

Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)[source]#

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.

Parameters:
manifest_pathpathlib.Path or str

Path to the manifest file. Supported formats:

  • .txt / .manifest — one source per line.

  • .json — a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).
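The two formats above can be parsed roughly as follows (`parse_manifest` is an illustrative sketch; the real method handles `.manifest` files and per-entry overrides more fully):

```python
import json
from pathlib import Path

def parse_manifest(path: Path, encoding: str = "utf-8"):
    """Parse a manifest into a list of {"source": ...} entry dicts."""
    if path.suffix == ".json":
        entries = json.loads(path.read_text(encoding))
        # plain strings become {"source": ...}; object entries pass through
        return [e if isinstance(e, dict) else {"source": e} for e in entries]
    return [
        {"source": line.strip()}
        for line in path.read_text(encoding).splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
```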

chunkerChunkerBase or None, optional

Chunker applied to all sources. Default: None.

filter_FilterBase or None, optional

Filter applied to all sources. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Override source type for all sources. Default: None.

source_titlestr or None, optional

Override title for all sources. Default: None.

source_authorstr or None, optional

Override author for all sources. Default: None.

source_datestr or None, optional

Override date for all sources. Default: None.

collection_idstr or None, optional

Collection identifier. Default: None.

doistr or None, optional

DOI override. Default: None.

isbnstr or None, optional

ISBN override. Default: None.

encodingstr, optional

Text encoding for .txt manifests. Default: "utf-8".

**kwargsAny

Forwarded to each reader constructor.

Returns:
_MultiSourceReader

Multi-source reader chaining all manifest entries.

Raises:
ValueError

If manifest_path does not exist or is empty after filtering blank and comment lines.

ValueError

If the manifest format is not recognised.

Parameters:
  • manifest_path (Path | str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • encoding (str)

  • kwargs (Any)

Return type:

_MultiSourceReader

Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024",
}

String source_type values are coerced via SourceType(value); an invalid value raises ValueError.
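The coercion is standard Enum value lookup; a minimal sketch (the SourceType members shown here are assumptions — only the coercion pattern is the point):

```python
from enum import Enum

class SourceType(Enum):
    """Illustrative stand-in; the real enum lives in scikitplot.corpus."""
    RESEARCH = "research"
    PODCAST = "podcast"
    IMAGE = "image"

def coerce_source_type(value):
    """Pass enum members through; coerce strings via SourceType(value)."""
    return value if isinstance(value, SourceType) else SourceType(value)
```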

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)[source]#

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.

Parameters:
urlstr

Full URL string. Must start with http:// or https://.

chunkerChunkerBase or None, optional

Chunker to inject. Default: None.

filter_FilterBase or None, optional

Filter to inject. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override for the source_file label. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Semantic label for the source. Default: None.

source_titlestr or None, optional

Title of the source work. Default: None.

source_authorstr or None, optional

Primary author. Default: None.

source_datestr or None, optional

Publication date in ISO 8601 format. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier. Default: None.

isbnstr or None, optional

International Standard Book Number. Default: None.

**kwargsAny

Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:
DocumentReader

YouTubeReader or WebReader instance.

Raises:
ValueError

If url does not start with http:// or https://.

ImportError

If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).

Parameters:
  • url (str)

  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • filename_override (str | None)

  • default_language (str | None)

  • source_type (SourceType | None)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • doi (str | None)

  • isbn (str | None)

  • kwargs (Any)

Return type:

Self

Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url — you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
>>> docs = list(yt.get_documents())
get_documents()[source]#

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

  1. validate_input — fail fast if file is missing.

  2. get_raw_chunks — format-specific text extraction.

  3. Chunker (if set) — sub-segments each raw block.

  4. CorpusDocument construction with validated schema.

  5. Filter — discards noise documents.
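The five steps above can be sketched as a generator; every name below other than validate_input, get_raw_chunks, chunker, and filter_ is a placeholder (make_document, chunk, and keep are illustrative, not the real API):

```python
def make_document(block, chunk_index):
    """Placeholder for validated CorpusDocument construction (step 4)."""
    return {**block, "chunk_index": chunk_index}

def get_documents_sketch(reader):
    """Shape of the per-file pipeline; helper names are illustrative."""
    reader.validate_input()                              # 1. fail fast
    chunk_index = 0                                      # unique within one run
    for raw in reader.get_raw_chunks():                  # 2. extraction
        blocks = reader.chunker.chunk(raw) if reader.chunker else [raw]  # 3.
        for block in blocks:
            doc = make_document(block, chunk_index)      # 4. construction
            chunk_index += 1
            if reader.filter_ is None or reader.filter_.keep(doc):  # 5. filter
                yield doc
```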

Yields:
CorpusDocument

Validated documents that passed the filter.

Raises:
ValueError

If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.

Omitted-document statistics are logged at INFO level after processing each file.

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
get_raw_chunks()[source]#

Run OCR on each frame of the image and yield one chunk per frame.

Yields:
dict

Keys:

"source_type"

Always IMAGE. Promoted to CorpusDocument.source_type.

"text"

OCR-extracted text for this frame.

"section_type"

Always TEXT.

"page_number"

Zero-based frame/page index within the file. Promoted to CorpusDocument.page_number.

"image_width"

Width of the frame in pixels (goes to metadata).

"image_height"

Height of the frame in pixels (goes to metadata).

"confidence"

Mean confidence score in [0.0, 1.0] (both backends). Promoted to CorpusDocument.confidence.

"ocr_engine"

Name of the backend that produced this chunk. Promoted to CorpusDocument.ocr_engine.

"total_frames"

Total number of frames/pages in the image file (goes to metadata).
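Tesseract reports per-word confidences on a 0-100 scale, using -1 for non-text boxes; a hedged sketch of how a mean in [0.0, 1.0] might be derived (an assumption about the normalisation, not the library's actual implementation):

```python
def mean_confidence(word_confs):
    """Mean per-word confidence rescaled from Tesseract's 0-100 to [0.0, 1.0]."""
    valid = [c for c in word_confs if c >= 0]  # Tesseract marks non-text boxes -1
    return (sum(valid) / len(valid)) / 100.0 if valid else 0.0
```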

Raises:
ValueError

If the file exceeds max_file_bytes.

ImportError

If Pillow or the OCR library is not installed.

Return type:

Generator[dict[str, Any], None, None]

input_file: Path[source]#

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

max_file_bytes: int = 104857600#

Maximum file size in bytes. Default: 100 MB.

min_confidence: float | None = None#

Minimum OCR confidence for debug logging. None disables logging.

ocr_lang: str | None = None#

Language hint for the OCR engine. None uses the backend default.

preprocess_grayscale: bool = False#

Convert frames to grayscale before OCR when True.

source_provenance: dict[str, Any][source]#

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None#

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
classmethod subclass_by_type()[source]#

Return a copy of the extension → reader class registry.

Returns:
dict

Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
classmethod supported_types()[source]#

Return a sorted list of file extensions supported by registered readers.

Returns:
list of str

Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
validate_input()[source]#

Assert that the input file exists and is readable.

Raises:
ValueError

If input_file does not exist or is not a regular file.

Return type:

None

Notes

Called automatically by get_documents before iterating. Can also be called eagerly after construction to fail fast.

Examples

>>> reader = DocumentReader.create(Path("missing.txt"))
>>> reader.validate_input()
Traceback (most recent call last):
    ...
ValueError: Input file does not exist: missing.txt
yield_raw: bool = False#

Include raw pixel array in output chunks.

When True, each chunk includes "raw_tensor" (numpy ndarray, shape (H, W, C) uint8 RGB) and sets "modality" to "multimodal" (when OCR text is also present) or "image" (when text is empty). Requires numpy (always available with Pillow). Default: False.

yield_raw_bytes: bool = False#

Include raw encoded file bytes in output chunks.

When True, each chunk includes "raw_bytes" containing the compressed image bytes (JPEG/PNG/etc.) — useful for tf.io.decode_image or CV2 imdecode. Default: False.