CorpusDocument#

class scikitplot.corpus.CorpusDocument(doc_id, input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, modality=<factory>, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None, metadata=<factory>, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None)[source]#

Canonical representation of a single text chunk in a processed corpus.

A CorpusDocument is the unit of data that flows between every stage of the pipeline: readers produce them, chunkers subdivide them, filters accept or reject them, embedders enrich them, and exporters serialise them.

Parameters:
doc_id : str

Stable 16-character hex identifier. Generated deterministically from (source_type, input_path, chunk_index, text[:64]) via make_doc_id if not supplied. Must be non-empty.

input_path : str

Name or relative path of the original source file. Must be non-empty. Set from input_path.name by readers; do not include absolute paths, so that corpora remain portable across machines.

chunk_index : int

Zero-based ordinal of this chunk within the source document. Must be >= 0. Unique per (input_path, chunking_strategy) pair.

text : str

Cleaned, segmented text content of this chunk. Must be non-empty after stripping whitespace.

section_type : SectionType, optional

Semantic role of this chunk within its source document. Default: SectionType.TEXT.

chunking_strategy : ChunkingStrategy, optional

Strategy used to produce this chunk. Default: ChunkingStrategy.SENTENCE.

language : str or None, optional

ISO 639-1 language code. Default: None.

char_start : int or None, optional

Character offset of chunk start within the original document. Default: None.

char_end : int or None, optional

Character offset of chunk end (exclusive). Default: None.

embedding : array-like or None, optional

Dense vector representation of text. Stored as Any at runtime; the .pyi stub provides NDArray[float32] for type checkers. Default: None.

metadata : dict, optional

Open-ended key-value store for truly ad-hoc or format-specific fields (ISBN edition, translator, speaker, etc.). All keys must be strings. Default: empty dict.

source_type : SourceType, optional

Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …). Used as a typed pre-filter column. Default: SourceType.UNKNOWN.

source_title : str or None, optional

Title of the source work. Default: None.

source_author : str or None, optional

Primary author. Default: None.

source_date : str or None, optional

Publication date in ISO 8601 format. Default: None.

collection_id : str or None, optional

Identifier grouping related sources into one corpus. Default: None.

url : str or None, optional

Source URL for web-fetched documents. Default: None.

doi : str or None, optional

Digital Object Identifier. Default: None.

isbn : str or None, optional

International Standard Book Number. Default: None.

page_number : int or None, optional

Zero-based page index. Default: None.

paragraph_index : int or None, optional

Zero-based paragraph index within the page or document. Default: None.

line_number : int or None, optional

Zero-based line number. Default: None.

parent_doc_id : str or None, optional

doc_id of the parent chunk when this is a sub-division. Default: None.

act : int or None, optional

Act number (one-based) in a dramatic source. Default: None.

scene_number : int or None, optional

Scene number (one-based) within an act. Default: None.

timecode_start : float or None, optional

Start timecode in seconds (>= 0). Default: None.

timecode_end : float or None, optional

End timecode in seconds (>= timecode_start). Default: None.

confidence : float or None, optional

OCR or ASR confidence in [0.0, 1.0]. Default: None.

ocr_engine : str or None, optional

Name of the OCR engine used. Default: None.

bbox : tuple of float or None, optional

Bounding box (x0, y0, x1, y1). Must be a 4-tuple of floats. Default: None.

normalized_text : str or None, optional

Normalised text used by the embedding engine. Default: None.

tokens : list of str or None, optional

Tokenised word list (not included in repr or equality). Default: None.

lemmas : list of str or None, optional

Lemmatised tokens (not included in repr or equality). Default: None.

stems : list of str or None, optional

Stemmed tokens (not included in repr or equality). Default: None.

keywords : list of str or None, optional

Extracted keyphrases (not included in repr or equality). Default: None.

Attributes:
REQUIRED_FIELDS : tuple of str

Class-level tuple of field names that must be non-empty/non-negative for validate to pass.

Raises:
ValueError

If validate is called and any invariant is violated.

Parameters:
  • doc_id (str)

  • input_path (str)

  • chunk_index (int)

  • text (str)

  • section_type (SectionType)

  • chunking_strategy (ChunkingStrategy)

  • language (str | None)

  • char_start (int | None)

  • char_end (int | None)

  • embedding (Any | None)

  • modality (Modality)

  • raw_bytes (bytes | None)

  • raw_tensor (Any)

  • raw_shape (tuple[int, ...] | None)

  • raw_dtype (str | None)

  • frame_index (int | None)

  • content_hash (str | None)

  • metadata (dict[str, Any])

  • source_type (SourceType)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • url (str | None)

  • doi (str | None)

  • isbn (str | None)

  • page_number (int | None)

  • paragraph_index (int | None)

  • line_number (int | None)

  • parent_doc_id (str | None)

  • act (int | None)

  • scene_number (int | None)

  • timecode_start (float | None)

  • timecode_end (float | None)

  • confidence (float | None)

  • ocr_engine (str | None)

  • bbox (tuple[float, ...] | None)

  • normalized_text (str | None)

  • tokens (list[str] | None)

  • lemmas (list[str] | None)

  • stems (list[str] | None)

  • keywords (list[str] | None)

See also

scikitplot.corpus._base.DocumentReader

Produces CorpusDocuments.

scikitplot.corpus._pipeline.CorpusPipeline

Orchestrates the full flow.

Notes

Immutability convention: CorpusDocument is a mutable dataclass for performance, but pipeline stages must not mutate documents in-place after yielding them. Use replace to create modified copies.

Embedding storage: When exporting to CSV or JSON, the embedding array is serialised as a flat list of floats. When exporting to Parquet or HuggingFace format, the array is stored natively.

NLP list fields (tokens, lemmas, stems, keywords) are excluded from __repr__ and equality comparisons because they are large derived views of text.
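The flat-list embedding convention described above can be sketched as follows. This is an illustrative helper, not the library's own serialiser; the function name is hypothetical.

```python
import numpy as np

# Sketch of the CSV/JSON embedding convention: a dense float32 vector
# is flattened to a plain list of Python floats so the serialised row
# stays JSON-safe. `embedding_to_json_safe` is a hypothetical name.
def embedding_to_json_safe(embedding):
    """Return a flat list of floats, or None if no embedding is set."""
    if embedding is None:
        return None
    return np.asarray(embedding, dtype=np.float32).ravel().tolist()

vec = np.arange(4, dtype=np.float32)
print(embedding_to_json_safe(vec))  # [0.0, 1.0, 2.0, 3.0]
```

Parquet and HuggingFace exporters skip this step and store the array natively.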

Examples

Creating from factory with auto-generated id:

>>> doc = CorpusDocument.create(
...     input_path="corpus.xml",
...     chunk_index=3,
...     text="Das Kapital ist ein Werk von Marx.",
...     source_type=SourceType.BOOK,
...     source_author="Marx, Karl",
...     source_title="Das Kapital",
...     language="de",
...     page_number=42,
... )
>>> len(doc.doc_id)
16

Round-tripping to dict and back:

>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
REQUIRED_FIELDS: ClassVar[tuple[str, ...]] = ('doc_id', 'input_path')#

Fields that must be non-empty strings for validate to pass.

Notes

text is intentionally excluded from this tuple. For TEXT-modality documents, validate() enforces non-empty text directly. For raw-media documents (modality is IMAGE, AUDIO, or VIDEO), text may legitimately be None — the document carries its content in raw_tensor or raw_bytes instead.
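The modality-dependent rule above can be sketched as plain logic. This is a hypothetical re-statement, not the actual validate() implementation; in particular, the assumption that raw-media documents must carry raw_bytes or raw_tensor is inferred from the description, and the string modality values are illustrative.

```python
# Hypothetical sketch of the rule described above: TEXT documents must
# carry non-empty text; raw-media documents may omit text because their
# content lives in raw_bytes or raw_tensor (assumed requirement).
def text_invariant_ok(modality: str, text, raw_bytes, raw_tensor) -> bool:
    if modality == "text":
        return isinstance(text, str) and bool(text.strip())
    # IMAGE / AUDIO / VIDEO: text is optional, but some payload must exist
    return raw_bytes is not None or raw_tensor is not None

print(text_invariant_ok("text", "Hello.", None, None))      # True
print(text_invariant_ok("image", None, b"\xff\xd8", None))  # True
print(text_invariant_ok("text", "   ", None, None))         # False
```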

act: int | None = None#

Act number (one-based) in a dramatic source.

bbox: tuple[float, ...] | None = None#

Bounding box (x0, y0, x1, y1) of the text region.

property char_count: int#

Length of text in characters.

Returns:
int

Character count.

char_end: int | None = None#

Character offset of chunk end (exclusive) in source, or None.

char_start: int | None = None#

Character offset of chunk start in source, or None.

chunk_index: int[source]#

Zero-based position of this chunk within the source document.

chunking_strategy: ChunkingStrategy = 'sentence'[source]#

Strategy used to produce this chunk.

collection_id: str | None = None#

Identifier grouping related sources into one corpus.

confidence: float | None = None#

OCR or ASR confidence score in [0.0, 1.0].

content_hash: str | None = None#

SHA-256 hex digest (32 chars) of canonical content. Dedup key.

classmethod create(input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, metadata=None, doc_id=None, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None, modality=None, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None)[source]#

Validating factory constructor for CorpusDocument.

Preferred over direct dataclass instantiation because it auto-generates doc_id when not supplied and calls validate before returning.

Parameters:
input_path : str

Name of the source file.

chunk_index : int

Zero-based chunk position.

text : str

Text content of the chunk.

section_type : SectionType, optional

Semantic section label. Default: SectionType.TEXT.

chunking_strategy : ChunkingStrategy, optional

Segmentation strategy used. Default: ChunkingStrategy.SENTENCE.

language : str or None, optional

ISO 639-1 language code. Default: None.

char_start : int or None, optional

Character start offset. Default: None.

char_end : int or None, optional

Character end offset (exclusive). Default: None.

embedding : array-like or None, optional

Pre-computed embedding vector. Default: None.

metadata : dict or None, optional

Ad-hoc metadata. None is treated as an empty dict. Default: None.

doc_id : str or None, optional

Explicit document id. Auto-generated if None. Default: None.

source_type : SourceType, optional

Kind of source. Default: SourceType.UNKNOWN.

source_title : str or None, optional

Title of the source work. Default: None.

source_author : str or None, optional

Primary author. Default: None.

source_date : str or None, optional

Publication date (ISO 8601). Default: None.

collection_id : str or None, optional

Corpus collection identifier. Default: None.

url : str or None, optional

Source URL. Default: None.

doi : str or None, optional

Digital Object Identifier. Default: None.

isbn : str or None, optional

International Standard Book Number. Default: None.

page_number : int or None, optional

Zero-based page index. Default: None.

paragraph_index : int or None, optional

Zero-based paragraph index. Default: None.

line_number : int or None, optional

Zero-based line number. Default: None.

parent_doc_id : str or None, optional

doc_id of parent chunk. Default: None.

act : int or None, optional

Act number (one-based). Default: None.

scene_number : int or None, optional

Scene number (one-based). Default: None.

timecode_start : float or None, optional

Start timecode in seconds (>= 0). Default: None.

timecode_end : float or None, optional

End timecode in seconds. Default: None.

confidence : float or None, optional

OCR/ASR confidence in [0.0, 1.0]. Default: None.

ocr_engine : str or None, optional

OCR engine name. Default: None.

bbox : tuple of float or None, optional

Bounding box (x0, y0, x1, y1). Default: None.

normalized_text : str or None, optional

Pre-normalised text. Default: None.

tokens : list of str or None, optional

Tokenised words. Default: None.

lemmas : list of str or None, optional

Lemmatised tokens. Default: None.

stems : list of str or None, optional

Stemmed tokens. Default: None.

keywords : list of str or None, optional

Extracted keyphrases. Default: None.

Returns:
CorpusDocument

Validated document instance.

Raises:
ValueError

If any invariant from validate is violated.

Parameters:
  • input_path (str)

  • chunk_index (int)

  • text (str | None)

  • section_type (SectionType)

  • chunking_strategy (ChunkingStrategy)

  • language (str | None)

  • char_start (int | None)

  • char_end (int | None)

  • embedding (Any | None)

  • metadata (dict[str, Any] | None)

  • doc_id (str | None)

  • source_type (SourceType)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • url (str | None)

  • doi (str | None)

  • isbn (str | None)

  • page_number (int | None)

  • paragraph_index (int | None)

  • line_number (int | None)

  • parent_doc_id (str | None)

  • act (int | None)

  • scene_number (int | None)

  • timecode_start (float | None)

  • timecode_end (float | None)

  • confidence (float | None)

  • ocr_engine (str | None)

  • bbox (tuple[float, ...] | None)

  • normalized_text (str | None)

  • tokens (list[str] | None)

  • lemmas (list[str] | None)

  • stems (list[str] | None)

  • keywords (list[str] | None)

  • modality (Modality | None)

  • raw_bytes (bytes | None)

  • raw_tensor (Any)

  • raw_shape (tuple[int, ...] | None)

  • raw_dtype (str | None)

  • frame_index (int | None)

  • content_hash (str | None)

Return type:

CorpusDocument

Examples

>>> doc = CorpusDocument.create(
...     input_path="corpus.txt",
...     chunk_index=0,
...     text="Hello world.",
...     source_type=SourceType.BOOK,
...     language="en",
... )
>>> doc.validate()
>>> doc.has_embedding
False
doc_id: str[source]#

Stable 16-character hex identifier for this chunk.

doi: str | None = None#

Digital Object Identifier of the source.

embedding: Any | None = None#

Dense vector embedding, or None if not yet computed.

frame_index: int | None = None#

Zero-based frame index in a video or multi-frame image. Default: None.

classmethod from_dict(data)[source]#

Reconstruct a CorpusDocument from a plain dictionary.

Parameters:
data : dict

Dictionary as returned by to_dict. Enum fields are coerced from string values. bbox is restored from list to tuple. metadata defaults to empty dict if absent.

Returns:
CorpusDocument

Validated reconstructed document.

Raises:
ValueError

If required fields are missing or values are invalid.

Parameters:

data (dict[str, Any])

Return type:

Self

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
property has_embedding: bool#

Return True if an embedding has been attached to this document.

Returns:
bool

True when embedding is not None.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> doc.has_embedding
False
input_path: str[source]#

Name of the original source file (not an absolute path).

isbn: str | None = None#

International Standard Book Number of the source.

keywords: list[str] | None = None#

Extracted keyphrases for topic-level matching.

language: str | None = None#

ISO 639-1 language code, or None if unknown.

lemmas: list[str] | None = None#

Lemmatised token list.

line_number: int | None = None#

Zero-based line number within the document.

static make_content_hash(text=None, raw_bytes=None)[source]#

Compute a 32-char SHA-256 hex digest for deduplication.

Parameters:
text : str or None

Text content. Used when raw_bytes is None.

raw_bytes : bytes or None

Raw media bytes. Preferred over text when set.

Returns:
str

32-character hex SHA-256 prefix.

Parameters:
  • text (str | None)

  • raw_bytes (bytes | None)

Return type:

str

Notes

Empty / None inputs return a fixed sentinel value "0" * 32 (32 zeros) to ensure content_hash is always populated and the dedup logic is deterministic.
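The documented behaviour can be re-stated as a short sketch. This is a reimplementation from the description above, not the library source; the function name is illustrative.

```python
import hashlib

# Sketch of the documented hashing rules: SHA-256 truncated to 32 hex
# chars, raw_bytes preferred over text, and a fixed all-zero sentinel
# for empty/None input so content_hash is always populated.
def content_hash(text=None, raw_bytes=None):
    payload = raw_bytes if raw_bytes is not None else (
        text.encode("utf-8") if text else None
    )
    if not payload:
        return "0" * 32  # deterministic sentinel for empty/None input
    return hashlib.sha256(payload).hexdigest()[:32]

print(len(content_hash(text="Hello world.")))  # 32
print(content_hash())  # 00000000000000000000000000000000
```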

classmethod make_doc_id(input_path, chunk_index, text, source_type=SourceType.UNKNOWN)[source]#

Compute a deterministic 16-character hex document identifier.

The id is a SHA-1 prefix of "{source_type}:{input_path}:{chunk_index}:{text[:64]}". Identical inputs always produce the same id.

Parameters:
input_path : str

Name of the source file (not a full path).

chunk_index : int

Zero-based chunk position within the document.

text : str

Raw text content of the chunk (only the first 64 characters are used to keep hashing fast).

source_type : SourceType, optional

Source kind. Including this in the hash preimage prevents collisions when a BOOK chapter and a MOVIE subtitle share the same filename, chunk index, and opening text (Issue S-7). Default: SourceType.UNKNOWN.

Returns:
str

16-character lowercase hexadecimal string.

Return type:

str

Notes

Adding source_type to the hash preimage is a one-time breaking change for corpora built before this version. Existing corpora must be re-indexed when upgrading.

Examples

>>> CorpusDocument.make_doc_id("file.txt", 0, "Hello world.")
'...'  # deterministic 16-char hex
>>> (
...     CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.BOOK)
...     != CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.MOVIE)
... )
True
metadata: dict[str, Any][source]#

Truly ad-hoc format-specific metadata.

modality: Modality[source]#

Primary content modality. Default: Modality.TEXT.

normalized_text: str | None = None#

Normalised text used by the embedding engine.

ocr_engine: str | None = None#

Name of the OCR engine used.

page_number: int | None = None#

Zero-based page index within the source document.

paragraph_index: int | None = None#

Zero-based paragraph index within the page or document.

parent_doc_id: str | None = None#

doc_id of the parent chunk when this is a sub-division.

raw_bytes: bytes | None = None#

Raw encoded media bytes (e.g. JPEG bytes). None for text-only.

raw_dtype: str | None = None#

String dtype of raw_tensor (e.g. "uint8"). Default: None.

raw_shape: tuple[int, ...] | None = None#

Shape of raw_tensor as a plain Python tuple. Default: None.

raw_tensor: Any = None#

Decoded media array ready for model input. Shape conventions: image (H,W,C) uint8; audio (samples,) float32; video (T,H,W,C) uint8. None for text-only.

replace(**changes)[source]#

Return a new CorpusDocument with the specified fields replaced.

Parameters:
**changes : Any

Field names and new values. Only fields defined on CorpusDocument are accepted.

Returns:
CorpusDocument

New instance with changed fields; original is unchanged.

Raises:
ValueError

If an unknown field name is given.

Parameters:

changes (Any)

Return type:

Self

Examples

>>> import numpy as np
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> enriched = doc.replace(embedding=np.zeros(768, dtype=np.float32))
>>> enriched.has_embedding
True
>>> doc.has_embedding  # original unchanged
False
scene_number: int | None = None#

Scene number (one-based) within an act.

section_type: SectionType = 'text'[source]#

Semantic role of this chunk.

source_author: str | None = None#

Primary author of the source.

source_date: str | None = None#

Publication or creation date in ISO 8601 format.

source_title: str | None = None#

Title of the source work.

source_type: SourceType = 'unknown'[source]#

Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …).

stems: list[str] | None = None#

Stemmed token list.

text: str[source]#

Cleaned, segmented text content.

timecode_end: float | None = None#

End timecode in seconds.

timecode_start: float | None = None#

Start timecode in seconds for subtitle / video / audio sources.

to_dict(*, include_embedding=False)[source]#

Serialise to a plain Python dictionary.

Parameters:
include_embedding : bool, optional

When True, include the embedding field serialised as a flat list of floats (if present). Default: False — embeddings are excluded to keep dicts JSON-safe by default.

Returns:
dict

Shallow copy of all fields. Enum fields serialised as string values. bbox serialised as a list (JSON-compatible). metadata is a shallow copy.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Notes

This method does not call validate — it is designed to be fast and usable even on partially-constructed documents during debugging.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> isinstance(d["section_type"], str)
True
>>> d["source_type"]
'unknown'
to_flat_dict(*, include_embedding=False)[source]#

Serialise to a flat dictionary with metadata fields promoted to the top level.

Unlike to_dict, the metadata sub-dict is merged into the top level. Core fields take precedence over metadata fields with the same key name.

Parameters:
include_embedding : bool, optional

When True, include embedding as a list of floats. Default: False.

Returns:
dict

Flat dict suitable for a single row in a tabular export.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Notes

Metadata key collisions with core fields are logged as warnings.
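The precedence rule (core fields win over colliding metadata keys) can be sketched with plain dicts. The helper below is illustrative, not the library's implementation.

```python
# Sketch of the flattening rule: metadata keys are promoted to the top
# level, but core document fields win on key collisions. `flatten` is a
# hypothetical stand-in for the merge inside to_flat_dict.
def flatten(core: dict, metadata: dict) -> dict:
    # Spreading metadata first means core values overwrite colliding keys.
    return {**metadata, **core}

core = {"doc_id": "abc123", "text": "Hello."}
metadata = {"custom_key": "v", "text": "should-not-win"}
flat = flatten(core, metadata)
print(flat["custom_key"])  # v
print(flat["text"])        # Hello.
```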

Examples

>>> doc = CorpusDocument.create(
...     "f.txt", 0, "Hello.", metadata={"custom_key": "v"}
... )
>>> flat = doc.to_flat_dict()
>>> flat["custom_key"]
'v'
to_pandas_row(*, include_embedding=False)[source]#

Return a dict formatted for a single row in a pandas.DataFrame.

Parameters:
include_embedding : bool, optional

When True, include the embedding as a numpy array (not a list), allowing pandas to store it as an object column. Default: False.

Returns:
dict

Row dict with enums as strings. Embedding kept as-is when present and include_embedding=True.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Examples

>>> import pandas as pd
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> row = doc.to_pandas_row()
>>> pd.DataFrame([row])["text"][0]
'Hello.'
to_polars_row(*, include_embedding=False)[source]#

Return a dict formatted for a single row in a polars.DataFrame.

Parameters:
include_embedding : bool, optional

When True, include the embedding as a list of floats (polars does not accept numpy arrays directly in dict-based construction). Default: False.

Returns:
dict

Row dict. Embedding serialised as list[float] when present.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Examples

>>> import polars as pl
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> pl.DataFrame([doc.to_polars_row()])["text"][0]
'Hello.'
tokens: list[str] | None = None#

Whitespace-tokenised word list for STRICT / KEYWORD matching.

url: str | None = None#

Source URL for web-fetched documents.

validate()[source]#

Assert that all invariants hold. Raises on the first violation.

Raises:
ValueError

With an actionable message identifying the violated invariant and the offending value.

Warns:
UserWarning

When doi does not match the 10.XXXX/ prefix pattern. A warning (not a raise) is used because real-world DOIs are not always well-formed, and hard rejection would discard valid papers.

Return type:

None

Notes

Call validate() explicitly after constructing a document via the dataclass constructor. The create factory calls it automatically.
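The warn-don't-raise DOI behaviour described above can be sketched as follows. The exact regex used by validate() is an assumption based on the "10.XXXX/" prefix wording; only the pattern shape is implied by the docs.

```python
import re
import warnings

# Illustrative sketch of the DOI check: malformed DOIs warn rather than
# raise, so imperfect real-world records are kept. The pattern below is
# an assumed approximation of the documented "10.XXXX/" prefix rule.
DOI_PREFIX = re.compile(r"^10\.\d{4,9}/\S+")

def check_doi(doi):
    if doi is not None and not DOI_PREFIX.match(doi):
        warnings.warn(f"DOI {doi!r} does not match the 10.XXXX/ pattern")

check_doi("10.1000/xyz123")  # well-formed: silent
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_doi("not-a-doi")
print(len(caught))  # 1
```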

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello world.")
>>> doc.validate()  # no exception
>>> bad = CorpusDocument(
...     doc_id="", input_path="f.txt", chunk_index=0, text="Hello."
... )
>>> bad.validate()
Traceback (most recent call last):
    ...
ValueError: CorpusDocument.doc_id must be a non-empty string; got ''
property word_count: int#

Number of whitespace-delimited tokens in text.

Returns:
int

Token count; 0 for empty text.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "One two three.")
>>> doc.word_count
3