CorpusDocument#

class scikitplot.corpus.CorpusDocument(doc_id, input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, modality=<factory>, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None, metadata=<factory>, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None)[source]#

Canonical representation of a single text chunk in a processed corpus.

A CorpusDocument is the unit of data that flows between every stage of the pipeline: readers produce them, chunkers subdivide them, filters accept or reject them, embedders enrich them, and exporters serialise them.

Parameters:
doc_id : str

Stable 16-character hex identifier. Generated deterministically from (source_type, input_path, chunk_index, text[:64]) via make_doc_id if not supplied. Must be non-empty.

input_path : str

Name or relative path of the original source file. Must be non-empty. Set from input_path.name by readers; do not include absolute paths, so that corpora remain portable across machines.

chunk_index : int

Zero-based ordinal of this chunk within the source document. Must be >= 0. Unique per (input_path, chunking_strategy) pair.

text : str

Cleaned, segmented text content of this chunk. Must be non-empty after stripping whitespace.

section_type : SectionType, optional

Semantic role of this chunk within its source document. Default: SectionType.TEXT.

chunking_strategy : ChunkingStrategy, optional

Strategy used to produce this chunk. Default: ChunkingStrategy.SENTENCE.

language : str or None, optional

ISO 639-1 language code. Default: None.

char_start : int or None, optional

Character offset of chunk start within the original document. Default: None.

char_end : int or None, optional

Character offset of chunk end (exclusive). Default: None.

embedding : array-like or None, optional

Dense vector representation of text. Stored as Any at runtime; the .pyi stub provides NDArray[float32] for type checkers. Default: None.

metadata : dict, optional

Open-ended key-value store for truly ad-hoc or format-specific fields (ISBN edition, translator, speaker, etc.). All keys must be strings. Default: empty dict.

source_type : SourceType, optional

Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …). Used as a typed pre-filter column. Default: SourceType.UNKNOWN.

source_title : str or None, optional

Title of the source work. Default: None.

source_author : str or None, optional

Primary author. Default: None.

source_date : str or None, optional

Publication date in ISO 8601 format. Default: None.

collection_id : str or None, optional

Identifier grouping related sources into one corpus. Default: None.

url : str or None, optional

Source URL for web-fetched documents. Default: None.

doi : str or None, optional

Digital Object Identifier. Default: None.

isbn : str or None, optional

International Standard Book Number. Default: None.

page_number : int or None, optional

Zero-based page index. Default: None.

paragraph_index : int or None, optional

Zero-based paragraph index within the page or document. Default: None.

line_number : int or None, optional

Zero-based line number. Default: None.

parent_doc_id : str or None, optional

doc_id of the parent chunk when this is a sub-division. Default: None.

act : int or None, optional

Act number (one-based) in a dramatic source. Default: None.

scene_number : int or None, optional

Scene number (one-based) within an act. Default: None.

timecode_start : float or None, optional

Start timecode in seconds (>= 0). Default: None.

timecode_end : float or None, optional

End timecode in seconds (>= timecode_start). Default: None.

confidence : float or None, optional

OCR or ASR confidence in [0.0, 1.0]. Default: None.

ocr_engine : str or None, optional

Name of the OCR engine used. Default: None.

bbox : tuple of float or None, optional

Bounding box (x0, y0, x1, y1). Must be a 4-tuple of floats. Default: None.

normalized_text : str or None, optional

Normalised text used by the embedding engine. Default: None.

tokens : list of str or None, optional

Tokenised word list (not included in repr or equality). Default: None.

lemmas : list of str or None, optional

Lemmatised tokens (not included in repr or equality). Default: None.

stems : list of str or None, optional

Stemmed tokens (not included in repr or equality). Default: None.

keywords : list of str or None, optional

Extracted keyphrases (not included in repr or equality). Default: None.

Attributes:
REQUIRED_FIELDS : tuple of str

Class-level tuple of field names that must be non-empty/non-negative for validate to pass.

Raises:
ValueError

If validate is called and any invariant is violated.

Parameters:
  • doc_id (str)

  • input_path (str)

  • chunk_index (int)

  • text (str)

  • section_type (SectionType)

  • chunking_strategy (ChunkingStrategy)

  • language (str | None)

  • char_start (int | None)

  • char_end (int | None)

  • embedding (Any | None)

  • modality (Modality)

  • raw_bytes (bytes | None)

  • raw_tensor (Any)

  • raw_shape (tuple[int, ...] | None)

  • raw_dtype (str | None)

  • frame_index (int | None)

  • content_hash (str | None)

  • metadata (dict[str, Any])

  • source_type (SourceType)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • url (str | None)

  • doi (str | None)

  • isbn (str | None)

  • page_number (int | None)

  • paragraph_index (int | None)

  • line_number (int | None)

  • parent_doc_id (str | None)

  • act (int | None)

  • scene_number (int | None)

  • timecode_start (float | None)

  • timecode_end (float | None)

  • confidence (float | None)

  • ocr_engine (str | None)

  • bbox (tuple[float, ...] | None)

  • normalized_text (str | None)

  • tokens (list[str] | None)

  • lemmas (list[str] | None)

  • stems (list[str] | None)

  • keywords (list[str] | None)

See also

scikitplot.corpus._base.DocumentReader

Produces CorpusDocuments.

scikitplot.corpus._pipeline.CorpusPipeline

Orchestrates the full flow.

Notes

Immutability convention: CorpusDocument is a mutable dataclass for performance, but pipeline stages must not mutate documents in-place after yielding them. Use replace to create modified copies.

Embedding storage: When exporting to CSV or JSON, the embedding array is serialised as a flat list of floats. When exporting to Parquet or HuggingFace format, the array is stored natively.

NLP list fields (tokens, lemmas, stems, keywords) are excluded from __repr__ and equality comparisons because they are large derived views of text.
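The flat-list embedding convention described above can be sketched as follows. This is an illustrative helper, not the library's own serialiser; the function name is hypothetical.

```python
import numpy as np

# Sketch of the CSV/JSON embedding convention: a dense float32 vector
# is flattened to a plain list of Python floats so the serialised row
# stays JSON-safe. `embedding_to_json_safe` is a hypothetical name.
def embedding_to_json_safe(embedding):
    """Return a flat list of floats, or None if no embedding is set."""
    if embedding is None:
        return None
    return np.asarray(embedding, dtype=np.float32).ravel().tolist()

vec = np.arange(4, dtype=np.float32)
print(embedding_to_json_safe(vec))  # [0.0, 1.0, 2.0, 3.0]
```

Parquet and HuggingFace exporters skip this step and store the array natively.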

Examples

Creating from factory with auto-generated id:

>>> doc = CorpusDocument.create(
...     input_path="corpus.xml",
...     chunk_index=3,
...     text="Das Kapital ist ein Werk von Marx.",
...     source_type=SourceType.BOOK,
...     source_author="Marx, Karl",
...     source_title="Das Kapital",
...     language="de",
...     page_number=42,
... )
>>> len(doc.doc_id)
16

Round-tripping to dict and back:

>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
REQUIRED_FIELDS: ClassVar[tuple[str, ...]] = ('doc_id', 'input_path')#

Fields that must be non-empty strings for validate to pass.

Notes

text is intentionally excluded from this tuple. For TEXT-modality documents, validate() enforces non-empty text directly. For raw-media documents (modality is IMAGE, AUDIO, or VIDEO), text may legitimately be None — the document carries its content in raw_tensor or raw_bytes instead.
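The modality-dependent rule above can be sketched as plain logic. This is a hypothetical re-statement, not the actual validate() implementation; in particular, the assumption that raw-media documents must carry raw_bytes or raw_tensor is inferred from the description, and the string modality values are illustrative.

```python
# Hypothetical sketch of the rule described above: TEXT documents must
# carry non-empty text; raw-media documents may omit text because their
# content lives in raw_bytes or raw_tensor (assumed requirement).
def text_invariant_ok(modality: str, text, raw_bytes, raw_tensor) -> bool:
    if modality == "text":
        return isinstance(text, str) and bool(text.strip())
    # IMAGE / AUDIO / VIDEO: text is optional, but some payload must exist
    return raw_bytes is not None or raw_tensor is not None

print(text_invariant_ok("text", "Hello.", None, None))      # True
print(text_invariant_ok("image", None, b"\xff\xd8", None))  # True
print(text_invariant_ok("text", "   ", None, None))         # False
```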

act: int | None = None#

Act number (one-based) in a dramatic source.

bbox: tuple[float, ...] | None = None#

Bounding box (x0, y0, x1, y1) of the text region.

property char_count: int#

Length of text in characters.

Returns:
int

Character count.

char_end: int | None = None#

Character offset of chunk end (exclusive) in source, or None.

char_start: int | None = None#

Character offset of chunk start in source, or None.

chunk_index: int[source]#

Zero-based position of this chunk within the source document.

chunking_strategy: ChunkingStrategy = 'sentence'[source]#

Strategy used to produce this chunk.

collection_id: str | None = None#

Identifier grouping related sources into one corpus.

confidence: float | None = None#

OCR or ASR confidence score in [0.0, 1.0].

content_hash: str | None = None#

SHA-256 hex digest (32 chars) of canonical content. Dedup key.

classmethod create(input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, metadata=None, doc_id=None, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None, modality=None, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None)[source]#

Validating factory constructor for CorpusDocument.

Preferred over direct dataclass instantiation because it auto-generates doc_id when not supplied and calls validate before returning.

Parameters:
input_path : str

Name of the source file.

chunk_index : int

Zero-based chunk position.

text : str

Text content of the chunk.

section_type : SectionType, optional

Semantic section label. Default: SectionType.TEXT.

chunking_strategy : ChunkingStrategy, optional

Segmentation strategy used. Default: ChunkingStrategy.SENTENCE.

language : str or None, optional

ISO 639-1 language code. Default: None.

char_start : int or None, optional

Character start offset. Default: None.

char_end : int or None, optional

Character end offset (exclusive). Default: None.

embedding : array-like or None, optional

Pre-computed embedding vector. Default: None.

metadata : dict or None, optional

Ad-hoc metadata. None is treated as an empty dict. Default: None.

doc_id : str or None, optional

Explicit document id. Auto-generated if None. Default: None.

source_type : SourceType, optional

Kind of source. Default: SourceType.UNKNOWN.

source_title : str or None, optional

Title of the source work. Default: None.

source_author : str or None, optional

Primary author. Default: None.

source_date : str or None, optional

Publication date (ISO 8601). Default: None.

collection_id : str or None, optional

Corpus collection identifier. Default: None.

url : str or None, optional

Source URL. Default: None.

doi : str or None, optional

Digital Object Identifier. Default: None.

isbn : str or None, optional

International Standard Book Number. Default: None.

page_number : int or None, optional

Zero-based page index. Default: None.

paragraph_index : int or None, optional

Zero-based paragraph index. Default: None.

line_number : int or None, optional

Zero-based line number. Default: None.

parent_doc_id : str or None, optional

doc_id of parent chunk. Default: None.

act : int or None, optional

Act number (one-based). Default: None.

scene_number : int or None, optional

Scene number (one-based). Default: None.

timecode_start : float or None, optional

Start timecode in seconds (>= 0). Default: None.

timecode_end : float or None, optional

End timecode in seconds. Default: None.

confidence : float or None, optional

OCR/ASR confidence in [0.0, 1.0]. Default: None.

ocr_engine : str or None, optional

OCR engine name. Default: None.

bbox : tuple of float or None, optional

Bounding box (x0, y0, x1, y1). Default: None.

normalized_text : str or None, optional

Pre-normalised text. Default: None.

tokens : list of str or None, optional

Tokenised words. Default: None.

lemmas : list of str or None, optional

Lemmatised tokens. Default: None.

stems : list of str or None, optional

Stemmed tokens. Default: None.

keywords : list of str or None, optional

Extracted keyphrases. Default: None.

Returns:
CorpusDocument

Validated document instance.

Raises:
ValueError

If any invariant from validate is violated.

Parameters:
  • input_path (str)

  • chunk_index (int)

  • text (str | None)

  • section_type (SectionType)

  • chunking_strategy (ChunkingStrategy)

  • language (str | None)

  • char_start (int | None)

  • char_end (int | None)

  • embedding (Any | None)

  • metadata (dict[str, Any] | None)

  • doc_id (str | None)

  • source_type (SourceType)

  • source_title (str | None)

  • source_author (str | None)

  • source_date (str | None)

  • collection_id (str | None)

  • url (str | None)

  • doi (str | None)

  • isbn (str | None)

  • page_number (int | None)

  • paragraph_index (int | None)

  • line_number (int | None)

  • parent_doc_id (str | None)

  • act (int | None)

  • scene_number (int | None)

  • timecode_start (float | None)

  • timecode_end (float | None)

  • confidence (float | None)

  • ocr_engine (str | None)

  • bbox (tuple[float, ...] | None)

  • normalized_text (str | None)

  • tokens (list[str] | None)

  • lemmas (list[str] | None)

  • stems (list[str] | None)

  • keywords (list[str] | None)

  • modality (Modality | None)

  • raw_bytes (bytes | None)

  • raw_tensor (Any)

  • raw_shape (tuple[int, ...] | None)

  • raw_dtype (str | None)

  • frame_index (int | None)

  • content_hash (str | None)

Return type:

CorpusDocument

Examples

>>> doc = CorpusDocument.create(
...     input_path="corpus.txt",
...     chunk_index=0,
...     text="Hello world.",
...     source_type=SourceType.BOOK,
...     language="en",
... )
>>> doc.validate()
>>> doc.has_embedding
False
doc_id: str[source]#

Stable 16-character hex identifier for this chunk.

doi: str | None = None#

Digital Object Identifier of the source.

embedding: Any | None = None#

Dense vector embedding, or None if not yet computed.

frame_index: int | None = None#

Zero-based frame index in a video or multi-frame image. Default: None.

classmethod from_dict(data)[source]#

Reconstruct a CorpusDocument from a plain dictionary.

Parameters:
data : dict

Dictionary as returned by to_dict. Enum fields are coerced from string values. bbox is restored from list to tuple. metadata defaults to empty dict if absent.

Returns:
CorpusDocument

Validated reconstructed document.

Raises:
ValueError

If required fields are missing or values are invalid.

Parameters:

data (dict[str, Any])

Return type:

Self

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
property has_embedding: bool#

Return True if an embedding has been attached to this document.

Returns:
bool

True when embedding is not None.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> doc.has_embedding
False
input_path: str[source]#

Name of the original source file (not an absolute path).

isbn: str | None = None#

International Standard Book Number of the source.

keywords: list[str] | None = None#

Extracted keyphrases for topic-level matching.

language: str | None = None#

ISO 639-1 language code, or None if unknown.

lemmas: list[str] | None = None#

Lemmatised token list.

line_number: int | None = None#

Zero-based line number within the document.

static make_content_hash(text=None, raw_bytes=None)[source]#

Compute a 32-char SHA-256 hex digest for deduplication.

Parameters:
text : str or None

Text content. Used when raw_bytes is None.

raw_bytes : bytes or None

Raw media bytes. Preferred over text when set.

Returns:
str

32-character hex SHA-256 prefix.

Parameters:
  • text (str | None)

  • raw_bytes (bytes | None)

Return type:

str

Notes

Empty / None inputs return a fixed sentinel value "0" * 32 (32 zeros) to ensure content_hash is always populated and the dedup logic is deterministic.
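The documented behaviour can be re-stated as a short sketch. This is a reimplementation from the description above, not the library source; the function name is illustrative.

```python
import hashlib

# Sketch of the documented hashing rules: SHA-256 truncated to 32 hex
# chars, raw_bytes preferred over text, and a fixed all-zero sentinel
# for empty/None input so content_hash is always populated.
def content_hash(text=None, raw_bytes=None):
    payload = raw_bytes if raw_bytes is not None else (
        text.encode("utf-8") if text else None
    )
    if not payload:
        return "0" * 32  # deterministic sentinel for empty/None input
    return hashlib.sha256(payload).hexdigest()[:32]

print(len(content_hash(text="Hello world.")))  # 32
print(content_hash())  # 00000000000000000000000000000000
```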

classmethod make_doc_id(input_path, chunk_index, text, source_type=SourceType.UNKNOWN)[source]#

Compute a deterministic 16-character hex document identifier.

The id is a SHA-1 prefix of "{source_type}:{input_path}:{chunk_index}:{text[:64]}". Identical inputs always produce the same id.

Parameters:
input_path : str

Name of the source file (not a full path).

chunk_index : int

Zero-based chunk position within the document.

text : str

Raw text content of the chunk (only the first 64 characters are used to keep hashing fast).

source_type : SourceType, optional

Source kind. Including this in the hash preimage prevents collisions when a BOOK chapter and a MOVIE subtitle share the same filename, chunk index, and opening text (Issue S-7). Default: SourceType.UNKNOWN.

Returns:
str

16-character lowercase hexadecimal string.

Return type:

str

Notes

Adding source_type to the hash preimage is a one-time breaking change for corpora built before this version. Existing corpora must be re-indexed when upgrading.

Examples

>>> CorpusDocument.make_doc_id("file.txt", 0, "Hello world.")
'...'  # deterministic 16-char hex
>>> (
...     CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.BOOK)
...     != CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.MOVIE)
... )
True
metadata: dict[str, Any][source]#

Truly ad-hoc format-specific metadata.

modality: Modality[source]#

Primary content modality. Default: Modality.TEXT.

normalized_text: str | None = None#

Normalised text used by the embedding engine.

ocr_engine: str | None = None#

Name of the OCR engine used.

page_number: int | None = None#

Zero-based page index within the source document.

paragraph_index: int | None = None#

Zero-based paragraph index within the page or document.

parent_doc_id: str | None = None#

doc_id of the parent chunk when this is a sub-division.

raw_bytes: bytes | None = None#

Raw encoded media bytes (e.g. JPEG bytes). None for text-only.

raw_dtype: str | None = None#

String dtype of raw_tensor (e.g. "uint8"). Default: None.

raw_shape: tuple[int, ...] | None = None#

Shape of raw_tensor as a plain Python tuple. Default: None.

raw_tensor: Any = None#

Decoded media array ready for model input. Shape conventions: image (H,W,C) uint8; audio (samples,) float32; video (T,H,W,C) uint8. None for text-only.

replace(**changes)[source]#

Return a new CorpusDocument with the specified fields replaced.

Parameters:
**changes : Any

Field names and new values. Only fields defined on CorpusDocument are accepted.

Returns:
CorpusDocument

New instance with changed fields; original is unchanged.

Raises:
ValueError

If an unknown field name is given.

Parameters:

changes (Any)

Return type:

Self

Examples

>>> import numpy as np
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> enriched = doc.replace(embedding=np.zeros(768, dtype=np.float32))
>>> enriched.has_embedding
True
>>> doc.has_embedding  # original unchanged
False
scene_number: int | None = None#

Scene number (one-based) within an act.

section_type: SectionType = 'text'[source]#

Semantic role of this chunk.

source_author: str | None = None#

Primary author of the source.

source_date: str | None = None#

Publication or creation date in ISO 8601 format.

source_title: str | None = None#

Title of the source work.

source_type: SourceType = 'unknown'[source]#

Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …).

stems: list[str] | None = None#

Stemmed token list.

text: str[source]#

Cleaned, segmented text content.

timecode_end: float | None = None#

End timecode in seconds.

timecode_start: float | None = None#

Start timecode in seconds for subtitle / video / audio sources.

to_dict(*, include_embedding=False)[source]#

Serialise to a plain Python dictionary.

Parameters:
include_embedding : bool, optional

When True, include the embedding field serialised as a flat list of floats (if present). Default: False — embeddings are excluded to keep dicts JSON-safe by default.

Returns:
dict

Shallow copy of all fields. Enum fields serialised as string values. bbox serialised as a list (JSON-compatible). metadata is a shallow copy.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Notes

This method does not call validate — it is designed to be fast and usable even on partially-constructed documents during debugging.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> isinstance(d["section_type"], str)
True
>>> d["source_type"]
'unknown'
to_flat_dict(*, include_embedding=False)[source]#

Serialise to a flat dictionary with metadata fields promoted to the top level.

Unlike to_dict, the metadata sub-dict is merged into the top level. Core fields take precedence over metadata fields with the same key name.

Parameters:
include_embedding : bool, optional

When True, include embedding as a list of floats. Default: False.

Returns:
dict

Flat dict suitable for a single row in a tabular export.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Notes

Metadata key collisions with core fields are logged as warnings.
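The precedence rule (core fields win over colliding metadata keys) can be sketched with plain dicts. The helper below is illustrative, not the library's implementation.

```python
# Sketch of the flattening rule: metadata keys are promoted to the top
# level, but core document fields win on key collisions. `flatten` is a
# hypothetical stand-in for the merge inside to_flat_dict.
def flatten(core: dict, metadata: dict) -> dict:
    # Spreading metadata first means core values overwrite colliding keys.
    return {**metadata, **core}

core = {"doc_id": "abc123", "text": "Hello."}
metadata = {"custom_key": "v", "text": "should-not-win"}
flat = flatten(core, metadata)
print(flat["custom_key"])  # v
print(flat["text"])        # Hello.
```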

Examples

>>> doc = CorpusDocument.create(
...     "f.txt", 0, "Hello.", metadata={"custom_key": "v"}
... )
>>> flat = doc.to_flat_dict()
>>> flat["custom_key"]
'v'
to_pandas_row(*, include_embedding=False)[source]#

Return a dict formatted for a single row in a pandas.DataFrame.

Parameters:
include_embedding : bool, optional

When True, include the embedding as a numpy array (not a list), allowing pandas to store it as an object column. Default: False.

Returns:
dict

Row dict with enums as strings. Embedding kept as-is when present and include_embedding=True.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Examples

>>> import pandas as pd
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> row = doc.to_pandas_row()
>>> pd.DataFrame([row])["text"][0]
'Hello.'
to_polars_row(*, include_embedding=False)[source]#

Return a dict formatted for a single row in a polars.DataFrame.

Parameters:
include_embedding : bool, optional

When True, include the embedding as a list of floats (polars does not accept numpy arrays directly in dict-based construction). Default: False.

Returns:
dict

Row dict. Embedding serialised as list[float] when present.

Parameters:

include_embedding (bool)

Return type:

dict[str, Any]

Examples

>>> import polars as pl
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> pl.DataFrame([doc.to_polars_row()])["text"][0]
'Hello.'
tokens: list[str] | None = None#

Whitespace-tokenised word list for STRICT / KEYWORD matching.

url: str | None = None#

Source URL for web-fetched documents.

validate()[source]#

Assert that all invariants hold. Raises on the first violation.

Raises:
ValueError

With an actionable message identifying the violated invariant and the offending value.

Warns:
UserWarning

When doi does not match the 10.XXXX/ prefix pattern. A warning (not a raise) is used because real-world DOIs are not always well-formed, and hard rejection would discard valid papers.

Return type:

None

Notes

Call validate() explicitly after constructing a document via the dataclass constructor. The create factory calls it automatically.
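The warn-don't-raise DOI behaviour described above can be sketched as follows. The exact regex used by validate() is an assumption based on the "10.XXXX/" prefix wording; only the pattern shape is implied by the docs.

```python
import re
import warnings

# Illustrative sketch of the DOI check: malformed DOIs warn rather than
# raise, so imperfect real-world records are kept. The pattern below is
# an assumed approximation of the documented "10.XXXX/" prefix rule.
DOI_PREFIX = re.compile(r"^10\.\d{4,9}/\S+")

def check_doi(doi):
    if doi is not None and not DOI_PREFIX.match(doi):
        warnings.warn(f"DOI {doi!r} does not match the 10.XXXX/ pattern")

check_doi("10.1000/xyz123")  # well-formed: silent
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_doi("not-a-doi")
print(len(caught))  # 1
```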

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "Hello world.")
>>> doc.validate()  # no exception
>>> bad = CorpusDocument(
...     doc_id="", input_path="f.txt", chunk_index=0, text="Hello."
... )
>>> bad.validate()
Traceback (most recent call last):
    ...
ValueError: CorpusDocument.doc_id must be a non-empty string; got ''
property word_count: int#

Number of whitespace-delimited tokens in text.

Returns:
int

Token count; 0 for empty text.

Examples

>>> doc = CorpusDocument.create("f.txt", 0, "One two three.")
>>> doc.word_count
3