CorpusDocument#
- class scikitplot.corpus.CorpusDocument(doc_id, input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, modality=<factory>, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None, metadata=<factory>, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None)[source]#
Canonical representation of a single text chunk in a processed corpus.
A CorpusDocument is the unit of data that flows between every stage of the pipeline: readers produce them, chunkers subdivide them, filters accept or reject them, embedders enrich them, and exporters serialise them.
- Parameters:
- doc_id (str)
Stable 16-character hex identifier. Generated deterministically from (source_type, input_path, chunk_index, text[:64]) via make_doc_id if not supplied. Must be non-empty.
- input_path (str)
Name or relative path of the original source file. Must be non-empty. Readers set it from input_path.name; do not include absolute paths, so that corpora stay portable across machines.
- chunk_index (int)
Zero-based ordinal of this chunk within the source document. Must be >= 0. Unique per (input_path, chunking_strategy) pair.
- text (str)
Cleaned, segmented text content of this chunk. For TEXT-modality documents it must be non-empty after stripping whitespace; raw-media documents may leave it as None (see REQUIRED_FIELDS).
- section_type (SectionType, optional)
Semantic role of this chunk within its source document. Default: SectionType.TEXT.
- chunking_strategy (ChunkingStrategy, optional)
Strategy used to produce this chunk. Default: ChunkingStrategy.SENTENCE.
- language (str or None, optional)
ISO 639-1 language code. Default: None.
- char_start (int or None, optional)
Character offset of chunk start within the original document. Default: None.
- char_end (int or None, optional)
Character offset of chunk end (exclusive). Default: None.
- embedding (array-like or None, optional)
Dense vector representation of text. Stored as Any at runtime; the .pyi stub provides NDArray[float32] for type checkers. Default: None.
- metadata (dict, optional)
Open-ended key-value store for truly ad-hoc or format-specific fields (ISBN edition, translator, speaker, etc.). All keys must be strings. Default: empty dict.
- source_type (SourceType, optional)
Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …). Used as a typed pre-filter column. Default: SourceType.UNKNOWN.
- source_title (str or None, optional)
Title of the source work. Default: None.
- source_author (str or None, optional)
Primary author. Default: None.
- source_date (str or None, optional)
Publication date in ISO 8601 format. Default: None.
- collection_id (str or None, optional)
Identifier grouping related sources into one corpus. Default: None.
- url (str or None, optional)
Source URL for web-fetched documents. Default: None.
- doi (str or None, optional)
Digital Object Identifier. Default: None.
- isbn (str or None, optional)
International Standard Book Number. Default: None.
- page_number (int or None, optional)
Zero-based page index. Default: None.
- paragraph_index (int or None, optional)
Zero-based paragraph index within the page or document. Default: None.
- line_number (int or None, optional)
Zero-based line number. Default: None.
- parent_doc_id (str or None, optional)
doc_id of the parent chunk when this is a sub-division. Default: None.
- act (int or None, optional)
Act number (one-based) in a dramatic source. Default: None.
- scene_number (int or None, optional)
Scene number (one-based) within an act. Default: None.
- timecode_start (float or None, optional)
Start timecode in seconds (>= 0). Default: None.
- timecode_end (float or None, optional)
End timecode in seconds (>= timecode_start). Default: None.
- confidence (float or None, optional)
OCR or ASR confidence in [0.0, 1.0]. Default: None.
- ocr_engine (str or None, optional)
Name of the OCR engine used. Default: None.
- bbox (tuple of float or None, optional)
Bounding box (x0, y0, x1, y1). Must be a 4-tuple of floats. Default: None.
- normalized_text (str or None, optional)
Normalised text used by the embedding engine. Default: None.
- tokens (list of str or None, optional)
Tokenised word list (not included in repr or equality). Default: None.
- lemmas (list of str or None, optional)
Lemmatised tokens (not included in repr or equality). Default: None.
- stems (list of str or None, optional)
Stemmed tokens (not included in repr or equality). Default: None.
- keywords (list of str or None, optional)
Extracted keyphrases (not included in repr or equality). Default: None.
- Attributes:
- REQUIRED_FIELDS (tuple of str)
Class-level tuple of field names that must be non-empty/non-negative for validate to pass.
- Raises:
- ValueError
If validate is called and any invariant is violated.
- Parameters:
doc_id (str)
input_path (str)
chunk_index (int)
text (str)
section_type (SectionType)
chunking_strategy (ChunkingStrategy)
language (str | None)
char_start (int | None)
char_end (int | None)
embedding (Any | None)
modality (Modality)
raw_bytes (bytes | None)
raw_tensor (Any)
raw_dtype (str | None)
frame_index (int | None)
content_hash (str | None)
source_type (SourceType)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
url (str | None)
doi (str | None)
isbn (str | None)
page_number (int | None)
paragraph_index (int | None)
line_number (int | None)
parent_doc_id (str | None)
act (int | None)
scene_number (int | None)
timecode_start (float | None)
timecode_end (float | None)
confidence (float | None)
ocr_engine (str | None)
normalized_text (str | None)
See also
scikitplot.corpus._base.DocumentReader: Produces CorpusDocuments.
scikitplot.corpus._pipeline.CorpusPipeline: Orchestrates the full flow.
Notes
Immutability convention: CorpusDocument is a mutable dataclass for performance, but pipeline stages must not mutate documents in-place after yielding them. Use replace to create modified copies.
Embedding storage: When exporting to CSV or JSON, the embedding array is serialised as a flat list of floats. When exporting to Parquet or HuggingFace format, the array is stored natively.
NLP list fields (tokens, lemmas, stems, keywords) are excluded from __repr__ and equality comparisons because they are large derived views of text.
Examples
Creating from factory with auto-generated id:
>>> doc = CorpusDocument.create(
...     input_path="corpus.xml",
...     chunk_index=3,
...     text="Das Kapital ist ein Werk von Marx.",
...     source_type=SourceType.BOOK,
...     source_author="Marx, Karl",
...     source_title="Das Kapital",
...     language="de",
...     page_number=42,
... )
>>> len(doc.doc_id)
16
Round-tripping to dict and back:
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
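The Notes above say that the NLP list fields stay out of __repr__ and equality comparisons. A minimal sketch of how a dataclass can express that, using a hypothetical stand-in class (TokenisedDoc is not part of the library, and the library may implement the exclusion differently):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TokenisedDoc:
    """Hypothetical stand-in for CorpusDocument with one NLP list field."""

    doc_id: str
    text: str
    # repr=False keeps the list out of the generated __repr__;
    # compare=False keeps it out of the generated __eq__.
    tokens: Optional[List[str]] = field(default=None, repr=False, compare=False)


with_tokens = TokenisedDoc("abc", "Hello world.", tokens=["Hello", "world"])
without_tokens = TokenisedDoc("abc", "Hello world.")

# The two instances differ only in the derived token list.
print(with_tokens == without_tokens)
print(repr(with_tokens))
```

Under this pattern two documents that differ only in their derived token lists compare equal, which matches the description of those fields as views of text.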
- REQUIRED_FIELDS: ClassVar[tuple[str, ...]] = ('doc_id', 'input_path')#
Fields that must be non-empty strings for validate to pass.
Notes
text is intentionally excluded from this tuple. For TEXT-modality documents, validate() enforces non-empty text directly. For raw-media documents (modality is IMAGE, AUDIO, or VIDEO), text may legitimately be None; the document carries its content in raw_tensor or raw_bytes instead.
- chunking_strategy: ChunkingStrategy = 'sentence'[source]#
Strategy used to produce this chunk.
- classmethod create(input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.SENTENCE, language=None, char_start=None, char_end=None, embedding=None, metadata=None, doc_id=None, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None, modality=None, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None)[source]#
Validated factory constructor for CorpusDocument.
Preferred over direct dataclass instantiation because it auto-generates doc_id when not supplied and calls validate before returning.
- Parameters:
- input_path (str)
Name of the source file.
- chunk_index (int)
Zero-based chunk position.
- text (str)
Text content of the chunk.
- section_type (SectionType, optional)
Semantic section label. Default: SectionType.TEXT.
- chunking_strategy (ChunkingStrategy, optional)
Segmentation strategy used. Default: ChunkingStrategy.SENTENCE.
- language (str or None, optional)
ISO 639-1 language code. Default: None.
- char_start (int or None, optional)
Character start offset. Default: None.
- char_end (int or None, optional)
Character end offset (exclusive). Default: None.
- embedding (array-like or None, optional)
Pre-computed embedding vector. Default: None.
- metadata (dict or None, optional)
Ad-hoc metadata. None is treated as empty dict. Default: None.
- doc_id (str or None, optional)
Explicit document id. Auto-generated if None. Default: None.
- source_type (SourceType, optional)
Kind of source. Default: SourceType.UNKNOWN.
- source_title (str or None, optional)
Title of the source work. Default: None.
- source_author (str or None, optional)
Primary author. Default: None.
- source_date (str or None, optional)
Publication date (ISO 8601). Default: None.
- collection_id (str or None, optional)
Corpus collection identifier. Default: None.
- url (str or None, optional)
Source URL. Default: None.
- doi (str or None, optional)
Digital Object Identifier. Default: None.
- isbn (str or None, optional)
International Standard Book Number. Default: None.
- page_number (int or None, optional)
Zero-based page index. Default: None.
- paragraph_index (int or None, optional)
Zero-based paragraph index. Default: None.
- line_number (int or None, optional)
Zero-based line number. Default: None.
- parent_doc_id (str or None, optional)
doc_id of parent chunk. Default: None.
- act (int or None, optional)
Act number (one-based). Default: None.
- scene_number (int or None, optional)
Scene number (one-based). Default: None.
- timecode_start (float or None, optional)
Start timecode in seconds (>= 0). Default: None.
- timecode_end (float or None, optional)
End timecode in seconds. Default: None.
- confidence (float or None, optional)
OCR/ASR confidence in [0.0, 1.0]. Default: None.
- ocr_engine (str or None, optional)
OCR engine name. Default: None.
- bbox (tuple of float or None, optional)
Bounding box (x0, y0, x1, y1). Default: None.
- normalized_text (str or None, optional)
Pre-normalised text. Default: None.
- tokens (list of str or None, optional)
Tokenised words. Default: None.
- lemmas (list of str or None, optional)
Lemmatised tokens. Default: None.
- stems (list of str or None, optional)
Stemmed tokens. Default: None.
- keywords (list of str or None, optional)
Extracted keyphrases. Default: None.
- Returns:
- CorpusDocument
Validated document instance.
- Raises:
- ValueError
If any invariant from validate is violated.
- Parameters:
input_path (str)
chunk_index (int)
text (str | None)
section_type (SectionType)
chunking_strategy (ChunkingStrategy)
language (str | None)
char_start (int | None)
char_end (int | None)
embedding (Any | None)
doc_id (str | None)
source_type (SourceType)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
url (str | None)
doi (str | None)
isbn (str | None)
page_number (int | None)
paragraph_index (int | None)
line_number (int | None)
parent_doc_id (str | None)
act (int | None)
scene_number (int | None)
timecode_start (float | None)
timecode_end (float | None)
confidence (float | None)
ocr_engine (str | None)
normalized_text (str | None)
modality (Modality | None)
raw_bytes (bytes | None)
raw_tensor (Any)
raw_dtype (str | None)
frame_index (int | None)
content_hash (str | None)
Examples
>>> doc = CorpusDocument.create(
...     input_path="corpus.txt",
...     chunk_index=0,
...     text="Hello world.",
...     source_type=SourceType.BOOK,
...     language="en",
... )
>>> doc.validate()
>>> doc.has_embedding
False
- frame_index: int | None = None#
Zero-based frame index in a video or multi-frame image. Default: None.
- classmethod from_dict(data)[source]#
Reconstruct a CorpusDocument from a plain dictionary.
- Parameters:
- data (dict)
Dictionary as returned by to_dict. Enum fields are coerced from string values. bbox is restored from list to tuple. metadata defaults to empty dict if absent.
- Returns:
- CorpusDocument
Validated reconstructed document.
- Raises:
- ValueError
If required fields are missing or values are invalid.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
- property has_embedding: bool#
Return True if an embedding has been attached to this document.
- Returns:
- bool
True when embedding is not None.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> doc.has_embedding
False
- static make_content_hash(text=None, raw_bytes=None)[source]#
Compute a 32-character hex prefix of a SHA-256 digest for deduplication.
- Parameters:
- text (str or None)
Text content. Used when raw_bytes is None.
- raw_bytes (bytes or None)
Raw media bytes. Preferred over text when set.
- Returns:
- str
32-character hex SHA-256 prefix.
Notes
Empty or None inputs return a fixed sentinel value, "0" * 32 (32 zeros), so that content_hash is always populated and the dedup logic is deterministic.
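The hashing rule described above can be sketched with the standard library. This is a minimal illustration, not the library's implementation: sketch_content_hash is a hypothetical name, UTF-8 encoding of text is an assumption, and so is the choice to fall back to text when raw_bytes is empty.

```python
import hashlib

# Sentinel described in the Notes: 32 zeros for empty or None inputs.
SENTINEL = "0" * 32


def sketch_content_hash(text=None, raw_bytes=None):
    """Return a 32-character hex prefix of SHA-256 over the content."""
    if raw_bytes:
        # Raw media bytes are preferred over text when set.
        payload = raw_bytes
    elif text:
        # Assumption: text is hashed as UTF-8.
        payload = text.encode("utf-8")
    else:
        return SENTINEL
    return hashlib.sha256(payload).hexdigest()[:32]


print(sketch_content_hash(text="Hello."))
print(sketch_content_hash())
```

A full SHA-256 hex digest is 64 characters, so the 32-character value is a truncated prefix rather than the whole digest.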
- classmethod make_doc_id(input_path, chunk_index, text, source_type=SourceType.UNKNOWN)[source]#
Compute a deterministic 16-character hex document identifier.
The id is a SHA-1 prefix of "{source_type}:{input_path}:{chunk_index}:{text[:64]}". Identical inputs always produce the same id.
- Parameters:
- input_path (str)
Name of the source file (not a full path).
- chunk_index (int)
Zero-based chunk position within the document.
- text (str)
Raw text content of the chunk (only the first 64 characters are used to keep hashing fast).
- source_type (SourceType, optional)
Source kind. Including this in the hash preimage prevents collisions when a BOOK chapter and a MOVIE subtitle share the same filename, chunk index, and opening text (Issue S-7). Default: SourceType.UNKNOWN.
- Returns:
- str
16-character lowercase hexadecimal string.
- Parameters:
input_path (str)
chunk_index (int)
text (str)
source_type (SourceType)
Notes
Adding source_type to the hash preimage is a one-time breaking change for corpora built before this version. Existing corpora must be re-indexed when upgrading.
Examples
>>> CorpusDocument.make_doc_id("file.txt", 0, "Hello world.")
'...'  # deterministic 16-char hex
>>> (
...     CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.BOOK)
...     != CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.MOVIE)
... )
True
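The preimage format described above can be reproduced with hashlib. The sketch below is an approximation under stated assumptions: the SourceType enum here is a hypothetical stand-in, the enum's string value is assumed to be what enters the preimage, and UTF-8 encoding is assumed. Ids computed this way may therefore differ from the library's.

```python
import hashlib
from enum import Enum


class SourceType(str, Enum):
    """Hypothetical stand-in for the library's SourceType enum."""

    UNKNOWN = "unknown"
    BOOK = "book"
    MOVIE = "movie"


def sketch_doc_id(input_path, chunk_index, text, source_type=SourceType.UNKNOWN):
    """Return the first 16 hex characters of SHA-1 over the preimage."""
    # Assumption: the enum value (not its repr) is interpolated.
    preimage = f"{source_type.value}:{input_path}:{chunk_index}:{text[:64]}"
    return hashlib.sha1(preimage.encode("utf-8")).hexdigest()[:16]


book_id = sketch_doc_id("f.txt", 0, "Hi", SourceType.BOOK)
movie_id = sketch_doc_id("f.txt", 0, "Hi", SourceType.MOVIE)
print(book_id != movie_id)
```

Because only text[:64] enters the preimage, two chunks with the same file name, index, source type, and first 64 characters receive the same id under this scheme, regardless of what follows.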
- raw_shape: tuple[int, ...] | None = None#
Shape of raw_tensor as a plain Python tuple. Default: None.
- raw_tensor: Any = None#
Decoded media array ready for model input. Shape conventions: image (H, W, C) uint8; audio (samples,) float32; video (T, H, W, C) uint8. None for text-only documents.
- replace(**changes)[source]#
Return a new CorpusDocument with the specified fields replaced.
- Parameters:
- **changes (Any)
Field names and new values. Only fields defined on CorpusDocument are accepted.
- Returns:
- CorpusDocument
New instance with changed fields; original is unchanged.
- Raises:
- ValueError
If an unknown field name is given.
- Parameters:
changes (Any)
Examples
>>> import numpy as np
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> enriched = doc.replace(embedding=np.zeros(768, dtype=np.float32))
>>> enriched.has_embedding
True
>>> doc.has_embedding  # original unchanged
False
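The copy-on-change behaviour documented for replace can be sketched on a small stand-in dataclass. MiniDoc is hypothetical, and building on dataclasses.replace is an assumption about one reasonable implementation. The explicit field check is there because dataclasses.replace alone raises TypeError for unknown names, while the method above documents ValueError.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MiniDoc:
    """Hypothetical three-field stand-in for CorpusDocument."""

    doc_id: str
    text: str
    embedding: Optional[Any] = None

    def replace(self, **changes):
        # Reject names that are not dataclass fields with a ValueError.
        known = {f.name for f in dataclasses.fields(self)}
        unknown = sorted(set(changes) - known)
        if unknown:
            raise ValueError(f"Unknown field(s): {unknown}")
        # dataclasses.replace builds a new instance; self is untouched.
        return dataclasses.replace(self, **changes)


original = MiniDoc("abc", "Hello.")
enriched = original.replace(embedding=[0.0, 0.0, 0.0])
print(original.embedding is None, enriched.embedding is not None)
```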
- section_type: SectionType = 'text'[source]#
Semantic role of this chunk.
- source_type: SourceType = 'unknown'[source]#
Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …).
- timecode_start: float | None = None#
Start timecode in seconds for subtitle / video / audio sources.
- to_dict(*, include_embedding=False)[source]#
Serialise to a plain Python dictionary.
- Parameters:
- include_embedding (bool, optional)
When True, include the embedding field serialised as a flat list of floats (if present). Default: False; embeddings are excluded to keep dicts JSON-safe by default.
- Returns:
- dict
Shallow copy of all fields. Enum fields serialised as string values. bbox serialised as a list (JSON-compatible). metadata is a shallow copy.
- Parameters:
include_embedding (bool)
Notes
This method does not call validate; it is designed to be fast and usable even on partially-constructed documents during debugging.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> isinstance(d["section_type"], str)
True
>>> d["source_type"]
'unknown'
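The conversions described for to_dict and from_dict (enum to string, bbox tuple to list, and back) can be sketched on three fields. The enum and both helper functions below are hypothetical stand-ins chosen for illustration; the library's serialisation covers many more fields and may differ in detail.

```python
import json
from enum import Enum


class SectionType(str, Enum):
    """Hypothetical stand-in for the library's SectionType enum."""

    TEXT = "text"


def sketch_to_dict(section_type, bbox, metadata):
    """Serialise three fields into JSON-compatible values."""
    return {
        "section_type": section_type.value,  # enum -> string value
        "bbox": list(bbox) if bbox is not None else None,  # tuple -> list
        "metadata": dict(metadata),  # shallow copy
    }


def sketch_from_dict(data):
    """Reverse sketch_to_dict; a missing metadata key becomes an empty dict."""
    bbox = data.get("bbox")
    return (
        SectionType(data["section_type"]),  # string -> enum
        tuple(bbox) if bbox is not None else None,  # list -> tuple
        dict(data.get("metadata") or {}),
    )


fields = (SectionType.TEXT, (0.0, 1.0, 2.0, 3.0), {"speaker": "narrator"})
payload = json.loads(json.dumps(sketch_to_dict(*fields)))
print(sketch_from_dict(payload) == fields)
```

Passing the dict through json.dumps and json.loads in the usage lines checks that every value is JSON-safe, which is the stated reason for turning bbox into a list.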
- to_flat_dict(*, include_embedding=False)[source]#
Serialise to a flat dictionary with metadata fields promoted to the top level.
Unlike to_dict, the metadata sub-dict is merged into the top level. Core fields take precedence over metadata fields with the same key name.
- Parameters:
- include_embedding (bool, optional)
When True, include embedding as a list of floats. Default: False.
- Returns:
- dict
Flat dict suitable for a single row in a tabular export.
- Parameters:
include_embedding (bool)
Notes
Metadata key collisions with core fields are logged as warnings.
Examples
>>> doc = CorpusDocument.create(
...     "f.txt", 0, "Hello.", metadata={"custom_key": "v"}
... )
>>> flat = doc.to_flat_dict()
>>> flat["custom_key"]
'v'
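The precedence rule stated above (core fields win over metadata keys of the same name, and collisions are logged) can be sketched as a plain dict merge. sketch_flatten and the logger setup are hypothetical, and the warning text is illustrative rather than the library's.

```python
import logging

logger = logging.getLogger(__name__)


def sketch_flatten(core, metadata):
    """Merge metadata into the top level; core fields win on key clashes."""
    collisions = sorted(set(core) & set(metadata))
    if collisions:
        logger.warning("metadata keys shadowed by core fields: %s", collisions)
    # Later entries override earlier ones, so core is unpacked last.
    return {**metadata, **core}


flat = sketch_flatten(
    {"doc_id": "abc", "text": "Hello."},
    {"custom_key": "v", "text": "shadowed"},
)
print(flat["text"], flat["custom_key"])
```

Unpacking core last is what gives it precedence; reversing the order would let an ad-hoc metadata key silently overwrite a typed column.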
- to_pandas_row(*, include_embedding=False)[source]#
Return a dict formatted for a single row in a pandas.DataFrame.
- Parameters:
- include_embedding (bool, optional)
When True, include the embedding as a numpy array (not a list), allowing pandas to store it as an object column. Default: False.
- Returns:
- dict
Row dict with enums as strings. Embedding kept as-is when present and include_embedding=True.
- Parameters:
include_embedding (bool)
Examples
>>> import pandas as pd
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> row = doc.to_pandas_row()
>>> pd.DataFrame([row])["text"][0]
'Hello.'
- to_polars_row(*, include_embedding=False)[source]#
Return a dict formatted for a single row in a polars.DataFrame.
- Parameters:
- include_embedding (bool, optional)
When True, include the embedding as a list of floats (polars does not accept numpy arrays directly in dict-based construction). Default: False.
- Returns:
- dict
Row dict. Embedding serialised as list[float] when present.
- Parameters:
include_embedding (bool)
Examples
>>> import polars as pl
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> pl.DataFrame([doc.to_polars_row()])["text"][0]
'Hello.'
- validate()[source]#
Assert that all invariants hold. Raises on the first violation.
- Raises:
- ValueError
With an actionable message identifying the violated invariant and the offending value.
- Warns:
- UserWarning
When doi does not match the 10.XXXX/ prefix pattern. A warning (not a raise) is used because real-world DOIs are not always well-formed, and hard rejection would discard valid papers.
- Return type:
None
Notes
Call validate() explicitly after constructing a document via the dataclass constructor. The create factory calls it automatically.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello world.")
>>> doc.validate()  # no exception
>>> bad = CorpusDocument(
...     doc_id="", input_path="f.txt", chunk_index=0, text="Hello."
... )
>>> bad.validate()
Traceback (most recent call last):
    ...
ValueError: CorpusDocument.doc_id must be a non-empty string; got ''
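The raise-versus-warn split described for validate can be sketched as a standalone function covering a few of the invariants listed in the parameter descriptions. Everything here is an assumption made for illustration: the function name, the order of the checks, the message wording, and the reading of "10.XXXX/" as "10." followed by at least four digits and a slash.

```python
import re
import warnings

# Assumption: "10.XXXX/" means "10." + four or more digits + "/".
_DOI_PREFIX = re.compile(r"^10\.\d{4,}/")


def sketch_validate(doc_id, input_path, chunk_index, *, timecode_start=None,
                    timecode_end=None, confidence=None, bbox=None, doi=None):
    """Raise ValueError on the first violated invariant; warn on odd DOIs."""
    if not isinstance(doc_id, str) or not doc_id:
        raise ValueError(f"doc_id must be a non-empty string; got {doc_id!r}")
    if not isinstance(input_path, str) or not input_path:
        raise ValueError(f"input_path must be a non-empty string; got {input_path!r}")
    if chunk_index < 0:
        raise ValueError(f"chunk_index must be >= 0; got {chunk_index!r}")
    if timecode_start is not None and timecode_start < 0:
        raise ValueError(f"timecode_start must be >= 0; got {timecode_start!r}")
    if (timecode_start is not None and timecode_end is not None
            and timecode_end < timecode_start):
        raise ValueError(f"timecode_end must be >= timecode_start; got {timecode_end!r}")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0]; got {confidence!r}")
    if bbox is not None and len(bbox) != 4:
        raise ValueError(f"bbox must be a 4-tuple; got {bbox!r}")
    if doi is not None and not _DOI_PREFIX.match(doi):
        # Malformed DOIs are reported but do not reject the document.
        warnings.warn(f"doi does not match the 10.XXXX/ prefix: {doi!r}",
                      UserWarning, stacklevel=2)


sketch_validate("abc", "f.txt", 0, confidence=0.9, doi="10.1000/xyz")
```

Structural invariants fail fast with an exception, while the DOI check only warns, so a document with an unusual DOI still passes.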
Gallery examples#
corpus WHO European Region local or url per file with examples