CorpusDocument#
- class scikitplot.corpus.CorpusDocument(doc_id, input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.NONE, language=None, char_start=None, char_end=None, embedding=None, modality=<factory>, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None, metadata=<factory>, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, raw_text=None, normalized_text=None, tokens=None, lemmas=None, stems=None, keywords=None, script=None, script_direction=None, grapheme_count=None, codepoint_count=None, is_mixed_script=None, script_spans=None, chunking_unit=None, semanteme_count=None, morphemes=None, determinative_groups=None, script_model_version=None)[source]#
Canonical representation of a single text chunk in a processed corpus.
A CorpusDocument is the unit of data that flows between every stage of the pipeline: readers produce them, chunkers subdivide them, filters accept or reject them, embedders enrich them, and exporters serialise them.
- Parameters:
- doc_id : str
Stable 16-character hex identifier. Generated deterministically from (source_type, input_path, chunk_index, text[:64]) via make_doc_id if not supplied. Must be non-empty.
- input_path : str
Name or relative path of the original source file. Must be non-empty. Set from input_path.name by readers; do not include absolute paths, so that corpora stay portable across machines.
- chunk_index : int
Zero-based ordinal of this chunk within the source document. Must be >= 0. Unique per (input_path, chunking_strategy) pair.
- text : str
Cleaned, segmented text content of this chunk. For TEXT-modality documents it must be non-empty after stripping whitespace; raw-media documents may leave it as None (see REQUIRED_FIELDS).
- section_type : SectionType, optional
Semantic role of this chunk within its source document. Default: SectionType.TEXT.
- chunking_strategy : ChunkingStrategy, optional
Strategy used to produce this chunk. Default: ChunkingStrategy.NONE (whole document, no splitting applied).
- language : str or None, optional
ISO 639-1 language code. Default: None.
- char_start : int or None, optional
Character offset of chunk start within the original document. Default: None.
- char_end : int or None, optional
Character offset of chunk end (exclusive). Default: None.
- embedding : array-like or None, optional
Dense vector representation of text. Stored as Any at runtime; the .pyi stub provides NDArray[float32] for type checkers. Default: None.
- metadata : dict, optional
Open-ended key-value store for truly ad-hoc or format-specific fields (ISBN edition, translator, speaker, etc.). All keys must be strings. Default: empty dict.
- source_type : SourceType, optional
Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …). Used as a typed pre-filter column. Default: SourceType.UNKNOWN.
- source_title : str or None, optional
Title of the source work. Default: None.
- source_author : str or None, optional
Primary author. Default: None.
- source_date : str or None, optional
Publication date in ISO 8601 format. Default: None.
- collection_id : str or None, optional
Identifier grouping related sources into one corpus. Default: None.
- url : str or None, optional
Source URL for web-fetched documents. Default: None.
- doi : str or None, optional
Digital Object Identifier. Default: None.
- isbn : str or None, optional
International Standard Book Number. Default: None.
- page_number : int or None, optional
Zero-based page index. Default: None.
- paragraph_index : int or None, optional
Zero-based paragraph index within the page or document. Default: None.
- line_number : int or None, optional
Zero-based line number. Default: None.
- parent_doc_id : str or None, optional
doc_id of the parent chunk when this is a sub-division. Default: None.
- act : int or None, optional
Act number (one-based) in a dramatic source. Default: None.
- scene_number : int or None, optional
Scene number (one-based) within an act. Default: None.
- timecode_start : float or None, optional
Start timecode in seconds (>= 0). Default: None.
- timecode_end : float or None, optional
End timecode in seconds (>= timecode_start). Default: None.
- confidence : float or None, optional
OCR or ASR confidence in [0.0, 1.0]. Default: None.
- ocr_engine : str or None, optional
Name of the OCR engine used. Default: None.
- bbox : tuple of float or None, optional
Bounding box (x0, y0, x1, y1). Must be a 4-tuple of floats. Default: None.
- normalized_text : str or None, optional
Normalised text used by the embedding engine. Default: None.
- tokens : list of str or None, optional
Tokenised word list (not included in repr or equality). Default: None.
- lemmas : list of str or None, optional
Lemmatised tokens (not included in repr or equality). Default: None.
- stems : list of str or None, optional
Stemmed tokens (not included in repr or equality). Default: None.
- keywords : list of str or None, optional
Extracted keyphrases (not included in repr or equality). Default: None.
- Attributes:
- REQUIRED_FIELDS : tuple of str
Class-level tuple of field names that must be non-empty strings for validate to pass.
- Raises:
- ValueError
If validate is called and any invariant is violated.
- Parameters:
doc_id (str)
input_path (str)
chunk_index (int)
text (str)
section_type (SectionType)
chunking_strategy (ChunkingStrategy)
language (str | None)
char_start (int | None)
char_end (int | None)
embedding (Any | None)
modality (Modality)
raw_bytes (bytes | None)
raw_tensor (Any)
raw_dtype (str | None)
frame_index (int | None)
content_hash (str | None)
source_type (SourceType)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
url (str | None)
doi (str | None)
isbn (str | None)
page_number (int | None)
paragraph_index (int | None)
line_number (int | None)
parent_doc_id (str | None)
act (int | None)
scene_number (int | None)
timecode_start (float | None)
timecode_end (float | None)
confidence (float | None)
ocr_engine (str | None)
raw_text (str | None)
normalized_text (str | None)
script (str | None)
script_direction (str | None)
grapheme_count (int | None)
codepoint_count (int | None)
is_mixed_script (bool | None)
script_spans (list | None)
chunking_unit (str | None)
semanteme_count (int | None)
determinative_groups (list | None)
script_model_version (str | None)
See also
scikitplot.corpus._base.DocumentReader
Produces CorpusDocuments.
scikitplot.corpus._pipeline.CorpusPipeline
Orchestrates the full flow.
Notes
Immutability convention: CorpusDocument is a mutable dataclass for performance, but pipeline stages must not mutate documents in-place after yielding them. Use replace to create modified copies.
Embedding storage: When exporting to CSV or JSON, the embedding array is serialised as a flat list of floats. When exporting to Parquet or HuggingFace format, the array is stored natively.
NLP list fields (tokens, lemmas, stems, keywords) are excluded from __repr__ and equality comparisons because they are large derived views of text.
Examples
Creating from factory with auto-generated id:
>>> doc = CorpusDocument.create(
...     input_path="corpus.xml",
...     chunk_index=3,
...     text="Das Kapital ist ein Werk von Marx.",
...     source_type=SourceType.BOOK,
...     source_author="Marx, Karl",
...     source_title="Das Kapital",
...     language="de",
...     page_number=42,
... )
>>> len(doc.doc_id)
16
Round-tripping to dict and back:
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
- REQUIRED_FIELDS: ClassVar[tuple[str, ...]] = ('doc_id', 'input_path')#
Fields that must be non-empty strings for validate to pass.
Notes
text is intentionally excluded from this tuple. For TEXT-modality documents, validate() enforces non-empty text directly. For raw-media documents (modality is IMAGE, AUDIO, or VIDEO), text may legitimately be None; the document carries its content in raw_tensor or raw_bytes instead.
- bbox: tuple[float, float, float, float] | None = None#
Bounding box (x0, y0, x1, y1) of the text region in page coordinates.
All four values are floats. Invariants enforced by validate: x0 < x1 (non-zero width) and y0 < y1 (non-zero height). None for documents without a spatial layout (plain text, audio, etc.).
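The two bbox invariants can be sketched in plain Python. This is a minimal illustration only: check_bbox is a hypothetical helper, not part of scikitplot, and the error wording used by the real validate may differ.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def check_bbox(bbox: Optional[BBox]) -> None:
    """Raise ValueError if bbox breaks the documented invariants.

    Hypothetical helper; it mirrors the rules stated above, not the
    library's actual implementation.
    """
    if bbox is None:
        return  # no spatial layout (plain text, audio, etc.)
    if len(bbox) != 4:
        raise ValueError(f"bbox must have exactly 4 values; got {len(bbox)}")
    x0, y0, x1, y1 = bbox
    if not x0 < x1:
        raise ValueError(f"bbox needs x0 < x1 (non-zero width); got {bbox!r}")
    if not y0 < y1:
        raise ValueError(f"bbox needs y0 < y1 (non-zero height); got {bbox!r}")


check_bbox((0.0, 0.0, 10.0, 5.0))  # passes silently
```

A zero-width box such as (5.0, 0.0, 5.0, 1.0) is rejected under these rules because x0 is not strictly less than x1.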
- chunking_strategy: ChunkingStrategy = 'none'[source]#
Strategy used to produce this chunk. Default: ChunkingStrategy.NONE.
Set explicitly by chunkers when they produce sub-chunks from a raw document. NONE means no segmentation was applied: the whole document is one chunk. Chunkers must always override this to their corresponding strategy value so that downstream consumers can reproduce or verify segmentation.
- chunking_unit: str | None = None#
Granularity at which this chunk was produced.
One of "sentence", "paragraph", "word", "grapheme_cluster", "semanteme", "morpheme", "character", "fixed_window". None for legacy chunks produced before this field was introduced.
- codepoint_count: int | None = None#
Number of Unicode codepoints in text.
Equal to len(text). Stored explicitly so downstream consumers can compare grapheme vs. codepoint lengths without re-reading the text. None if not computed.
- classmethod create(input_path, chunk_index, text, section_type=SectionType.TEXT, chunking_strategy=ChunkingStrategy.NONE, language=None, char_start=None, char_end=None, embedding=None, metadata=None, doc_id=None, source_type=SourceType.UNKNOWN, source_title=None, source_author=None, source_date=None, collection_id=None, url=None, doi=None, isbn=None, page_number=None, paragraph_index=None, line_number=None, parent_doc_id=None, act=None, scene_number=None, timecode_start=None, timecode_end=None, confidence=None, ocr_engine=None, bbox=None, normalized_text=None, raw_text=None, tokens=None, lemmas=None, stems=None, keywords=None, modality=None, raw_bytes=None, raw_tensor=None, raw_shape=None, raw_dtype=None, frame_index=None, content_hash=None, script=None, script_direction=None, grapheme_count=None, codepoint_count=None, is_mixed_script=None, script_spans=None, chunking_unit=None, semanteme_count=None, morphemes=None, determinative_groups=None, script_model_version=None)[source]#
Validated factory constructor for CorpusDocument.
Preferred over direct dataclass instantiation because it auto-generates doc_id when not supplied and calls validate before returning.
- Parameters:
- input_path : str
Name of the source file.
- chunk_index : int
Zero-based chunk position.
- text : str
Text content of the chunk.
- section_type : SectionType, optional
Semantic section label. Default: SectionType.TEXT.
- chunking_strategy : ChunkingStrategy, optional
Segmentation strategy used. Default: ChunkingStrategy.NONE.
- language : str or None, optional
ISO 639-1 language code. Default: None.
- char_start : int or None, optional
Character start offset. Default: None.
- char_end : int or None, optional
Character end offset (exclusive). Default: None.
- embedding : array-like or None, optional
Pre-computed embedding vector. Default: None.
- metadata : dict or None, optional
Ad-hoc metadata. None is treated as empty dict. Default: None.
- doc_id : str or None, optional
Explicit document id. Auto-generated if None. Default: None.
- source_type : SourceType, optional
Kind of source. Default: SourceType.UNKNOWN.
- source_title : str or None, optional
Title of the source work. Default: None.
- source_author : str or None, optional
Primary author. Default: None.
- source_date : str or None, optional
Publication date (ISO 8601). Default: None.
- collection_id : str or None, optional
Corpus collection identifier. Default: None.
- url : str or None, optional
Source URL. Default: None.
- doi : str or None, optional
Digital Object Identifier. Default: None.
- isbn : str or None, optional
International Standard Book Number. Default: None.
- page_number : int or None, optional
Zero-based page index. Default: None.
- paragraph_index : int or None, optional
Zero-based paragraph index. Default: None.
- line_number : int or None, optional
Zero-based line number. Default: None.
- parent_doc_id : str or None, optional
doc_id of parent chunk. Default: None.
- act : int or None, optional
Act number (one-based). Default: None.
- scene_number : int or None, optional
Scene number (one-based). Default: None.
- timecode_start : float or None, optional
Start timecode in seconds (>= 0). Default: None.
- timecode_end : float or None, optional
End timecode in seconds. Default: None.
- confidence : float or None, optional
OCR/ASR confidence in [0.0, 1.0]. Default: None.
- ocr_engine : str or None, optional
OCR engine name. Default: None.
- bbox : tuple of float or None, optional
Bounding box (x0, y0, x1, y1). Default: None.
- normalized_text : str or None, optional
Pre-normalised text. Default: None.
- tokens : list of str or None, optional
Tokenised words. Default: None.
- lemmas : list of str or None, optional
Lemmatised tokens. Default: None.
- stems : list of str or None, optional
Stemmed tokens. Default: None.
- keywords : list of str or None, optional
Extracted keyphrases. Default: None.
- Returns:
- CorpusDocument
Validated document instance.
- Raises:
- ValueError
If any invariant from validate is violated.
- Parameters:
input_path (str)
chunk_index (int)
text (str | None)
section_type (SectionType)
chunking_strategy (ChunkingStrategy)
language (str | None)
char_start (int | None)
char_end (int | None)
embedding (Any | None)
doc_id (str | None)
source_type (SourceType)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
url (str | None)
doi (str | None)
isbn (str | None)
page_number (int | None)
paragraph_index (int | None)
line_number (int | None)
parent_doc_id (str | None)
act (int | None)
scene_number (int | None)
timecode_start (float | None)
timecode_end (float | None)
confidence (float | None)
ocr_engine (str | None)
normalized_text (str | None)
raw_text (str | None)
modality (Modality | None)
raw_bytes (bytes | None)
raw_tensor (Any)
raw_dtype (str | None)
frame_index (int | None)
content_hash (str | None)
script (str | None)
script_direction (str | None)
grapheme_count (int | None)
codepoint_count (int | None)
is_mixed_script (bool | None)
script_spans (list | None)
chunking_unit (str | None)
semanteme_count (int | None)
determinative_groups (list | None)
script_model_version (str | None)
- Return type:
CorpusDocument
Examples
>>> doc = CorpusDocument.create(
...     input_path="corpus.txt",
...     chunk_index=0,
...     text="Hello world.",
...     source_type=SourceType.BOOK,
...     language="en",
... )
>>> doc.validate()
>>> doc.has_embedding
False
- determinative_groups: list | None = None#
For Egyptian hieroglyphic chunks: list of determinative group dicts.
Each element is a dict:
    {
        "glyphs": str,         # raw glyph codepoints
        "determinative": str,  # semantic category glyph
        "category": str,       # human-readable category label
    }
None for all non-hieroglyphic scripts.
- frame_index: int | None = None#
Zero-based frame index in a video or multi-frame image. Default: None.
- classmethod from_dict(data)[source]#
Reconstruct a CorpusDocument from a plain dictionary.
- Parameters:
- data : dict
Dictionary as returned by to_dict. Enum fields are coerced from string values. bbox is restored from list to tuple. metadata defaults to empty dict if absent.
- Returns:
- CorpusDocument
Validated reconstructed document.
- Raises:
- ValueError
If required fields are missing or values are invalid.
- Parameters:
data (dict)
- Return type:
Self
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> restored = CorpusDocument.from_dict(d)
>>> restored.doc_id == doc.doc_id
True
- grapheme_count: int | None = None#
Number of grapheme clusters in text.
This is the correct user-perceived character count as defined by Unicode UAX #29. Always <= len(text) because each grapheme cluster is at least one codepoint. None if GraphemeClusterNormalizer has not been applied.
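The gap between codepoint_count and grapheme_count can be illustrated with the standard library alone. The sketch below is an approximation, not UAX #29 segmentation and not what GraphemeClusterNormalizer does: approx_grapheme_count is a hypothetical helper that only skips combining marks, so it miscounts emoji ZWJ sequences, Hangul jamo, and similar cases.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Count codepoints that are not combining marks.

    Rough stand-in for UAX #29 grapheme clustering (hypothetical helper).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# 'e' + COMBINING ACUTE ACCENT, then 't', then precomposed U+00E9
text = "e\u0301t\u00e9"

codepoint_count = len(text)                    # 4 codepoints
grapheme_count = approx_grapheme_count(text)   # 3 user-perceived characters
```

Here the first user-perceived character occupies two codepoints, which is why grapheme_count can never exceed codepoint_count.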
- property has_embedding: bool#
Return True if an embedding has been attached to this document.
- Returns:
- bool
True when embedding is not None.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> doc.has_embedding
False
- is_mixed_script: bool | None = None#
True if the chunk contains codepoints from more than one Unicode script block above a noise threshold. None if not analysed.
- static make_content_hash(text=None, raw_bytes=None)[source]#
Compute a 32-character hex prefix of a SHA-256 digest for deduplication.
- Parameters:
- text : str or None
Text content. Used when raw_bytes is None.
- raw_bytes : bytes or None
Raw media bytes. Preferred over text when set.
- Returns:
- str
32-character hex SHA-256 prefix.
- Parameters:
text (str | None)
raw_bytes (bytes | None)
- Return type:
str
Notes
Empty / None inputs return a fixed sentinel value "0" * 32 (32 zeros) to ensure content_hash is always populated and the dedup logic is deterministic.
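The behaviour described above can be approximated with hashlib. This is a sketch under stated assumptions, not the library's code: content_hash_sketch is a hypothetical name, the UTF-8 encoding of text is assumed, and whether the real method treats empty bytes exactly like None is not confirmed by this page.

```python
import hashlib
from typing import Optional

SENTINEL = "0" * 32  # documented fallback for empty / None inputs


def content_hash_sketch(
    text: Optional[str] = None, raw_bytes: Optional[bytes] = None
) -> str:
    """Return the first 32 hex characters of a SHA-256 digest (hypothetical helper)."""
    if raw_bytes:
        payload = raw_bytes              # raw media bytes win over text
    elif text:
        payload = text.encode("utf-8")   # assumption: UTF-8 encoding
    else:
        return SENTINEL
    return hashlib.sha256(payload).hexdigest()[:32]


h = content_hash_sketch("Hello world.")
```

A full SHA-256 hex digest is 64 characters long, so the 32-character value is a truncation of it.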
- classmethod make_doc_id(input_path, chunk_index, text, source_type=SourceType.UNKNOWN)[source]#
Compute a deterministic 16-character hex document identifier.
The id is a SHA-1 prefix of "{source_type}:{input_path}:{chunk_index}:{text[:64]}". Identical inputs always produce the same id.
- Parameters:
- input_path : str
Name of the source file (not a full path).
- chunk_index : int
Zero-based chunk position within the document.
- text : str
Raw text content of the chunk (only the first 64 characters are used to keep hashing fast).
- source_type : SourceType, optional
Source kind. Including this in the hash preimage prevents collisions when a BOOK chapter and a MOVIE subtitle share the same filename, chunk index, and opening text (Issue S-7). Default: SourceType.UNKNOWN.
- Returns:
- str
16-character lowercase hexadecimal string.
- Parameters:
input_path (str)
chunk_index (int)
text (str)
source_type (SourceType)
- Return type:
str
Notes
Adding source_type to the hash preimage is a one-time breaking change for corpora built before this version. Existing corpora must be re-indexed when upgrading.
Examples
>>> CorpusDocument.make_doc_id("file.txt", 0, "Hello world.")
'...'  # deterministic 16-char hex
>>> (
...     CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.BOOK)
...     != CorpusDocument.make_doc_id("f.txt", 0, "Hi", SourceType.MOVIE)
... )
True
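The preimage format above can be reproduced with hashlib to see why ids are deterministic and why source_type matters. This is a sketch only: doc_id_sketch is a hypothetical function, source_type is passed as a plain string, and the UTF-8 encoding and the way the real method renders the enum inside the preimage are assumptions, so its output is not expected to match make_doc_id.

```python
import hashlib


def doc_id_sketch(
    input_path: str, chunk_index: int, text: str, source_type: str = "unknown"
) -> str:
    """Return a 16-character SHA-1 hex prefix of the documented preimage."""
    # Only the first 64 characters of the text enter the hash.
    preimage = f"{source_type}:{input_path}:{chunk_index}:{text[:64]}"
    return hashlib.sha1(preimage.encode("utf-8")).hexdigest()[:16]


book_id = doc_id_sketch("f.txt", 0, "Hi", "book")
movie_id = doc_id_sketch("f.txt", 0, "Hi", "movie")
```

One consequence of the text[:64] truncation is that two chunks with the same file name, index, and source type that differ only after their first 64 characters receive the same id.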
- morphemes: list[str] | None = None#
Morpheme list if the MORPHOLOGICAL or HYBRID backend was used.
Excluded from repr and equality comparisons (like tokens / lemmas). None if semantic chunking was not applied or a non-morphological backend was selected.
- raw_shape: tuple[int, ...] | None = None#
Shape of raw_tensor as a plain Python tuple. Default: None.
- raw_tensor: Any = None#
Decoded media array ready for model input. Shape conventions: image (H, W, C) uint8; audio (samples,) float32; video (T, H, W, C) uint8. None for text-only.
- raw_text: str | None = None#
Verbatim source text before any normalisation or NLP processing.
Populated by every reader in the corpus pipeline so that the before/after transformation can always be compared at the document level:
- ImageReader: exact Tesseract / easyocr output before any chunker or NLP step.
- AudioReader: pre-LRC-inline-tag-strip or pre-VTT-HTML-strip cue text; verbatim Whisper/NeMo ASR output; classifier label text (no pre-processing, equals text).
- VideoReader: pre-HTML-strip SRT/SBV/VTT cue text; verbatim Whisper ASR output.
- PDFReader: backend extraction result before .strip() (preserves original page boundary whitespace).
- TextReader: full file content as read; no pre-processing occurs, so raw_text == text.
- XMLReader / TEIReader: itertext() join before _WS_RE whitespace collapsing.
- ALTOReader: verbatim ALTO CONTENT attribute tokens; no additional normalisation, so raw_text == text.
- WebReader: inner HTML of the matched element (tags included) before get_text() strips them.
- YouTubeReader: pre-HTML-strip cue text from the transcript API (may contain <c> tags or HTML entities).
Use this field to compare what each reader returned against:
- text: the chunked form (post-chunker, no NLP)
- normalized_text: the post-TextNormalizer form used for embedding
Three-tier comparison for quality audit:
    raw_text        → verbatim reader output before any cleaning
    text            → cleaned / chunked form
    normalized_text → NFKC + ligature expansion + hyphen-join + whitespace collapse
Notes
For multilingual images, accuracy requires Tesseract to be invoked with the correct ocr_lang string (e.g. "eng+deu+ara+heb+tur+ell"). With ocr_lang=None (the default), Tesseract uses English-only and silently transliterates Arabic / Hebrew / Greek glyphs into Latin lookalikes. raw_text then reflects that garbled output, NOT the original script; choosing the right ocr_lang is the responsibility of the pipeline caller, not of this field.
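The three tiers can be made concrete with a small standard-library sketch. It is illustrative only: normalize_sketch is a hypothetical function, not the library's TextNormalizer, the hyphen-join regex is an assumption about what "hyphen-join" means, and using .strip() as the cleaning step for text is a simplification.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    """NFKC, hyphen-join across line breaks, then whitespace collapse."""
    # NFKC also expands compatibility ligatures such as U+FB01 to "fi"
    # and maps U+00A0 (no-break space) to a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Re-join a word that was hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()


raw_text = "  The \ufb01rst exam-\nple\u00a0 text.  "  # verbatim reader output
text = raw_text.strip()                                 # cleaned form (simplified)
normalized_text = normalize_sketch(text)                # 'The first example text.'
```

Comparing the three values side by side shows which stage changed what: whitespace trimming happens between raw_text and text, while ligature expansion and hyphen joining only appear in normalized_text.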
- replace(**changes)[source]#
Return a new CorpusDocument with the specified fields replaced.
- Parameters:
- **changes : Any
Field names and new values. Only fields defined on CorpusDocument are accepted.
- Returns:
- CorpusDocument
New instance with changed fields; original is unchanged.
- Raises:
- ValueError
If an unknown field name is given.
- Parameters:
changes (Any)
- Return type:
Self
Examples
>>> import numpy as np
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> enriched = doc.replace(embedding=np.zeros(768, dtype=np.float32))
>>> enriched.has_embedding
True
>>> doc.has_embedding  # original unchanged
False
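The copy-on-change pattern behind replace can be sketched on a small stand-in dataclass. MiniDoc below is hypothetical and has only three fields; it shows one plausible way to reject unknown field names with ValueError on top of dataclasses.replace, and is not taken from the scikitplot source.

```python
import dataclasses
from typing import Any, Optional


@dataclasses.dataclass
class MiniDoc:
    """Hypothetical three-field stand-in for CorpusDocument."""

    doc_id: str
    text: str
    embedding: Optional[Any] = None

    def replace(self, **changes: Any) -> "MiniDoc":
        # Reject names that are not dataclass fields before copying.
        known = {f.name for f in dataclasses.fields(self)}
        unknown = sorted(set(changes) - known)
        if unknown:
            raise ValueError(f"unknown field(s): {unknown}")
        # dataclasses.replace builds a new instance; self is untouched.
        return dataclasses.replace(self, **changes)


doc = MiniDoc(doc_id="abc", text="Hello.")
enriched = doc.replace(embedding=[0.0, 0.0, 0.0])
```

After these lines doc.embedding is still None while enriched.embedding holds the vector, which is the behaviour the immutability convention relies on.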
- script: str | None = None#
Dominant script of this chunk.
Set to a ScriptType value string (e.g. "latin", "arabic", "han"). None if the chunker was script-unaware or no script was detected.
- script_direction: str | None = None#
Writing direction of the dominant script.
One of "ltr" (left-to-right), "rtl" (right-to-left), or "ttb" (top-to-bottom, traditional Mongolian). None if not detected.
- script_model_version: str | None = None#
Version of the embedding or dictionary model used during semantic chunking.
Required for idempotency verification on pipeline re-runs. Format: "<model_name>@<version>", e.g. "paraphrase-multilingual-mpnet-base-v2@1.2.0". None when the MORPHOLOGICAL backend was used (always idempotent) or when semantic chunking was not applied.
- script_spans: list | None = None#
For mixed-script chunks: list of ScriptSpan dicts.
Each element is a dict:
    {
        "text": str,       # span text (NFC)
        "script": str,     # ScriptType value string
        "direction": str,  # "ltr" | "rtl" | "ttb"
        "start": int,      # grapheme cluster index (inclusive)
        "end": int,        # grapheme cluster index (exclusive)
    }
Integer indices refer to the grapheme cluster list produced by GraphemeClusterNormalizer. None for single-script chunks or when script analysis was skipped.
- section_type: SectionType = 'text'[source]#
Semantic role of this chunk.
- semanteme_count: int | None = None#
Number of semantemes identified in this chunk.
Set by SemanticChunker only. None if semantic chunking was not used.
- source_type: SourceType = 'unknown'[source]#
Kind of source (BOOK, MOVIE, RESEARCH, WIKI, …).
- timecode_start: float | None = None#
Start timecode in seconds for subtitle / video / audio sources.
- to_dict(*, include_embedding=False)[source]#
Serialise to a plain Python dictionary.
- Parameters:
- include_embedding : bool, optional
When True, include the embedding field serialised as a flat list of floats (if present). Default: False; embeddings are excluded to keep dicts JSON-safe by default.
- Returns:
- dict
Shallow copy of all fields. Enum fields serialised as string values. bbox serialised as a list (JSON-compatible). metadata is a shallow copy.
- Parameters:
include_embedding (bool)
- Return type:
dict
Notes
This method does not call validate; it is designed to be fast and usable even on partially-constructed documents during debugging.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> d = doc.to_dict()
>>> isinstance(d["section_type"], str)
True
>>> d["source_type"]
'unknown'
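The serialisation rules listed under Returns can be sketched on a stand-in dataclass. Everything named here (Kind, MiniDoc, to_dict_sketch) is hypothetical, and dropping the embedding key entirely when include_embedding is False is an assumption; the real method may emit the key with a None value instead.

```python
import dataclasses
import enum
import json
from typing import Any, Dict, Optional, Sequence, Tuple


class Kind(enum.Enum):
    """Hypothetical stand-in for SourceType."""

    BOOK = "book"
    UNKNOWN = "unknown"


@dataclasses.dataclass
class MiniDoc:
    """Hypothetical stand-in for CorpusDocument."""

    text: str
    source_type: Kind = Kind.UNKNOWN
    bbox: Optional[Tuple[float, float, float, float]] = None
    metadata: Dict[str, Any] = dataclasses.field(default_factory=dict)
    embedding: Optional[Sequence[float]] = None


def to_dict_sketch(doc: MiniDoc, include_embedding: bool = False) -> Dict[str, Any]:
    out: Dict[str, Any] = {}
    for f in dataclasses.fields(doc):
        value = getattr(doc, f.name)
        if f.name == "embedding":
            if not include_embedding:
                continue  # assumption: the key is left out entirely
            value = None if value is None else [float(x) for x in value]
        elif isinstance(value, enum.Enum):
            value = value.value   # enum -> its string value
        elif isinstance(value, tuple):
            value = list(value)   # bbox tuple -> JSON-compatible list
        elif isinstance(value, dict):
            value = dict(value)   # shallow copy of metadata
        out[f.name] = value
    return out


doc = MiniDoc("Hello.", Kind.BOOK, bbox=(0.0, 0.0, 10.0, 5.0), metadata={"k": "v"})
d = to_dict_sketch(doc)
payload = json.dumps(d)  # succeeds because every value is JSON-safe
```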
- to_flat_dict(*, include_embedding=False)[source]#
Serialise to a flat dictionary with metadata fields promoted to the top level.
Unlike to_dict, the metadata sub-dict is merged into the top level. Core fields take precedence over metadata fields with the same key name.
- Parameters:
- include_embedding : bool, optional
When True, include embedding as a list of floats. Default: False.
- Returns:
- dict
Flat dict suitable for a single row in a tabular export.
- Parameters:
include_embedding (bool)
- Return type:
dict
Notes
Metadata key collisions with core fields are logged as warnings.
Examples
>>> doc = CorpusDocument.create(
...     "f.txt", 0, "Hello.", metadata={"custom_key": "v"}
... )
>>> flat = doc.to_flat_dict()
>>> flat["custom_key"]
'v'
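The precedence rule (core fields win, collisions are logged) can be sketched with two plain dicts. flatten_sketch is a hypothetical helper, and the use of the logging module and the message wording are assumptions about how the warning is emitted.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def flatten_sketch(core: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Merge metadata into the top level; core fields take precedence."""
    for key in sorted(set(metadata) & set(core)):
        logger.warning("metadata key %r collides with a core field; core value kept", key)
    flat = dict(metadata)  # start from metadata ...
    flat.update(core)      # ... then let core fields overwrite collisions
    return flat


flat = flatten_sketch(
    core={"doc_id": "abc", "text": "Hello."},
    metadata={"custom_key": "v", "text": "ignored"},
)
```

In this example flat["text"] stays "Hello." and the colliding metadata value is dropped, while custom_key is promoted to the top level.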
- to_pandas_row(*, include_embedding=False)[source]#
Return a dict formatted for a single row in a pandas.DataFrame.
- Parameters:
- include_embedding : bool, optional
When True, include the embedding as a numpy array (not a list), allowing pandas to store it as an object column. Default: False.
- Returns:
- dict
Row dict with enums as strings. Embedding kept as-is when present and include_embedding=True.
- Parameters:
include_embedding (bool)
- Return type:
dict
Examples
>>> import pandas as pd
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> row = doc.to_pandas_row()
>>> pd.DataFrame([row])["text"][0]
'Hello.'
- to_polars_row(*, include_embedding=False)[source]#
Return a dict formatted for a single row in a polars.DataFrame.
- Parameters:
- include_embedding : bool, optional
When True, include the embedding as a list of floats (polars does not accept numpy arrays directly in dict-based construction). Default: False.
- Returns:
- dict
Row dict. Embedding serialised as list[float] when present.
- Parameters:
include_embedding (bool)
- Return type:
dict
Examples
>>> import polars as pl
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.")
>>> pl.DataFrame([doc.to_polars_row()])["text"][0]
'Hello.'
- validate()[source]#
Assert that all invariants hold. Raises on the first violation.
- Raises:
- ValueError
With an actionable message identifying the violated invariant and the offending value.
- Warns:
- UserWarning
When doi does not match the 10.XXXX/ prefix pattern. A warning (not a raise) is used because real-world DOIs are not always well-formed, and hard rejection would discard valid papers.
- Return type:
None
Notes
Call validate() explicitly after constructing a document via the dataclass constructor. The create factory calls it automatically.
Examples
>>> doc = CorpusDocument.create("f.txt", 0, "Hello world.")
>>> doc.validate()  # no exception
>>> bad = CorpusDocument(
...     doc_id="", input_path="f.txt", chunk_index=0, text="Hello."
... )
>>> bad.validate()
Traceback (most recent call last):
    ...
ValueError: CorpusDocument.doc_id must be a non-empty string; got ''
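The warn-instead-of-raise behaviour for doi can be sketched with the warnings module. check_doi is a hypothetical helper, and the regular expression is an assumed reading of the "10.XXXX/" prefix pattern (four or more digits after "10."); the library's actual pattern may be stricter or looser.

```python
import re
import warnings
from typing import Optional

# Assumed reading of the documented "10.XXXX/" prefix.
_DOI_PREFIX = re.compile(r"^10\.\d{4,}/")


def check_doi(doi: Optional[str]) -> None:
    """Warn, but do not raise, when a DOI lacks the expected prefix."""
    if doi is not None and not _DOI_PREFIX.match(doi):
        warnings.warn(
            f"doi {doi!r} does not match the '10.XXXX/' prefix pattern",
            UserWarning,
            stacklevel=2,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_doi("10.1000/xyz123")  # well-formed prefix: no warning
    check_doi("doi:abc")         # malformed: one UserWarning, no exception
```

Because the check only warns, a document with an unusual DOI still passes validation and the caller can decide whether to filter it later.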
Gallery examples#
corpus WHO European Region local or url per file with examples