ChunkerBase#

class scikitplot.corpus.ChunkerBase[source]#

Abstract base class for all text chunkers.

A chunker receives a block of raw text (one logical unit from the source document — a page, paragraph block, section, etc.) and returns a list of (char_start, chunk_text) tuples. The char_start offset is relative to the beginning of the input text block, enabling downstream code to reconstruct absolute character positions.

Parameters:

None — subclasses define their own parameters.

Attributes:

strategyChunkingStrategy: Class variable. Identifies which ChunkingStrategy enum member this chunker implements. Must be defined by every concrete subclass.

See also

scikitplot.corpus._chunkers.SentenceChunker: spaCy sentence segmentation.
scikitplot.corpus._chunkers.ParagraphChunker: Blank-line paragraph split.
scikitplot.corpus._chunkers.FixedWindowChunker: Sliding-window with overlap.

Notes

Chunkers must be stateless between chunk() calls. Any state required for a single call (e.g. a loaded language model) must be initialised inside chunk() or cached as an instance attribute that is never mutated after first assignment.

Examples

Implementing a trivial single-chunk chunker (no splitting):

>>> class NullChunker(ChunkerBase):
...     strategy = ChunkingStrategy.NONE
...
...     def chunk(self, text, metadata=None):
...         return [(0, text)] if text.strip() else []

assert_modality(doc_modality)[source]#

Raise ValueError if this chunker cannot handle doc_modality.

Parameters:

doc_modalityModality: The modality of the document about to be chunked.

Raises:

ValueError: If doc_modality is not in supported_modalities.

Parameters:

doc_modality (Modality)

Return type:

None

Notes

HIGH-03c fix: call this at the start of chunk to prevent silent garbage output when the wrong chunker is applied to a non-TEXT document. Example:

def chunk(self, text, metadata=None):
    self.assert_modality(
        Modality((metadata or {}).get("modality", Modality.TEXT))
    )
    ...

abstractmethod chunk(text, metadata=None)[source]#

Segment text into a ChunkResult.

CRITICAL-02 (Phase 2): Return type unified to ChunkResult across all implementations. Callers iterate result.chunks directly; the intermediate list[tuple[int, str]] contract has been retired.

Parameters:

textstr: Raw text to segment. Must not be None. Empty string input must return a ChunkResult with an empty chunks list (never raise).
metadatadict or None, optional: Chunk-level metadata from the reader (e.g. page number, section type). Available so chunkers that need context can access it. Default: None.

Returns:

ChunkResult: Ordered list of Chunk objects. Each chunk carries text, start_char, end_char, and metadata. The list must be non-empty only when text contains meaningful content.

Raises:

ValueError: If text is None (not just empty).

Parameters:

text (str)
metadata (dict[str, Any] | None)

Return type:

ChunkResult

Notes

Backward compat: ChunkerBridge wraps all new-style standalone chunkers and returns ChunkResult from its chunk() method. Pre-CRITICAL-02 user subclasses of ChunkerBase that return list[tuple[int, str]] should migrate to ChunkResult; the pipeline will raise AttributeError if chunk_result.chunks is not accessible.

strategy: ClassVar[ChunkingStrategy]#: Identifies which ChunkingStrategy this implementation provides. Must be defined on every concrete subclass.

supported_modalities: ClassVar[frozenset[Modality]] = frozenset({Modality.TEXT})#

{Modality.TEXT}. Subclasses that support additional modalities (e.g. AUDIO transcripts) must override this set.

Type:: Modalities this chunker can handle. Default

version: ClassVar[str] = '1.0.0'#

"1.0.0". Subclasses must override to reflect their actual version so that corpus snapshots remain reproducible after upgrades.

Type:: SemVer string for this chunker implementation. Default