ChunkerBase
- class scikitplot.corpus.ChunkerBase
Abstract base class for all text chunkers.
A chunker receives a block of raw text (one logical unit from the source document, such as a page, paragraph block, or section) and returns a list of (char_start, chunk_text) tuples. The char_start offset is relative to the beginning of the input text block, so downstream code can reconstruct absolute character positions.
- Parameters:
- None. Subclasses define their own parameters.
- Attributes:
- strategy : ChunkingStrategy
Class variable. Identifies which ChunkingStrategy enum member this chunker implements. Must be defined by every concrete subclass.
See also
scikitplot.corpus._chunkers.SentenceChunker: spaCy sentence segmentation.
scikitplot.corpus._chunkers.ParagraphChunker: Blank-line paragraph split.
scikitplot.corpus._chunkers.FixedWindowChunker: Sliding-window with overlap.
Notes
Chunkers must be stateless between chunk() calls. Any state required for a single call (e.g. a loaded language model) must be initialised inside chunk() or cached as an instance attribute that is never mutated after first assignment.
Examples
Implementing a trivial single-chunk chunker (no splitting):
>>> class NullChunker(ChunkerBase):
...     strategy = ChunkingStrategy.NONE
...
...     def chunk(self, text, metadata=None):
...         return [(0, text)] if text.strip() else []
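The following is a minimal sketch of how block-relative char_start offsets could be turned back into absolute document positions. To keep it runnable without the library, it defines a local stand-in for ChunkerBase and uses a plain string in place of a ChunkingStrategy member. LineChunker, absolute_spans, and block_offset are hypothetical names introduced for illustration only and are not part of scikitplot.

```python
from abc import ABC, abstractmethod


class ChunkerBase(ABC):
    """Local stand-in for scikitplot.corpus.ChunkerBase (assumption)."""

    @abstractmethod
    def chunk(self, text, metadata=None):
        ...


class LineChunker(ChunkerBase):
    """Hypothetical chunker: one chunk per non-blank line."""

    # Placeholder; the real attribute is a ChunkingStrategy enum member.
    strategy = "line"

    def chunk(self, text, metadata=None):
        if text is None:
            raise ValueError("text must not be None")
        chunks = []
        pos = 0  # offset of the current line within the input block
        for line in text.splitlines(keepends=True):
            stripped = line.strip()
            if stripped:
                # Offset of the stripped content within the input block.
                start = pos + line.index(stripped)
                chunks.append((start, stripped))
            pos += len(line)
        return chunks


def absolute_spans(block_offset, text, chunker):
    """Map block-relative chunk offsets to absolute (start, end) spans."""
    return [
        (block_offset + start, block_offset + start + len(chunk))
        for start, chunk in chunker.chunk(text)
    ]


document = "Title\n\nFirst line.\n  Second line.\n"
block_offset = 7  # the block begins right after "Title\n\n"
block = document[block_offset:]
spans = absolute_spans(block_offset, block, LineChunker())
```

Slicing the full document with each span, as in `document[start:end]`, should give back the chunk text, which is the property that a block-relative offset is meant to make possible.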
- abstractmethod chunk(text, metadata=None)
Segment text into a list of (char_start, chunk_text) tuples.
- Parameters:
- text : str
Raw text to segment. Must not be None. Empty string input must return an empty list (never raise).
- metadata : dict or None, optional
Chunk-level metadata from the reader (e.g. page number, section type). Made available so that chunkers that need context (for example, a semantic chunker deciding boundaries based on section label) can access it. Default: None.
- Returns:
- list of (int, str)
Ordered list of (char_start, chunk_text) pairs. char_start is the character offset of chunk_text within the input text string. Must be non-negative and monotonically non-decreasing across the list.
- Raises:
- ValueError
If text is None (not just empty).
Notes
The return type is a list (not a generator) so callers can inspect its length without consuming an iterator. For very large texts, chunkers should still build the returned list incrementally rather than loading everything into memory at once.
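The return contract described above (a list, non-negative offsets, monotonically non-decreasing order) can be checked mechanically in a test suite. The helper below is a sketch under stated assumptions: contract_violations is a hypothetical name, not a scikitplot function, and the optional verbatim check assumes chunk_text is an exact slice of the input, which is stricter than what the documentation states.

```python
def contract_violations(text, chunks, verbatim=True):
    """Return a list of human-readable problems; empty means compliant."""
    if not isinstance(chunks, list):
        return ["result is not a list"]
    problems = []
    previous = 0  # largest char_start seen so far
    for i, (start, chunk_text) in enumerate(chunks):
        if start < 0:
            problems.append(f"chunk {i}: negative char_start {start}")
        if start < previous:
            problems.append(
                f"chunk {i}: char_start {start} is below previous {previous}"
            )
        # Assumption: chunk_text is copied verbatim from the input text.
        if verbatim and text[start:start + len(chunk_text)] != chunk_text:
            problems.append(f"chunk {i}: text does not match input at {start}")
        previous = max(previous, start)
    return problems


ok = contract_violations("ab cd", [(0, "ab"), (3, "cd")])
out_of_order = contract_violations("ab cd", [(3, "cd"), (0, "ab")])
```

Chunkers that normalise whitespace or otherwise rewrite the chunk text would need verbatim=False, since only the offset rules are stated in the contract.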