ChunkerBase
- class scikitplot.corpus.ChunkerBase
Abstract base class for all text chunkers.
A chunker receives a block of raw text (one logical unit from the source document, such as a page, paragraph block, or section) and returns a list of (char_start, chunk_text) tuples. The char_start offset is relative to the beginning of the input text block, enabling downstream code to reconstruct absolute character positions.
- Parameters:
- None. Subclasses define their own parameters.
- Attributes:
- strategy : ChunkingStrategy
Class variable. Identifies which ChunkingStrategy enum member this chunker implements. Must be defined by every concrete subclass.
See also
scikitplot.corpus._chunkers.SentenceChunker : spaCy sentence segmentation.
scikitplot.corpus._chunkers.ParagraphChunker : Blank-line paragraph split.
scikitplot.corpus._chunkers.FixedWindowChunker : Sliding-window with overlap.
Notes
Chunkers must be stateless between chunk() calls. Any state required for a single call (e.g. a loaded language model) must be initialised inside chunk() or cached as an instance attribute that is never mutated after first assignment.
Examples
Implementing a trivial single-chunk chunker (no splitting):
>>> class NullChunker(ChunkerBase):
...     strategy = ChunkingStrategy.NONE
...
...     def chunk(self, text, metadata=None):
...         return [(0, text)] if text.strip() else []
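The char_start convention described in the class summary can be illustrated without the library. The following is a minimal sketch, assuming a hypothetical blank-line splitter (split_paragraphs), a hypothetical helper (to_absolute), and a block_offset value that the reader would supply; none of these names come from scikitplot.

```python
import re


def split_paragraphs(text):
    """Hypothetical stand-in for a chunker: split on blank lines.

    Returns (char_start, chunk_text) tuples whose offsets are relative
    to the start of ``text``.
    """
    chunks = []
    start = 0
    # Each match is a separator made of a newline, optional whitespace,
    # and another newline.
    for sep in re.finditer(r"\n\s*\n", text):
        piece = text[start:sep.start()]
        if piece.strip():
            chunks.append((start, piece))
        start = sep.end()
    tail = text[start:]
    if tail.strip():
        chunks.append((start, tail))
    return chunks


def to_absolute(chunks, block_offset):
    """Shift block-relative offsets by the block's position in the document."""
    return [(block_offset + char_start, chunk_text)
            for char_start, chunk_text in chunks]


# A block that starts partway into a larger document.
page = "First paragraph.\n\nSecond paragraph."
document = "Title page\n" + page
block_offset = len("Title page\n")  # assumed to be known by the reader

relative = split_paragraphs(page)
absolute = to_absolute(relative, block_offset)

# Every absolute offset points at the chunk text inside the full document.
for char_start, chunk_text in absolute:
    assert document[char_start:char_start + len(chunk_text)] == chunk_text
```

Keeping offsets relative to the block means the chunker never needs to know where the block sits in the document; only the caller that holds the block's position performs the shift.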
- assert_modality(doc_modality)
Raise ValueError if this chunker cannot handle doc_modality.
- Parameters:
- doc_modality : Modality
The modality of the document about to be chunked.
- Raises:
- ValueError
If doc_modality is not in supported_modalities.
- Return type:
None
Notes
HIGH-03c fix: call this at the start of chunk to prevent silent garbage output when the wrong chunker is applied to a non-TEXT document. Example:
    def chunk(self, text, metadata=None):
        self.assert_modality(
            Modality((metadata or {}).get("modality", Modality.TEXT))
        )
        ...
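To make the guard concrete, here is a self-contained sketch of how such a check might behave. It uses stand-in definitions only: the Modality enum below is an assumption (only TEXT is named in this documentation; IMAGE and the string values are hypothetical), and SketchChunker is not the library's implementation. For brevity the chunk method returns the tuple form used in the NullChunker example.

```python
from enum import Enum


class Modality(Enum):
    # Stand-in enum. Only TEXT is named in the documentation;
    # IMAGE and the string values are assumptions for illustration.
    TEXT = "text"
    IMAGE = "image"


class SketchChunker:
    # Stand-in for a concrete chunker; not scikitplot's ChunkerBase.
    supported_modalities = frozenset({Modality.TEXT})

    def assert_modality(self, doc_modality):
        """Raise ValueError if doc_modality is not supported."""
        if doc_modality not in self.supported_modalities:
            raise ValueError(
                f"{type(self).__name__} cannot handle modality {doc_modality!r}"
            )

    def chunk(self, text, metadata=None):
        # Guard first, so an unsupported document fails loudly
        # instead of producing meaningless chunks.
        self.assert_modality(
            Modality((metadata or {}).get("modality", Modality.TEXT))
        )
        return [(0, text)] if text.strip() else []


chunker = SketchChunker()
ok = chunker.chunk("plain text", {"modality": "text"})

try:
    chunker.chunk("raw bytes", {"modality": "image"})
    rejected = False
except ValueError:
    rejected = True
```

Wrapping the metadata value in Modality(...) accepts either an enum member or its raw value, so the default of Modality.TEXT applies when the reader supplies no modality at all.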
- abstractmethod chunk(text, metadata=None)
Segment text into a ChunkResult.
CRITICAL-02 (Phase 2): Return type unified to ChunkResult across all implementations. Callers iterate result.chunks directly; the intermediate list[tuple[int, str]] contract has been retired.
- Parameters:
- text : str
Raw text to segment. Must not be None. Empty string input must return a ChunkResult with an empty chunks list (never raise).
- metadata : dict or None, optional
Chunk-level metadata from the reader (e.g. page number, section type). Available so chunkers that need context can access it. Default: None.
- Returns:
- ChunkResult
Ordered list of Chunk objects. Each chunk carries text, start_char, end_char, and metadata. The list must be non-empty only when text contains meaningful content.
- Raises:
- ValueError
If text is None (not just empty).
- Return type:
ChunkResult
Notes
Backward compat: ChunkerBridge wraps all new-style standalone chunkers and returns ChunkResult from its chunk() method. Pre-CRITICAL-02 user subclasses of ChunkerBase that return list[tuple[int, str]] should migrate to ChunkResult; the pipeline will raise AttributeError if chunk_result.chunks is not accessible.
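The migration described above might look roughly like the sketch below. The Chunk and ChunkResult dataclasses here are stand-ins built only from the field names listed in this documentation (text, start_char, end_char, metadata, chunks); the real classes may take different constructor arguments or carry more fields. The helper tuples_to_chunk_result is hypothetical, and treating end_char as an exclusive offset is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Stand-in with the four fields named in the documentation.
    text: str
    start_char: int
    end_char: int  # assumed exclusive: text spans [start_char, end_char)
    metadata: dict = field(default_factory=dict)


@dataclass
class ChunkResult:
    # Stand-in: an ordered list of Chunk objects.
    chunks: list = field(default_factory=list)


def tuples_to_chunk_result(pairs, metadata=None):
    """Hypothetical adapter from legacy (char_start, chunk_text) tuples."""
    return ChunkResult(chunks=[
        Chunk(
            text=chunk_text,
            start_char=char_start,
            end_char=char_start + len(chunk_text),
            metadata=dict(metadata or {}),  # copy so chunks do not share state
        )
        for char_start, chunk_text in pairs
    ])


def legacy_chunk(text, metadata=None):
    """Pre-migration behaviour: returns list[tuple[int, str]]."""
    return [(0, text)] if text.strip() else []


def chunk(text, metadata=None):
    """Migrated behaviour following the contract described above."""
    if text is None:
        raise ValueError("text must not be None")
    return tuples_to_chunk_result(legacy_chunk(text, metadata), metadata)


result = chunk("hello world", {"page": 3})
empty = chunk("")
```

With this shape, callers can iterate result.chunks directly, an empty string yields a ChunkResult whose chunks list is empty, and only a None input raises.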
- strategy: ClassVar[ChunkingStrategy]
Identifies which ChunkingStrategy this implementation provides. Must be defined on every concrete subclass.