CustomChunker
- class scikitplot.corpus.CustomChunker(chunk_fn, *, name=None)
Wrap any callable as a ChunkerBase.
The caller provides a chunk_fn that accepts (text: str, metadata: dict) and returns list[tuple[int, str]], where each tuple is (char_start, chunk_text). This covers the full chunk contract without subclassing.
- Parameters:
- chunk_fn : callable
Chunking callable. Signature:
    def chunk_fn(
        text: str,
        metadata: dict[str, Any],
    ) -> list[tuple[int, str]]: ...
text is the raw text block to segment. metadata is the raw-chunk metadata dict passed by get_documents. Returns (char_start, chunk_text) pairs, the same contract as ChunkerBase.chunk.
- name : str, optional
Human-readable label used in __repr__ and logging. Default: the __name__ attribute of chunk_fn.
- Attributes:
- strategy : ChunkingStrategy
Always CUSTOM.
- Raises:
- TypeError
If chunk_fn is not callable.
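The contract described above can be exercised before a callable is handed to CustomChunker. The following is a minimal sketch, not part of the library: fixed_window_chunk, check_chunk_contract, and the "window" metadata key are hypothetical names introduced for illustration, and the offset check assumes that char_start is the position at which chunk_text begins in text.

```python
from typing import Any, Callable

ChunkFn = Callable[[str, dict[str, Any]], list[tuple[int, str]]]


def fixed_window_chunk(text: str, metadata: dict[str, Any]) -> list[tuple[int, str]]:
    """Hypothetical chunk_fn: split text into fixed-size character windows."""
    size = int(metadata.get("window", 20))  # "window" is an illustrative key
    return [(start, text[start:start + size]) for start in range(0, len(text), size)]


def check_chunk_contract(
    chunk_fn: ChunkFn, text: str, metadata: dict[str, Any]
) -> list[tuple[int, str]]:
    """Run chunk_fn once and verify the shape of what it returns."""
    if not callable(chunk_fn):
        raise TypeError("chunk_fn must be callable")
    pairs = chunk_fn(text, metadata)
    for char_start, chunk_text in pairs:
        if not isinstance(char_start, int) or not isinstance(chunk_text, str):
            raise TypeError("each item must be an (int, str) pair")
        # Assumption: char_start is the offset where chunk_text begins in text.
        if text[char_start:char_start + len(chunk_text)] != chunk_text:
            raise ValueError(f"offset {char_start} does not match its chunk text")
    return pairs


sample = "alpha beta gamma delta epsilon"
pairs = check_chunk_contract(fixed_window_chunk, sample, {"window": 10})
```

A check of this kind is useful in unit tests for a custom chunk_fn, since offset mistakes are otherwise easy to miss.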
See also
scikitplot.corpus._base.ChunkerBase : Abstract base class.
scikitplot.corpus._chunkers.SentenceChunker : Built-in sentence chunker.
Notes
User note: Use this when none of the built-in chunkers fit your segmentation logic — custom XML tag boundaries, transcript cue-based splits, semantic paragraph detection via a local LLM, etc.
Developer note: The strategy class variable is fixed to CUSTOM so the pipeline records the correct ChunkingStrategy on every generated CorpusDocument.
Examples
Split on double newlines (paragraph-like) without using ParagraphChunker:
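    from scikitplot.corpus import CustomChunker

    def my_para_chunk(text, metadata):
        # Split on blank lines and drop empty paragraphs.
        paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        cursor = 0
        result = []
        for para in paras:
            # Search from the cursor so repeated paragraphs get distinct offsets.
            idx = text.find(para, cursor)
            result.append((idx, para))
            cursor = idx + len(para)
        return result

    chunker = CustomChunker(my_para_chunk, name="DoubleNewlineChunker")
    # CorpusPipeline is assumed to be imported already.
    pipeline = CorpusPipeline(chunker=chunker)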
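Custom XML tag boundaries, one of the use cases named in the Notes, can be handled in the same way. The sketch below is illustrative only: the <section> tag, the SECTION_RE pattern, and xml_section_chunk are hypothetical choices, and a regular expression is used in place of a real XML parser to keep the example short.

```python
import re
from typing import Any

# Hypothetical tag name; adjust to the markup you are actually splitting.
SECTION_RE = re.compile(r"<section>(.*?)</section>", re.DOTALL)


def xml_section_chunk(text: str, metadata: dict[str, Any]) -> list[tuple[int, str]]:
    """Return one (char_start, chunk_text) pair per non-empty <section> body."""
    result = []
    for match in SECTION_RE.finditer(text):
        body = match.group(1)
        stripped = body.strip()
        if not stripped:
            continue  # skip empty sections
        # Shift the offset past any leading whitespace removed by strip().
        leading = len(body) - len(body.lstrip())
        result.append((match.start(1) + leading, stripped))
    return result


doc = "<section> Intro text </section><section></section><section>Details</section>"
chunks = xml_section_chunk(doc, {})
```

Following the constructor signature documented above, such a function would be wrapped as CustomChunker(xml_section_chunk, name="XmlSectionChunker").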
- chunk(text, metadata=None)
Delegate to the user-supplied chunk_fn.
- Parameters:
- text : str
Raw text to segment.
- metadata : dict or None, optional
Raw-chunk metadata forwarded from the reader.
- Returns:
- list of (int, str)
(char_start, chunk_text) pairs.
- Raises:
- ValueError
If text is None.
- RuntimeError
If chunk_fn raises an unexpected exception.
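The delegation and error behavior documented for chunk can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the library's implementation: DelegatingChunker is a hypothetical name, and both the replacement of a missing metadata dict with an empty one and the exception chaining are guesses at reasonable behavior.

```python
from typing import Any, Callable, Optional


class DelegatingChunker:
    """Hypothetical stand-in that mimics the documented behavior of chunk."""

    def __init__(
        self,
        chunk_fn: Callable[[str, dict[str, Any]], list[tuple[int, str]]],
        *,
        name: Optional[str] = None,
    ) -> None:
        if not callable(chunk_fn):
            raise TypeError("chunk_fn must be callable")
        self._chunk_fn = chunk_fn
        # Fall back to the callable's __name__, as the name parameter describes.
        self.name = name or getattr(chunk_fn, "__name__", type(chunk_fn).__name__)

    def chunk(
        self, text: str, metadata: Optional[dict[str, Any]] = None
    ) -> list[tuple[int, str]]:
        if text is None:
            raise ValueError("text must not be None")
        try:
            # Assumption: a missing metadata dict is replaced by an empty one.
            return self._chunk_fn(text, metadata or {})
        except Exception as exc:
            # Assumption: the original error is chained for easier debugging.
            raise RuntimeError(f"{self.name} failed: {exc}") from exc


def broken(text: str, metadata: dict[str, Any]) -> list[tuple[int, str]]:
    raise KeyError("missing cue")


caught = None
try:
    DelegatingChunker(broken, name="Broken").chunk("some text")
except RuntimeError as err:
    caught = err  # the KeyError is available as caught.__cause__
```

Callers that need the underlying failure can inspect the chained exception instead of parsing the message.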