CustomChunker

class scikitplot.corpus.CustomChunker(chunk_fn, *, name=None)

Wrap any callable as a ChunkerBase.

The caller provides a chunk_fn that accepts (text: str, metadata: dict) and returns list[tuple[int, str]] where each tuple is (char_start, chunk_text). This covers the full chunk contract without subclassing.
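As a minimal sketch of a callable that satisfies this contract, the function below cuts text into consecutive fixed-size windows and reports the character offset of each one. The `window` metadata key is hypothetical and used only for illustration; the keys actually present in `metadata` depend on the reader.

```python
from typing import Any


def fixed_window_chunk(text: str, metadata: dict[str, Any]) -> list[tuple[int, str]]:
    """Cut text into consecutive windows of at most `window` characters."""
    # "window" is a hypothetical metadata key, not one defined by the library.
    window = int((metadata or {}).get("window", 20))
    if window <= 0:
        raise ValueError("window must be positive")
    # Each pair is (char_start, chunk_text); char_start indexes into `text`.
    return [(start, text[start:start + window]) for start in range(0, len(text), window)]


pairs = fixed_window_chunk("abcdefghij", {"window": 4})
print(pairs)  # [(0, 'abcd'), (4, 'efgh'), (8, 'ij')]
```

A function like this can be passed directly as `chunk_fn`.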

Parameters:
chunk_fn : callable

Chunking callable. Signature:

def chunk_fn(
    text: str,
    metadata: dict[str, Any],
) -> list[tuple[int, str]]: ...

text is the raw text block to segment. metadata is the raw-chunk metadata dict passed by get_documents. Returns (char_start, chunk_text) pairs, same contract as ChunkerBase.chunk.

name : str, optional

Human-readable label used in __repr__ and logging. Default: the __name__ attribute of chunk_fn.
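Because the default label comes from `__name__`, it helps to know what common callables report. The sketch below uses only standard Python: a plain function has a descriptive `__name__`, a lambda reports `<lambda>`, and a `functools.partial` object has no `__name__` at all. How `CustomChunker` handles the last case is not documented here, so passing `name` explicitly for such callables is the safer choice. The `split_on` helper is hypothetical.

```python
from functools import partial


def split_on(text, metadata, sep):
    """Split on `sep`, returning (char_start, chunk_text) pairs."""
    result, cursor = [], 0
    for piece in text.split(sep):
        if piece:
            result.append((cursor, piece))
        # Advance past this piece and the separator that followed it.
        cursor += len(piece) + len(sep)
    return result


by_semicolon = partial(split_on, sep=";")
anonymous = lambda text, metadata: [(0, text)]

print(split_on.__name__)                  # split_on
print(anonymous.__name__)                 # <lambda>
print(hasattr(by_semicolon, "__name__"))  # False
```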

Attributes:
strategy : ChunkingStrategy

Always CUSTOM.

Raises:
TypeError

If chunk_fn is not callable.

See also

scikitplot.corpus._base.ChunkerBase

Abstract base class.

scikitplot.corpus._chunkers.SentenceChunker

Built-in sentence chunker.

Notes

User note: Use this when none of the built-in chunkers fit your segmentation logic — custom XML tag boundaries, transcript cue-based splits, semantic paragraph detection via a local LLM, etc.
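As one illustration of a cue-based split, the sketch below starts a new chunk at every timestamp marker. The `[HH:MM:SS]` cue format and the `cue_chunk` name are assumptions made for this example; adapt the pattern to your transcript format.

```python
import re

# Hypothetical cue format: "[HH:MM:SS]" at the start of a line.
CUE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]", re.MULTILINE)


def cue_chunk(text, metadata):
    """Start a new chunk at every cue marker; return (char_start, chunk_text)."""
    starts = [m.start() for m in CUE.finditer(text)]
    if not starts:
        # No cues found: fall back to a single chunk, or nothing for blank text.
        return [(0, text)] if text.strip() else []
    bounds = starts + [len(text)]
    result = []
    for begin, end in zip(bounds, bounds[1:]):
        # Trailing whitespace is trimmed; the start offset is unaffected.
        segment = text[begin:end].rstrip()
        if segment:
            result.append((begin, segment))
    return result


transcript = "[00:00:01] Hello.\n[00:00:05] World\nmore.\n"
print(cue_chunk(transcript, {}))
# [(0, '[00:00:01] Hello.'), (18, '[00:00:05] World\nmore.')]
```

Any text that precedes the first cue is dropped in this sketch; keep it as a leading chunk if your data needs it.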

Developer note: The strategy class variable is fixed to CUSTOM so the pipeline records the correct ChunkingStrategy on every generated CorpusDocument.
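The pattern described here can be sketched with stand-in classes. Nothing below is the library's implementation: `Strategy` and `CallableChunker` are hypothetical stand-ins for `ChunkingStrategy` and `CustomChunker`, the fallback used when `chunk_fn` has no `__name__` is an assumption, and so is replacing a missing `metadata` with an empty dict. The error handling mirrors the exceptions documented on this page.

```python
from enum import Enum


class Strategy(str, Enum):  # stand-in for ChunkingStrategy
    SENTENCE = "sentence"
    CUSTOM = "custom"


class CallableChunker:  # stand-in for CustomChunker, not the library class
    # Class variable: every instance reports the same strategy.
    strategy = Strategy.CUSTOM

    def __init__(self, chunk_fn, *, name=None):
        if not callable(chunk_fn):
            raise TypeError("chunk_fn must be callable")
        self.chunk_fn = chunk_fn
        # Assumed fallback for callables without __name__ (e.g. partials).
        self.name = name or getattr(chunk_fn, "__name__", type(chunk_fn).__name__)

    def chunk(self, text, metadata=None):
        if text is None:
            raise ValueError("text must not be None")
        try:
            # Assumption: a missing metadata dict is passed on as {}.
            return self.chunk_fn(text, metadata or {})
        except Exception as exc:
            raise RuntimeError(f"{self.name} failed") from exc


whole = CallableChunker(lambda text, metadata: [(0, text)], name="Whole")
print(whole.strategy.value)  # custom
```

Fixing the value on the class rather than per instance means a caller cannot accidentally label a custom chunker with a built-in strategy.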

Examples

Split on double newlines (paragraph-like) without using ParagraphChunker:

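from scikitplot.corpus import CustomChunker

def my_para_chunk(text, metadata):
    # Non-empty paragraphs, stripped of surrounding whitespace.
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    cursor = 0
    result = []
    for para in paras:
        # Search from the cursor so repeated paragraphs get distinct offsets.
        idx = text.find(para, cursor)
        result.append((idx, para))
        cursor = idx + len(para)
    return result

chunker = CustomChunker(my_para_chunk, name="DoubleNewlineChunker")
# Assumes CorpusPipeline has already been imported.
pipeline = CorpusPipeline(chunker=chunker)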

chunk(text, metadata=None)

Delegate to the user-supplied chunk_fn.

Parameters:
text : str

Raw text to segment.

metadata : dict or None, optional

Raw-chunk metadata forwarded from the reader.

Returns:
list of (int, str)

(char_start, chunk_text) pairs.

Raises:
ValueError

If text is None.

RuntimeError

If chunk_fn raises an unexpected exception.

Return type:

list[tuple[int, str]]
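Before wiring a new `chunk_fn` into a pipeline, it can be useful to sanity-check the pairs it returns. The helper below is a hypothetical sketch, not part of the library. It assumes each chunk is a verbatim slice of `text` beginning at `char_start`, which is stricter than what this page states, so relax the last check if your chunker normalizes text.

```python
def check_pairs(text, pairs):
    """Return a list of problems found in (char_start, chunk_text) pairs."""
    problems = []
    for i, pair in enumerate(pairs):
        if not (isinstance(pair, tuple) and len(pair) == 2):
            problems.append(f"item {i}: expected a 2-tuple, got {pair!r}")
            continue
        start, chunk = pair
        if not isinstance(start, int) or not isinstance(chunk, str):
            problems.append(f"item {i}: expected (int, str)")
        elif not 0 <= start <= len(text):
            # Catches a -1 leaking out of a failed str.find().
            problems.append(f"item {i}: char_start {start} is out of range")
        elif text[start:start + len(chunk)] != chunk:
            # Assumption: chunks are verbatim slices of `text`.
            problems.append(f"item {i}: text at {start} does not match chunk")
    return problems


sample = "ab\n\ncd"
print(check_pairs(sample, [(0, "ab"), (4, "cd")]))  # []
print(check_pairs(sample, [(-1, "ab")]))
# ['item 0: char_start -1 is out of range']
```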

strategy: ChunkingStrategy = 'custom'

Identifies which ChunkingStrategy this implementation provides. Must be defined on every concrete subclass.