SentenceChunker#
- class scikitplot.corpus.SentenceChunker(config=None)[source]#
Split a document into sentence-level Chunk objects.
- Parameters:
- config : str or SentenceChunkerConfig or None, optional
Three accepted forms:
- None (default): constructs a SentenceChunkerConfig with all defaults: REGEX backend, min_length=10, no overlap.
- str: shorthand for the SPACY backend. The string is interpreted as the spaCy model name, equivalent to:
SentenceChunkerConfig(
    backend=SentenceBackend.SPACY,
    spacy_model=<value>,
)
- SentenceChunkerConfig: full explicit configuration.
- Raises:
- TypeError
If config is not str, SentenceChunkerConfig, or None.
- ValueError
If the resolved configuration is invalid (negative lengths, missing model name for SPACY backend, etc.).
Notes
spaCy model caching: the loaded nlp object is stored in self._nlp_cache (a plain dict) keyed by model name. The cache is passed into _split_spacy on every call, so spacy.load is invoked at most once per model per chunker instance.
Examples
Default REGEX backend:
>>> chunker = SentenceChunker()
>>> result = chunker.chunk("Hello world. How are you? Fine thanks.")
>>> len(result.chunks)
3
>>> result.chunks[0].text
'Hello world.'
spaCy shorthand (model name as string):
>>> chunker = SentenceChunker("en_core_web_sm")
>>> chunker.config.backend
<SentenceBackend.SPACY: 'spacy'>
>>> chunker.config.spacy_model
'en_core_web_sm'
Explicit config:
>>> from scikitplot.corpus._chunkers._sentence import (
...     SentenceBackend,
...     SentenceChunkerConfig,
... )
>>> cfg = SentenceChunkerConfig(backend=SentenceBackend.NLTK, min_length=5)
>>> chunker = SentenceChunker(cfg)
- chunk(text, doc_id=None, extra_metadata=None)[source]#
Split text into sentence-level chunks.
- Parameters:
- text : str
Raw document text.
- doc_id : str, optional
Document identifier stored in chunk metadata.
- extra_metadata : dict[str, Any], optional
Additional key/value pairs merged into the result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
- Return type:
ChunkResult
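The input validation and REGEX-backend splitting can be approximated in a few lines; the regex and helper name here are illustrative assumptions, not the library's actual implementation:

```python
import re


def split_sentences(text: str, min_length: int = 1) -> list[str]:
    """Rough sketch of chunk()'s contract: validate the input, split on
    sentence-ending punctuation followed by whitespace, and drop pieces
    shorter than min_length characters."""
    if not isinstance(text, str):
        raise TypeError("text must be str")
    if not text.strip():
        raise ValueError("text is empty or whitespace-only")
    # Split after ., !, or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if len(p) >= min_length]


print(split_sentences("Hello world. How are you? Fine thanks."))
# ['Hello world.', 'How are you?', 'Fine thanks.']
```

This mirrors the doctest above: the same three sentences come back, and empty or whitespace-only input raises ValueError rather than returning an empty result.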
- chunk_batch(texts, doc_ids=None, extra_metadata=None)[source]#
Chunk a list of documents.
- Parameters:
- texts : list[str]
Input documents.
- doc_ids : list[str], optional
Parallel document identifiers.
- extra_metadata : dict[str, Any], optional
Shared metadata merged into every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If doc_ids length does not match texts length.
- Return type:
list[ChunkResult]
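The batch-level validation contract (TypeError for a non-list, ValueError for mismatched lengths) can be sketched as follows; the function and result shape are illustrative, not the real ChunkResult:

```python
def chunk_batch(texts, doc_ids=None):
    """Sketch of chunk_batch()'s validation: texts must be a list, and
    doc_ids, when given, must be parallel to texts."""
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids length must match texts length")
    if doc_ids is None:
        doc_ids = [None] * len(texts)
    # One result per document, pairing each text with its identifier.
    return [{"doc_id": d, "text": t} for t, d in zip(texts, doc_ids)]


results = chunk_batch(["First doc.", "Second doc."], doc_ids=["a", "b"])
print([r["doc_id"] for r in results])  # ['a', 'b']
```

When doc_ids is omitted, every result simply carries a None identifier, so the output length always equals len(texts).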
- property config: SentenceChunkerConfig#
The resolved SentenceChunkerConfig for this instance.
Gallery examples#
corpus Knowledge and Information local .png with examples
corpus WHO European Region YouTube shorts with examples
corpus WHO European Region local .zip with examples