SentenceChunker#
- class scikitplot.corpus.SentenceChunker(config=None)[source]#
Split a document into sentence-level Chunk objects.
- Parameters:
- config : str or SentenceChunkerConfig or None, optional
Three accepted forms:
- None (default): constructs a SentenceChunkerConfig with all defaults: REGEX backend, min_length=10, no overlap.
- str: shorthand for the SPACY backend. The string is interpreted as the spaCy model name, equivalent to:
SentenceChunkerConfig(
    backend=SentenceBackend.SPACY,
    spacy_model=<value>,
)
- SentenceChunkerConfig: full explicit configuration.
- Raises:
- TypeError
If config is not str, SentenceChunkerConfig, or None.
- ValueError
If the resolved configuration is invalid (negative lengths, missing model name for SPACY backend, etc.).
Notes
spaCy model caching: the loaded nlp object is stored in self._nlp_cache (a plain dict) keyed by model name. The cache is passed into _split_spacy on every call, so spacy.load is invoked at most once per model per chunker instance.
Examples
Default REGEX backend:
>>> chunker = SentenceChunker()
>>> result = chunker.chunk("Hello world. How are you? Fine thanks.")
>>> len(result.chunks)
3
>>> result.chunks[0].text
'Hello world.'
spaCy shorthand (model name as string):
>>> chunker = SentenceChunker("en_core_web_sm")
>>> chunker.config.backend
<SentenceBackend.SPACY: 'spacy'>
>>> chunker.config.spacy_model
'en_core_web_sm'
Explicit config:
>>> from scikitplot.corpus._chunkers._sentence import (
...     SentenceBackend,
...     SentenceChunkerConfig,
... )
>>> cfg = SentenceChunkerConfig(backend=SentenceBackend.NLTK, min_length=5)
>>> chunker = SentenceChunker(cfg)
- chunk(text, doc_id=None, extra_metadata=None)[source]#
Split text into sentence-level chunks.
- Parameters:
- text : str
Raw document text.
- doc_id : str, optional
Document identifier stored in chunk metadata.
- extra_metadata : dict[str, Any], optional
Additional key/value pairs merged into the result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
- Return type:
ChunkResult
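The input validation and REGEX-backend splitting can be approximated in a few lines; the regex and helper name here are illustrative assumptions, not the library's actual implementation:

```python
import re


def split_sentences(text: str, min_length: int = 1) -> list[str]:
    """Rough sketch of chunk()'s contract: validate the input, split on
    sentence-ending punctuation followed by whitespace, and drop pieces
    shorter than min_length characters."""
    if not isinstance(text, str):
        raise TypeError("text must be str")
    if not text.strip():
        raise ValueError("text is empty or whitespace-only")
    # Split after ., !, or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if len(p) >= min_length]


print(split_sentences("Hello world. How are you? Fine thanks."))
# ['Hello world.', 'How are you?', 'Fine thanks.']
```

This mirrors the doctest above: the same three sentences come back, and empty or whitespace-only input raises ValueError rather than returning an empty result.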
- chunk_batch(texts, doc_ids=None, extra_metadata=None)[source]#
Chunk a list of documents.
- Parameters:
- texts : list[str]
Input documents.
- doc_ids : list[str], optional
Parallel document identifiers.
- extra_metadata : dict[str, Any], optional
Shared metadata merged into every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If doc_ids length does not match texts length.
- Return type:
list[ChunkResult]
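The batch-level validation contract (TypeError for a non-list, ValueError for mismatched lengths) can be sketched as follows; the function and result shape are illustrative, not the real ChunkResult:

```python
def chunk_batch(texts, doc_ids=None):
    """Sketch of chunk_batch()'s validation: texts must be a list, and
    doc_ids, when given, must be parallel to texts."""
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids length must match texts length")
    if doc_ids is None:
        doc_ids = [None] * len(texts)
    # One result per document, pairing each text with its identifier.
    return [{"doc_id": d, "text": t} for t, d in zip(texts, doc_ids)]


results = chunk_batch(["First doc.", "Second doc."], doc_ids=["a", "b"])
print([r["doc_id"] for r in results])  # ['a', 'b']
```

When doc_ids is omitted, every result simply carries a None identifier, so the output length always equals len(texts).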
- property config: SentenceChunkerConfig#
The resolved SentenceChunkerConfig for this instance.
Gallery examples#
corpus Knowledge and Information local .png with examples
corpus WHO European Region YouTube shorts with examples
corpus WHO European Region local .zip with examples