SentenceChunkerConfig
- class scikitplot.corpus.SentenceChunkerConfig(backend=SentenceBackend.REGEX, min_length=1, overlap=0, spacy_model=None, nltk_language='english', strip_whitespace=True, include_offsets=True, custom_splitter=None, script_hint=None)
Configuration for SentenceChunker.
- Parameters:
- backend : SentenceBackend
Splitting strategy. REGEX has no extra dependencies. NLTK requires the punkt model. SPACY requires a loaded model name via spacy_model.
- min_length : int
Minimum character length for a sentence to be kept.
- overlap : int
Number of preceding sentences to prepend as context.
- spacy_model : str or None
spaCy model name, e.g. "en_core_web_sm". Required when backend is SPACY.
- nltk_language : str or list[str] or None
Language(s) forwarded to nltk.tokenize.sent_tokenize. Accepts ISO 639-1 codes, NLTK names, lists, or None (auto-detect from text). See the nltk_language field docstring for full details.
- strip_whitespace : bool
Strip leading/trailing whitespace from each sentence.
- include_offsets : bool
Compute character offsets (start_char, end_char).
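To make the interaction of min_length, overlap, strip_whitespace, and the character offsets concrete, here is a minimal standalone sketch. It is not scikitplot's implementation: the function name chunk_sentences, the naive Latin-only regex, and the "context" key are all assumptions made for illustration. It only mirrors the behaviour the parameter descriptions above suggest.

```python
import re

# Naive Latin-only boundary: whitespace that follows '.', '!' or '?'.
# This is an illustrative pattern, not the library's regex.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(text, min_length=1, overlap=0, strip_whitespace=True):
    """Hypothetical stand-in that mimics the documented parameters."""
    # 1. Find raw (start, end) spans between sentence boundaries.
    spans = []
    pos = 0
    for match in _SENTENCE_END.finditer(text):
        spans.append((pos, match.start()))
        pos = match.end()
    if pos < len(text):
        spans.append((pos, len(text)))

    # 2. Optionally strip whitespace (adjusting offsets), then drop
    #    sentences shorter than min_length characters.
    kept = []
    for start, end in spans:
        raw = text[start:end]
        if strip_whitespace:
            start += len(raw) - len(raw.lstrip())
            raw = raw.strip()
            end = start + len(raw)
        if len(raw) >= min_length:
            kept.append({"text": raw, "start_char": start, "end_char": end})

    # 3. Attach up to `overlap` preceding kept sentences as context.
    chunks = []
    for i, item in enumerate(kept):
        context = [s["text"] for s in kept[max(0, i - overlap):i]]
        chunks.append({**item, "context": context})
    return chunks


sample = "Hi. This is a test. Short? Yes!"
for chunk in chunk_sentences(sample, min_length=4, overlap=1):
    print(chunk)
```

In this sketch the offsets always refer to the sentence itself, not to the prepended context, so text[start_char:end_char] reproduces the sentence. Whether the library makes the same choice is not stated in this page.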
- backend: SentenceBackend = 'regex'
- custom_splitter: Any = None
User-supplied splitter for backend=SentenceBackend.CUSTOM.
Accepts any object with a split(text: str) -> list[str] method (SentenceSplitterProtocol) or a plain callable, which is auto-wrapped in FunctionSentenceSplitter.
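The two accepted shapes for custom_splitter can be sketched as follows. Everything here is a local stand-in: SplitterLike, CallableAdapter, and as_splitter are hypothetical names that imitate what the text calls SentenceSplitterProtocol and FunctionSentenceSplitter, and nothing is imported from scikitplot.

```python
from typing import Callable, List, Protocol


class SplitterLike(Protocol):
    """Local stand-in for the shape described as SentenceSplitterProtocol."""

    def split(self, text: str) -> List[str]: ...


class SemicolonSplitter:
    """Shape 1: an object exposing split(text) -> list[str]."""

    def split(self, text: str) -> List[str]:
        return [part.strip() for part in text.split(";") if part.strip()]


def split_on_newlines(text: str) -> List[str]:
    """Shape 2: a plain callable taking text and returning sentences."""
    return [line.strip() for line in text.splitlines() if line.strip()]


class CallableAdapter:
    """Hypothetical analogue of wrapping a callable in a splitter object."""

    def __init__(self, func: Callable[[str], List[str]]) -> None:
        self._func = func

    def split(self, text: str) -> List[str]:
        return list(self._func(text))


def as_splitter(obj) -> SplitterLike:
    """Normalise either accepted shape to an object with .split()."""
    if callable(getattr(obj, "split", None)):
        return obj
    if callable(obj):
        return CallableAdapter(obj)
    raise TypeError("expected an object with .split() or a callable")


print(as_splitter(SemicolonSplitter()).split("first; second;; third"))
print(as_splitter(split_on_newlines).split("line one\n\nline two "))
```

Going by the class signature above, either SemicolonSplitter() or split_on_newlines would then be passed as custom_splitter together with backend=SentenceBackend.CUSTOM.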
- nltk_language: Any = 'english'
Language(s) for the NLTK Punkt sentence tokenizer.
Accepts:
  - "english": NLTK language name (backward-compatible default)
  - "en": ISO 639-1 two-letter code, resolved automatically
  - ["en", "de"]: multi-language; the first NLTK-supported language is used
  - None: auto-detect from text via detect_script
When a list is provided, the first NLTK-compatible language in the list is used (NLTK's Punkt tokenizer handles one language per call). For documents with mixed languages, prefer backend=SentenceBackend.REGEX with script_hint=None (auto-detect) or SentenceBackend.CUSTOM with a language-aware splitter.
Supports 200+ languages via _language_data. ISO codes, NLTK names, and regional aliases (e.g. "chilean_spanish") all resolve.
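The resolution rules for nltk_language can be illustrated with a small standalone sketch. The lookup table below is a tiny hand-written sample, not the library's _language_data, and resolve_nltk_language is a hypothetical helper. Raising ValueError when nothing resolves is also an assumption, since this page does not say what the library does in that case.

```python
# Tiny illustrative table; the library's own data reportedly covers 200+ languages.
ISO_TO_NLTK = {"en": "english", "de": "german", "fr": "french", "es": "spanish"}
NLTK_NAMES = set(ISO_TO_NLTK.values())


def resolve_nltk_language(value):
    """Hypothetical helper mirroring the documented resolution order."""
    if value is None:
        # The docs say None triggers auto-detection from the text;
        # this sketch simply passes the None through to the caller.
        return None
    candidates = [value] if isinstance(value, str) else list(value)
    for candidate in candidates:
        key = candidate.strip().lower()
        if key in NLTK_NAMES:      # already an NLTK language name
            return key
        if key in ISO_TO_NLTK:     # ISO 639-1 code
            return ISO_TO_NLTK[key]
    # Assumption: fail loudly when no candidate is supported.
    raise ValueError(f"no NLTK-compatible language in {value!r}")


print(resolve_nltk_language("english"))
print(resolve_nltk_language("en"))
print(resolve_nltk_language(["xx", "de", "en"]))
```

With a list, only the first resolvable entry is used, which is why the text recommends a different backend for documents that really do mix languages.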
- script_hint: str | None = None
Optional Unicode script hint for the REGEX backend.
When set to "multi" (or any non-None value), the REGEX backend uses MULTI_SCRIPT_SENTENCE_RE_PATTERN, which covers CJK (。!?), Arabic (؟), Devanagari (।), Ethiopic (።), and Latin terminators. When None (default), the legacy Latin-only regex is used.
Valid values: None (Latin), "multi" (all scripts), or any ScriptType value string.
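As a rough picture of what a multi-script sentence pattern involves, the sketch below splits on the terminators listed above using only the standard re module. The pattern is written for this example and is not MULTI_SCRIPT_SENTENCE_RE_PATTERN, and split_multi_script is a hypothetical name. The library's actual pattern may handle quotes, abbreviations, and other edge cases differently.

```python
import re

# Illustrative pattern only (requires Python 3.7+ for splitting on empty matches).
# Latin and Arabic terminators must be followed by whitespace, so "3.14" stays intact.
# CJK, Devanagari and Ethiopic terminators may be followed by no whitespace at all.
MULTI_SCRIPT_SKETCH = re.compile(
    r"(?<=[.!?\u061f])\s+"                        # . ! ? and Arabic question mark
    r"|(?<=[\u3002\uff01\uff1f\u0964\u1362])\s*"  # CJK stops, danda, Ethiopic full stop
)


def split_multi_script(text):
    """Split text after any of the terminators above, dropping empty pieces."""
    return [part for part in MULTI_SCRIPT_SKETCH.split(text) if part]


print(split_multi_script("你好\u3002今天天气好吗\uff1fFine. OK!"))
print(split_multi_script("Version 3.14 is out. Done."))
```

The first call yields four sentences even though the CJK sentences have no spaces between them, a case where a Latin-only pattern that requires trailing whitespace would find only one boundary.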
Gallery examples
corpus Knowledge and Information local .png with examples
corpus WHO European Region YouTube shorts with examples
corpus WHO European Region local .zip with examples