SentenceChunkerConfig

class scikitplot.corpus.SentenceChunkerConfig(backend=SentenceBackend.REGEX, min_length=1, overlap=0, spacy_model=None, nltk_language='english', strip_whitespace=True, include_offsets=True, custom_splitter=None, script_hint=None)

Configuration for SentenceChunker.

Parameters:
backend : SentenceBackend

Splitting strategy. REGEX has no extra dependencies. NLTK requires the punkt model. SPACY requires a model name supplied via spacy_model.

min_length : int

Minimum character length for a sentence to be kept.

overlap : int

Number of preceding sentences to prepend as context.

spacy_model : str or None

spaCy model name, e.g. "en_core_web_sm". Required when backend is SPACY.

nltk_language : str or list[str] or None

Language(s) forwarded to nltk.tokenize.sent_tokenize. Accepts ISO 639-1 codes, NLTK names, lists, or None (auto-detect from text). See nltk_language field docstring for full details.

strip_whitespace : bool

Strip leading/trailing whitespace from each sentence.

include_offsets : bool

Compute character offsets (start_char, end_char).
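A minimal construction sketch, assuming these names import from scikitplot.corpus and that SentenceChunker accepts the config as its constructor argument (the chunker's own signature is not shown on this page):

```python
from scikitplot.corpus import SentenceBackend, SentenceChunker, SentenceChunkerConfig

# Regex splitting with one sentence of leading context and offsets enabled.
config = SentenceChunkerConfig(
    backend=SentenceBackend.REGEX,
    min_length=5,        # drop fragments shorter than 5 characters
    overlap=1,           # prepend the previous sentence as context
    include_offsets=True,
)
chunker = SentenceChunker(config)
```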

Attributes:
backend: SentenceBackend = 'regex'
custom_splitter: Any = None

User-supplied splitter for backend=SentenceBackend.CUSTOM.

Accepts any object with a split(text: str) -> list[str] method (SentenceSplitterProtocol) or a plain callable, which is auto-wrapped in FunctionSentenceSplitter.
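A minimal sketch of a plain-callable splitter (the semicolon delimiter is purely illustrative; the commented lines show where it would plug into the config):

```python
import re

def split_on_semicolons(text: str) -> list[str]:
    # Any callable taking a string and returning a list of strings
    # qualifies as custom_splitter; the chunker wraps it in
    # FunctionSentenceSplitter automatically.
    return [part.strip() for part in re.split(r";\s*", text) if part.strip()]

# config = SentenceChunkerConfig(
#     backend=SentenceBackend.CUSTOM,
#     custom_splitter=split_on_semicolons,
# )
```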

include_offsets: bool = True
min_length: int = 1
nltk_language: Any = 'english'

Language(s) for the NLTK Punkt sentence tokenizer.

Accepts:

  • "english" — NLTK language name (backward-compatible default)

  • "en" — ISO 639-1 two-letter code, resolved automatically

  • ["en", "de"] — multi-language: first NLTK-supported language used

  • None — auto-detect from text via detect_script

When a list is provided, the first NLTK-compatible language in the list is used (NLTK’s Punkt tokenizer handles one language per call). For documents with mixed languages, prefer backend=SentenceBackend.REGEX with script_hint=None (auto-detect) or SentenceBackend.CUSTOM with a language-aware splitter.

Supports 200+ languages via _language_data. ISO codes, NLTK names, and regional aliases (e.g. "chilean_spanish") all resolve.
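The accepted forms can be sketched as follows (assuming SentenceBackend and SentenceChunkerConfig import from scikitplot.corpus; only the nltk_language argument varies):

```python
from scikitplot.corpus import SentenceBackend, SentenceChunkerConfig

# Equivalent ways to request German sentence splitting via NLTK:
by_name = SentenceChunkerConfig(backend=SentenceBackend.NLTK, nltk_language="german")
by_iso = SentenceChunkerConfig(backend=SentenceBackend.NLTK, nltk_language="de")

# With a list, the first NLTK-supported entry is used:
multi = SentenceChunkerConfig(backend=SentenceBackend.NLTK, nltk_language=["en", "de"])

# None requests auto-detection from the text:
auto = SentenceChunkerConfig(backend=SentenceBackend.NLTK, nltk_language=None)
```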

overlap: int = 0
script_hint: str | None = None

Optional Unicode script hint for the REGEX backend.

When set to "multi" (or any non-None value), the REGEX backend uses MULTI_SCRIPT_SENTENCE_RE_PATTERN, which covers CJK (。!?), Arabic (؟), Devanagari (।), Ethiopic (።), and Latin terminators. When None (default), the legacy Latin-only regex is used.

Valid values: None (Latin), "multi" (all scripts), or any ScriptType value string.
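To illustrate what a multi-script terminator set buys you, here is a simplified stand-in (this regex is hypothetical and much smaller than the library's MULTI_SCRIPT_SENTENCE_RE_PATTERN):

```python
import re

# Hypothetical, simplified terminator set: Latin, CJK, Arabic question
# mark, Devanagari danda, Ethiopic full stop. Split after a terminator,
# consuming any following whitespace.
TERMINATORS = re.compile(r"(?<=[.!?。！？؟।።])\s*")

def naive_multi_script_split(text: str) -> list[str]:
    return [s for s in TERMINATORS.split(text) if s]

naive_multi_script_split("你好。How are you? मैं ठीक हूँ।")
# → ['你好。', 'How are you?', 'मैं ठीक हूँ।']
```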

spacy_model: str | None = None
strip_whitespace: bool = True