SentenceChunkerConfig

class scikitplot.corpus.SentenceChunkerConfig(backend=SentenceBackend.REGEX, min_length=1, overlap=0, spacy_model=None, nltk_language='english', strip_whitespace=True, include_offsets=True, custom_splitter=None, script_hint=None, multilang_config=None)

Configuration for SentenceChunker.

Parameters:

backend (SentenceBackend): Splitting strategy. REGEX has no extra dependencies. NLTK requires the punkt model. SPACY requires a model name passed via spacy_model. CUSTOM uses the splitter passed via custom_splitter.
min_length (int): Minimum character length for a sentence to be kept.
overlap (int): Number of preceding sentences to prepend as context.
spacy_model (str | None): spaCy model name, e.g. "en_core_web_sm". Required when backend is SPACY.
nltk_language (str | list[str] | None): Language(s) forwarded to nltk.tokenize.sent_tokenize. Accepts ISO 639-1 codes, NLTK names, lists, or None (auto-detect from text). See the nltk_language field docstring for full details.
strip_whitespace (bool): Strip leading/trailing whitespace from each sentence.
include_offsets (bool): Compute character offsets (start_char, end_char).
custom_splitter (SentenceSplitterProtocol | callable[[str], list[str]] | None): User-supplied splitter, used when backend is CUSTOM.
script_hint (str | None): Optional Unicode script hint for the REGEX backend.
multilang_config (MultilangConfig | None): Multilang feature flags.
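
The interaction of min_length, overlap, strip_whitespace, and include_offsets can be sketched with a small self-contained function. This is an illustration under stated assumptions, not the library's implementation: split_sentences and its regex are hypothetical stand-ins, min_length is applied after stripping, and the offsets describe the sentence itself rather than the prepended context.

```python
import re

# Illustrative Latin-only rule: break after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, min_length=1, overlap=0,
                    strip_whitespace=True, include_offsets=True):
    """Hypothetical stand-in for sentence chunking; not the library's code."""
    # Collect (start, end) spans between split points.
    spans, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(text)))

    sentences = []
    for begin, end in spans:
        raw = text[begin:end]
        if strip_whitespace:
            stripped = raw.strip()
            # Shift the offsets so they still point at the stripped text.
            begin += len(raw) - len(raw.lstrip())
            end = begin + len(stripped)
            raw = stripped
        # Assumption: the length filter runs after stripping.
        if len(raw) >= min_length:
            sentences.append((raw, begin, end))

    chunks = []
    for i, (sentence, begin, end) in enumerate(sentences):
        # Prepend up to `overlap` preceding kept sentences as context.
        context = [s for s, _, _ in sentences[max(0, i - overlap):i]]
        chunk = {"text": " ".join(context + [sentence])}
        if include_offsets:
            chunk["start_char"], chunk["end_char"] = begin, end
        chunks.append(chunk)
    return chunks


chunks = split_sentences("Hi. This is a test.  Is it working?",
                         min_length=4, overlap=1)
```

Here "Hi." is dropped by min_length=4, and the second chunk carries the previous sentence as context while its offsets still refer to "Is it working?" alone.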

backend: SentenceBackend = 'regex'

custom_splitter: SentenceSplitterProtocol | callable[[str], list[str]] | None = None

User-supplied splitter for backend=SentenceBackend.CUSTOM.

Accepts any object with a split(text: str) -> list[str] method (SentenceSplitterProtocol) or a plain callable, which is auto-wrapped in FunctionSentenceSplitter.
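
The two accepted forms can be illustrated without importing the library. In the sketch below, SplitterLike, FunctionSplitter, and as_splitter are hypothetical stand-ins that mimic the described behaviour of SentenceSplitterProtocol and FunctionSentenceSplitter. The real classes may differ in detail.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class SplitterLike(Protocol):
    """Hypothetical stand-in for SentenceSplitterProtocol."""

    def split(self, text: str) -> list[str]: ...


class FunctionSplitter:
    """Hypothetical stand-in for FunctionSentenceSplitter: wraps a callable."""

    def __init__(self, func: Callable[[str], list[str]]) -> None:
        self._func = func

    def split(self, text: str) -> list[str]:
        return self._func(text)


def as_splitter(obj):
    """Return an object with .split(text), wrapping bare callables."""
    if isinstance(obj, SplitterLike):
        return obj
    if callable(obj):
        return FunctionSplitter(obj)
    raise TypeError("custom_splitter must define split(text) or be callable")


class LineSplitter:
    """Object form: one sentence per non-empty line."""

    def split(self, text: str) -> list[str]:
        return [line for line in text.splitlines() if line.strip()]


def split_on_semicolons(text: str) -> list[str]:
    """Callable form: split on semicolons and trim each part."""
    return [part.strip() for part in text.split(";") if part.strip()]


from_object = as_splitter(LineSplitter()).split("one\n\ntwo")
from_callable = as_splitter(split_on_semicolons).split("a; b ;c")
```

Either form would presumably be passed as custom_splitter together with backend=SentenceBackend.CUSTOM.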

include_offsets: bool = True

min_length: int = 1

multilang_config: MultilangConfig | None = None

Multilang feature flags (MultilangConfig or None).

When set, each sentence chunk is enriched with a chunk.metadata["multilang"] dict containing script detection, grapheme counts, semanteme analysis, preprocessing trace, raw text, and timing provenance fields.

nltk_language: str | list[str] | None = 'english'

Language(s) for the NLTK Punkt sentence tokenizer.

Accepts:

"english" — NLTK language name (backward-compatible default)
"en" — ISO 639-1 two-letter code, resolved automatically
["en", "de"] — multi-language: first NLTK-supported language used
None — auto-detect from text via detect_script

When a list is provided, the first NLTK-compatible language in the list is used, because NLTK’s Punkt tokenizer handles one language per call. For documents with mixed languages, prefer backend=SentenceBackend.REGEX with script_hint="multi" (the multi-script pattern) or SentenceBackend.CUSTOM with a language-aware splitter.

Supports 200+ languages via _language_data. ISO codes, NLTK names, and regional aliases (e.g. "chilean_spanish") all resolve.
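
The resolution rules above (names, ISO codes, aliases, lists where the first supported entry wins) can be sketched as follows. The tables and resolve_language are hypothetical and deliberately tiny. The library's _language_data is much larger, and its behaviour for None (auto-detection) and for unsupported input is only approximated here by a fallback and a ValueError.

```python
# Tiny illustrative tables; the library's own _language_data is far larger.
ISO_TO_NAME = {"en": "english", "de": "german", "es": "spanish"}
ALIASES = {"chilean_spanish": "spanish"}
SUPPORTED = {"english", "german", "spanish"}


def resolve_language(value, fallback="english"):
    """Hypothetical resolver mirroring the accepted forms listed above."""
    if value is None:
        # The library auto-detects from text here; this sketch just falls back.
        return fallback
    candidates = [value] if isinstance(value, str) else list(value)
    for candidate in candidates:
        key = candidate.strip().lower()
        name = ISO_TO_NAME.get(key) or ALIASES.get(key) or key
        if name in SUPPORTED:
            return name  # the first supported entry wins
    # Assumption: unsupported input is an error in this sketch.
    raise ValueError(f"no supported language in {candidates!r}")


resolved = [
    resolve_language("english"),
    resolve_language("en"),
    resolve_language(["xx", "de", "en"]),
    resolve_language("chilean_spanish"),
]
```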

overlap: int = 0

script_hint: str | None = None

Optional Unicode script hint for the REGEX backend.

When set to "multi" (or any other non-None value), the REGEX backend uses MULTI_SCRIPT_SENTENCE_RE_PATTERN, which covers CJK (。！？), Arabic (؟), Devanagari (।), Ethiopic (።), and Latin terminators. When None (the default), the legacy Latin-only regex is used.

Valid values: None (Latin), "multi" (all scripts), or any ScriptType value string.
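
The practical difference between the two modes shows up on text that has no Latin terminators. The patterns below are illustrative assumptions built from the terminator list above. They are not the library's MULTI_SCRIPT_SENTENCE_RE_PATTERN, and split_with_hint is a hypothetical helper.

```python
import re

# Illustrative patterns only; the library's actual regexes may differ.
LATIN_ONLY = re.compile(r"(?<=[.!?])\s+")
# Non-Latin terminators are often not followed by a space, so allow \s*.
MULTI_SCRIPT = re.compile(r"(?<=[.!?])\s+|(?<=[。！？؟।።])\s*")


def split_with_hint(text, script_hint=None):
    """Pick the pattern the way the description above suggests."""
    pattern = LATIN_ONLY if script_hint is None else MULTI_SCRIPT
    # Splitting on empty matches requires Python 3.7 or later.
    return [part for part in pattern.split(text) if part]


japanese = "今日は晴れです。明日は雨です。"
latin_result = split_with_hint(japanese)
multi_result = split_with_hint(japanese, script_hint="multi")
mixed_result = split_with_hint("Hello world. 你好。再见！", script_hint="multi")
```

With the Latin-only pattern the Japanese text comes back as a single piece, while the multi-script pattern separates it at each 。.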

spacy_model: str | None = None

strip_whitespace: bool = True

Gallery examples

corpus A Tale of Two Cities .mp3 with examples

corpus Knowledge and Information local .png with examples

corpus WHO European Region YouTube shorts with examples

corpus WHO European Region local .zip with examples