WordChunkerConfig

class scikitplot.corpus.WordChunkerConfig(tokenizer=TokenizerBackend.SIMPLE, custom_tokenizer=None, stemmer=StemmingBackend.NONE, custom_stemmer=None, lemmatizer=LemmatizationBackend.NONE, custom_lemmatizer=None, stopwords=StopwordSource.BUILTIN, custom_stopwords=None, spacy_model=None, nltk_language='english', lowercase=True, remove_punctuation=True, strip_unicode_punctuation=False, remove_numbers=False, min_token_length=2, max_token_length=None, ngram_range=(1, 1), chunk_by='document', include_offsets=False, build_gensim_corpus=False)

Configuration for WordChunker.

Parameters:
tokenizer : TokenizerBackend

Word tokenisation strategy.

custom_tokenizer : TokenizerProtocol or Callable[[str], list[str]] or None

User-supplied tokenizer used when tokenizer=TokenizerBackend.CUSTOM. Accepts any object satisfying TokenizerProtocol or a plain callable; callables are auto-wrapped in FunctionTokenizer (see the example after this parameter list). Example libraries: MeCab, jieba, camel-tools, Stanza, HuggingFace.

stemmer : StemmingBackend

Stemming algorithm. Applied after lowercasing, before stopword removal. Mutually exclusive with lemmatizer (stemmer takes precedence when both are not NONE).

custom_stemmer : StemmerProtocol or Callable[[str], str] or None

User-supplied stemmer used when stemmer=StemmingBackend.CUSTOM.

lemmatizer : LemmatizationBackend

Lemmatization backend. Applied when stemmer is NONE.

custom_lemmatizer : LemmatizerProtocol or Callable or None

User-supplied lemmatizer used when lemmatizer=LemmatizationBackend.CUSTOM.

stopwords : StopwordSource

Source of stopword list used for filtering.

custom_stopwords : frozenset[str] or None

Additional stopwords merged with the source list. Lowercasing is applied before membership testing, so case does not matter.

spacy_model : str or None

spaCy model name. Required for SPACY tokenizer/lemmatizer.

nltk_language : str or list[str] or None

Language(s) for NLTK stemmers and stopwords (e.g. "english"). Also accepts ISO 639-1 codes, a list of languages, or None for auto-detection; see the nltk_language attribute below for details.

lowercase : bool

Convert all tokens to lowercase before processing.

remove_punctuation : bool

Strip ASCII punctuation-only tokens.

strip_unicode_punctuation : bool

Strip all Unicode punctuation from tokens (superset of remove_punctuation). Handles CJK punctuation (。!?), Arabic punctuation (،؟), and all other unicodedata P* category characters. When True, remove_punctuation is implicitly satisfied and need not be set separately.

remove_numbers : bool

Drop tokens that are purely numeric.

min_token_length : int

Drop tokens shorter than this (after normalisation).

max_token_length : int or None

Drop tokens longer than this. None disables the limit.

ngram_range : tuple[int, int]

Inclusive (min_n, max_n) n-gram range to extract alongside unigrams. (1, 1) disables n-gram extraction.

chunk_by : str

Granularity of output Chunk objects. "document" returns one chunk per input text. "sentence" splits on sentence boundaries first, then processes each sentence as a separate chunk.

include_offsets : bool

Store character offsets in each chunk.

build_gensim_corpus : bool

If True, attach a gensim-compatible (token_id, count) BoW representation to each chunk’s metadata (requires Gensim).
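The following is a minimal construction sketch. It assumes that WordChunkerConfig and the TokenizerBackend enum are importable from scikitplot.corpus (only the config class is documented on this page), and that Gensim is installed for the last option; the whitespace splitter is purely illustrative.

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    # Any plain callable with a str -> list[str] signature is accepted and
    # auto-wrapped in FunctionTokenizer (illustrative splitter only).
    def whitespace_tokenize(text: str) -> list[str]:
        return text.split()

    config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=whitespace_tokenize,
        ngram_range=(1, 2),          # extract bigrams alongside unigrams
        min_token_length=3,          # drop tokens shorter than 3 characters
        chunk_by="sentence",         # one chunk per sentence, not per document
        include_offsets=True,        # store character offsets in each chunk
        build_gensim_corpus=True,    # attach a [(token_id, count), ...] BoW per chunk
    )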

Notes

User note (multi-language): For CJK text, set tokenizer=TokenizerBackend.CUSTOM with a character-level or morpheme-level tokenizer (jieba, MeCab, kss). Set remove_punctuation=False, strip_unicode_punctuation=True to strip CJK punctuation without removing ideographs.
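A sketch of this CJK setup, assuming jieba is installed and that the TokenizerBackend enum is importable from scikitplot.corpus; jieba.lcut already has the required str -> list[str] signature, so it can be passed directly.

    import jieba  # third-party Chinese segmenter, as suggested above

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    cjk_config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=jieba.lcut,        # str -> list[str], auto-wrapped
        remove_punctuation=False,           # skip the ASCII-only check
        strip_unicode_punctuation=True,     # strips 。！？ etc., keeps ideographs
        min_token_length=1,                 # illustrative: the default of 2 drops single-character words
    )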

For Arabic / Ottoman / Persian, use tokenizer=TokenizerBackend.CUSTOM with camel-tools or Stanza. Set nltk_language="arabic" when using NLTK stopwords.
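A corresponding sketch for Arabic, assuming camel-tools is installed; simple_word_tokenize from camel_tools.tokenizers.word returns a list of str tokens and can therefore be passed directly as the custom tokenizer.

    from camel_tools.tokenizers.word import simple_word_tokenize

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    arabic_config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=simple_word_tokenize,  # str -> list[str], auto-wrapped
        nltk_language="arabic",                 # used when stopwords come from NLTK (see above)
        strip_unicode_punctuation=True,         # strips ،؟ and other Unicode punctuation
    )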

Developer note: Callable fields (custom_tokenizer, custom_stemmer, custom_lemmatizer) are excluded from __hash__ and __eq__ (hash=False, compare=False) so that two configs with identical settings but different callable objects are treated as equal for caching purposes. Compare callables explicitly when identity matters.
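A small sketch of the equality behaviour described above, assuming the TokenizerBackend enum is importable from scikitplot.corpus:

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    # Identical settings, but two distinct callable objects.
    a = WordChunkerConfig(tokenizer=TokenizerBackend.CUSTOM,
                          custom_tokenizer=lambda s: s.split())
    b = WordChunkerConfig(tokenizer=TokenizerBackend.CUSTOM,
                          custom_tokenizer=lambda s: s.split("-"))

    # Callable fields are excluded from __eq__ and __hash__, so the configs
    # compare equal and hash identically despite the different tokenizers.
    assert a == b
    assert hash(a) == hash(b)

    # Compare the callables explicitly when their identity matters.
    assert a.custom_tokenizer is not b.custom_tokenizer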

build_gensim_corpus: bool = False
chunk_by: str = 'document'
custom_lemmatizer: Any = None
custom_stemmer: Any = None
custom_stopwords: frozenset | None = None
custom_tokenizer: Any = None
include_offsets: bool = False
lemmatizer: LemmatizationBackend = 'none'
lowercase: bool = True
max_token_length: int | None = None
min_token_length: int = 2
ngram_range: tuple = (1, 1)
nltk_language: str | list[str] | None = 'english'

Language(s) for NLTK stopwords, Snowball stemmer, and NLTK tokenizer.

Accepts:

  • "en" or "english" — single language (backward-compatible)

  • ["en", "ar"] — multi-language: union stopwords for both

  • None — auto-detect from text using detect_script

All ISO 639-1 codes ("en", "ar", "hi", …) and NLTK names ("english", "arabic", …) are accepted. Regional aliases such as "chilean_spanish", "new_zealand_english", and "ottoman_turkish" are resolved automatically. Over 200 languages are supported via the internal _language_data module; see the example at the end of this page.

remove_numbers: bool = False
remove_punctuation: bool = True
spacy_model: str | None = None
stemmer: StemmingBackend = 'none'
stopwords: StopwordSource = 'builtin'
strip_unicode_punctuation: bool = False
tokenizer: TokenizerBackend = 'simple'
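
A sketch of the accepted nltk_language forms, using only the keyword documented on this page:

    from scikitplot.corpus import WordChunkerConfig

    # Single language, by NLTK name or ISO 639-1 code (resolved to the same language).
    en = WordChunkerConfig(nltk_language="english")
    en_iso = WordChunkerConfig(nltk_language="en")

    # Multi-language: stopwords for English and Arabic are unioned.
    bilingual = WordChunkerConfig(nltk_language=["en", "ar"])

    # None: the language is auto-detected from the text via detect_script.
    auto = WordChunkerConfig(nltk_language=None)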