WordChunkerConfig

class scikitplot.corpus.WordChunkerConfig(tokenizer=TokenizerBackend.SIMPLE, custom_tokenizer=None, stemmer=StemmingBackend.NONE, custom_stemmer=None, lemmatizer=LemmatizationBackend.NONE, custom_lemmatizer=None, stopwords=StopwordSource.BUILTIN, custom_stopwords=None, spacy_model=None, nltk_language='english', lowercase=True, remove_punctuation=True, strip_unicode_punctuation=False, remove_numbers=False, min_token_length=2, max_token_length=None, ngram_range=(1, 1), chunk_by='document', include_offsets=False, build_gensim_corpus=False)

Configuration for WordChunker.

Parameters:
tokenizer : TokenizerBackend

Word tokenisation strategy.

custom_tokenizer : TokenizerProtocol or Callable[[str], list[str]] or None

User-supplied tokenizer used when tokenizer=TokenizerBackend.CUSTOM. Accepts any object satisfying TokenizerProtocol or a plain callable; callables are auto-wrapped in FunctionTokenizer (see the example after this parameter list). Example libraries: MeCab, jieba, camel-tools, Stanza, HuggingFace.

stemmer : StemmingBackend

Stemming algorithm. Applied after lowercasing, before stopword removal. Mutually exclusive with lemmatizer (stemmer takes precedence when both are not NONE).

custom_stemmer : StemmerProtocol or Callable[[str], str] or None

User-supplied stemmer used when stemmer=StemmingBackend.CUSTOM.

lemmatizer : LemmatizationBackend

Lemmatization backend. Applied when stemmer is NONE.

custom_lemmatizer : LemmatizerProtocol or Callable or None

User-supplied lemmatizer used when lemmatizer=LemmatizationBackend.CUSTOM.

stopwords : StopwordSource

Source of stopword list used for filtering.

custom_stopwords : frozenset[str] or None

Additional stopwords merged with the source list. Lowercasing is applied before membership testing, so case does not matter.

spacy_model : str or None

spaCy model name. Required for SPACY tokenizer/lemmatizer.

nltk_language : str or list[str] or None

Language(s) for NLTK stemmers and stopwords (e.g. "english"). Also accepts ISO 639-1 codes, a list of languages, or None for auto-detection; see the nltk_language attribute below for details.

lowercase : bool

Convert all tokens to lowercase before processing.

remove_punctuation : bool

Strip ASCII punctuation-only tokens.

strip_unicode_punctuation : bool

Strip all Unicode punctuation from tokens (superset of remove_punctuation). Handles CJK punctuation (。!?), Arabic punctuation (،؟), and all other unicodedata P* category characters. When True, remove_punctuation is implicitly satisfied and need not be set separately.

remove_numbers : bool

Drop tokens that are purely numeric.

min_token_length : int

Drop tokens shorter than this (after normalisation).

max_token_length : int or None

Drop tokens longer than this. None disables the limit.

ngram_range : tuple[int, int]

Inclusive (min_n, max_n) n-gram range to extract alongside unigrams. (1, 1) disables n-gram extraction.

chunk_by : str

Granularity of output Chunk objects. "document" returns one chunk per input text. "sentence" splits on sentence boundaries first, then processes each sentence as a separate chunk.

include_offsets : bool

Store character offsets in each chunk.

build_gensim_corpus : bool

If True, attach a gensim-compatible (token_id, count) BoW representation to each chunk’s metadata (requires Gensim).
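The following is a minimal construction sketch. It assumes that WordChunkerConfig and the TokenizerBackend enum are importable from scikitplot.corpus (only the config class is documented on this page), and that Gensim is installed for the last option; the whitespace splitter is purely illustrative.

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    # Any plain callable with a str -> list[str] signature is accepted and
    # auto-wrapped in FunctionTokenizer (illustrative splitter only).
    def whitespace_tokenize(text: str) -> list[str]:
        return text.split()

    config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=whitespace_tokenize,
        ngram_range=(1, 2),          # extract bigrams alongside unigrams
        min_token_length=3,          # drop tokens shorter than 3 characters
        chunk_by="sentence",         # one chunk per sentence, not per document
        include_offsets=True,        # store character offsets in each chunk
        build_gensim_corpus=True,    # attach a [(token_id, count), ...] BoW per chunk
    )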

Notes

User note (multi-language): For CJK text, set tokenizer=TokenizerBackend.CUSTOM with a character-level or morpheme-level tokenizer (jieba, MeCab, kss). Set remove_punctuation=False, strip_unicode_punctuation=True to strip CJK punctuation without removing ideographs.
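A sketch of this CJK setup, assuming jieba is installed and that the TokenizerBackend enum is importable from scikitplot.corpus; jieba.lcut already has the required str -> list[str] signature, so it can be passed directly.

    import jieba  # third-party Chinese segmenter, as suggested above

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    cjk_config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=jieba.lcut,        # str -> list[str], auto-wrapped
        remove_punctuation=False,           # skip the ASCII-only check
        strip_unicode_punctuation=True,     # strips 。！？ etc., keeps ideographs
        min_token_length=1,                 # illustrative: the default of 2 drops single-character words
    )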

For Arabic / Ottoman / Persian, use tokenizer=TokenizerBackend.CUSTOM with camel-tools or Stanza. Set nltk_language="arabic" when using NLTK stopwords.
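A corresponding sketch for Arabic, assuming camel-tools is installed; simple_word_tokenize from camel_tools.tokenizers.word returns a list of str tokens and can therefore be passed directly as the custom tokenizer.

    from camel_tools.tokenizers.word import simple_word_tokenize

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    arabic_config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=simple_word_tokenize,  # str -> list[str], auto-wrapped
        nltk_language="arabic",                 # used when stopwords come from NLTK (see above)
        strip_unicode_punctuation=True,         # strips ،؟ and other Unicode punctuation
    )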

Developer note: Callable fields (custom_tokenizer, custom_stemmer, custom_lemmatizer) are excluded from __hash__ and __eq__ (hash=False, compare=False) so that two configs with identical settings but different callable objects are treated as equal for caching purposes. Compare callables explicitly when identity matters.
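A small sketch of the equality behaviour described above, assuming the TokenizerBackend enum is importable from scikitplot.corpus:

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    # Identical settings, but two distinct callable objects.
    a = WordChunkerConfig(tokenizer=TokenizerBackend.CUSTOM,
                          custom_tokenizer=lambda s: s.split())
    b = WordChunkerConfig(tokenizer=TokenizerBackend.CUSTOM,
                          custom_tokenizer=lambda s: s.split("-"))

    # Callable fields are excluded from __eq__ and __hash__, so the configs
    # compare equal and hash identically despite the different tokenizers.
    assert a == b
    assert hash(a) == hash(b)

    # Compare the callables explicitly when their identity matters.
    assert a.custom_tokenizer is not b.custom_tokenizer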

build_gensim_corpus: bool = False
chunk_by: str = 'document'
custom_lemmatizer: Any = None
custom_stemmer: Any = None
custom_stopwords: frozenset | None = None
custom_tokenizer: Any = None
include_offsets: bool = False
lemmatizer: LemmatizationBackend = 'none'
lowercase: bool = True
max_token_length: int | None = None
min_token_length: int = 2
ngram_range: tuple = (1, 1)
nltk_language: str | list[str] | None = 'english'

Language(s) for NLTK stopwords, Snowball stemmer, and NLTK tokenizer.

Accepts:

  • "en" or "english" — single language (backward-compatible)

  • ["en", "ar"] — multi-language: union stopwords for both

  • None — auto-detect from text using detect_script

All ISO 639-1 codes ("en", "ar", "hi", …) and NLTK names ("english", "arabic", …) are accepted. Regional aliases such as "chilean_spanish", "new_zealand_english", and "ottoman_turkish" are resolved automatically. Over 200 languages are supported via the internal _language_data module; see the example at the end of this page.

remove_numbers: bool = False
remove_punctuation: bool = True
spacy_model: str | None = None
stemmer: StemmingBackend = 'none'
stopwords: StopwordSource = 'builtin'
strip_unicode_punctuation: bool = False
tokenizer: TokenizerBackend = 'simple'
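
A sketch of the accepted nltk_language forms, using only the keyword documented on this page:

    from scikitplot.corpus import WordChunkerConfig

    # Single language, by NLTK name or ISO 639-1 code (resolved to the same language).
    en = WordChunkerConfig(nltk_language="english")
    en_iso = WordChunkerConfig(nltk_language="en")

    # Multi-language: stopwords for English and Arabic are unioned.
    bilingual = WordChunkerConfig(nltk_language=["en", "ar"])

    # None: the language is auto-detected from the text via detect_script.
    auto = WordChunkerConfig(nltk_language=None)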