WordChunkerConfig#
- class scikitplot.corpus.WordChunkerConfig(tokenizer=TokenizerBackend.SIMPLE, custom_tokenizer=None, stemmer=StemmingBackend.NONE, custom_stemmer=None, lemmatizer=LemmatizationBackend.NONE, custom_lemmatizer=None, stopwords=StopwordSource.BUILTIN, custom_stopwords=None, spacy_model=None, nltk_language='english', lowercase=True, remove_punctuation=True, strip_unicode_punctuation=False, remove_numbers=False, min_token_length=2, max_token_length=None, ngram_range=(1, 1), chunk_by='document', include_offsets=False, build_gensim_corpus=False)[source]#
Configuration for WordChunker.

- Parameters:
- tokenizer : TokenizerBackend
Word tokenisation strategy.
- custom_tokenizer : TokenizerProtocol or Callable[[str], list[str]] or None
User-supplied tokenizer used when tokenizer=TokenizerBackend.CUSTOM. Accepts any object satisfying TokenizerProtocol or a plain callable. Callables are auto-wrapped in FunctionTokenizer. Example libraries: MeCab, jieba, camel-tools, Stanza, HuggingFace.
- stemmer : StemmingBackend
Stemming algorithm. Applied after lowercasing, before stopword removal. Mutually exclusive with lemmatizer (stemmer takes precedence when both are not NONE).
- custom_stemmer : StemmerProtocol or Callable[[str], str] or None
User-supplied stemmer used when stemmer=StemmingBackend.CUSTOM.
- lemmatizer : LemmatizationBackend
Lemmatization backend. Applied only when stemmer is NONE.
- custom_lemmatizer : LemmatizerProtocol or Callable or None
User-supplied lemmatizer used when lemmatizer=LemmatizationBackend.CUSTOM.
- stopwords : StopwordSource
Source of the stopword list used for filtering.
- custom_stopwords : frozenset[str] or None
Additional stopwords merged with the source list. Lowercasing is applied before membership testing, so case does not matter.
- spacy_model : str or None
spaCy model name. Required for the SPACY tokenizer/lemmatizer.
- nltk_language : str or list[str] or None
Language(s) for NLTK stemmers and stopwords (e.g. "english"). See the nltk_language attribute below for the accepted forms.
- lowercase : bool
Convert all tokens to lowercase before processing.
- remove_punctuation : bool
Strip ASCII punctuation-only tokens.
- strip_unicode_punctuation : bool
Strip all Unicode punctuation from tokens (a superset of remove_punctuation). Handles CJK punctuation (。！？), Arabic punctuation (،؟), and all other unicodedata P* category characters. When True, remove_punctuation is implicitly satisfied and need not be set separately.
- remove_numbers : bool
Drop tokens that are purely numeric.
- min_token_length : int
Drop tokens shorter than this (after normalisation).
- max_token_length : int or None
Drop tokens longer than this. None disables the limit.
- ngram_range : tuple[int, int]
Inclusive (min_n, max_n) n-gram range to extract alongside unigrams. (1, 1) disables n-gram extraction.
- chunk_by : str
Granularity of the output Chunk objects. "document" returns one chunk per input text; "sentence" splits on sentence boundaries first, then processes each sentence as a separate chunk.
- include_offsets : bool
Store character offsets in each chunk.
- build_gensim_corpus : bool
If True, attach a gensim-compatible (token_id, count) bag-of-words representation to each chunk’s metadata (requires Gensim).
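For illustration, a minimal construction sketch built only from the parameters documented above (defaults apply to everything not shown):

    from scikitplot.corpus import WordChunkerConfig

    # Sentence-level chunks with bigrams and character offsets; purely
    # numeric tokens are dropped and tokens shorter than 3 characters
    # are filtered out after normalisation.
    config = WordChunkerConfig(
        remove_numbers=True,
        min_token_length=3,
        ngram_range=(1, 2),    # unigrams plus bigrams
        chunk_by="sentence",   # one Chunk per sentence rather than per document
        include_offsets=True,  # keep character offsets in each Chunk
    )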
Notes
User note (multi-language): For CJK text, set tokenizer=TokenizerBackend.CUSTOM with a character-level or morpheme-level tokenizer (jieba, MeCab, kss). Set remove_punctuation=False, strip_unicode_punctuation=True to strip CJK punctuation without removing ideographs, as sketched below.
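A sketch of that CJK setup, assuming jieba is installed (jieba.cut returns an iterator of segments; the plain callable is auto-wrapped in FunctionTokenizer):

    import jieba  # third-party Chinese segmenter (assumed installed)

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    config = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=lambda text: list(jieba.cut(text)),
        remove_punctuation=False,        # leave ASCII-only filtering off ...
        strip_unicode_punctuation=True,  # ... and strip 。！？ and friends instead
    )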
For Arabic / Ottoman / Persian, use tokenizer=TokenizerBackend.CUSTOM with camel-tools or Stanza. Set nltk_language="arabic" when using NLTK stopwords.
Developer note: Callable fields (custom_tokenizer, custom_stemmer, custom_lemmatizer) are excluded from __hash__ and __eq__ (hash=False, compare=False), so two configs with identical settings but different callable objects are treated as equal for caching purposes. Compare callables explicitly when identity matters.
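A small sketch of that equality behaviour (assuming the config is hashable, as the excluded-from-__hash__ wording implies):

    from scikitplot.corpus import TokenizerBackend, WordChunkerConfig

    cfg_a = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=str.split,                     # whitespace tokenizer
    )
    cfg_b = WordChunkerConfig(
        tokenizer=TokenizerBackend.CUSTOM,
        custom_tokenizer=lambda text: text.split("|"),  # different callable
    )

    # hash=False, compare=False on the callable fields: the two configs are
    # interchangeable as cache keys even though the tokenizers differ.
    assert cfg_a == cfg_b

    # Compare callables explicitly when identity matters:
    assert cfg_a.custom_tokenizer is not cfg_b.custom_tokenizer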
- lemmatizer: LemmatizationBackend = 'none'[source]#
- nltk_language: str | list[str] | None = 'english'#
Language(s) for NLTK stopwords, Snowball stemmer, and NLTK tokenizer.
Accepts:
- "en" or "english" — single language (backward-compatible)
- ["en", "ar"] — multi-language: stopwords for both are unioned
- None — auto-detect from the text using detect_script
All ISO 639-1 codes ("en", "ar", "hi", …) and NLTK names ("english", "arabic", …) are accepted. Regional aliases such as "chilean_spanish", "new_zealand_english", and "ottoman_turkish" are resolved automatically. 200+ languages are supported via _language_data.
- stemmer: StemmingBackend = 'none'[source]#
- stopwords: StopwordSource = 'builtin'[source]#
- tokenizer: TokenizerBackend = 'simple'[source]#
Gallery examples#
- corpus Knowledge and Information