WordChunkerConfig
- class scikitplot.corpus.WordChunkerConfig(tokenizer=TokenizerBackend.SIMPLE, stemmer=StemmingBackend.NONE, lemmatizer=LemmatizationBackend.NONE, stopwords=StopwordSource.BUILTIN, spacy_model=None, nltk_language='english', lowercase=True, remove_punctuation=True, remove_numbers=False, min_token_length=2, max_token_length=None, ngram_range=(1, 1), chunk_by='document', include_offsets=False, build_gensim_corpus=False)
Configuration for WordChunker.

- Parameters:
  - tokenizer : TokenizerBackend
    Word tokenisation strategy.
  - stemmer : StemmingBackend
    Stemming algorithm. Applied after lowercasing, before stopword removal. Mutually exclusive with lemmatizer (stemmer takes precedence when both are not NONE).
  - lemmatizer : LemmatizationBackend
    Lemmatization backend. Applied when stemmer is NONE.
  - stopwords : StopwordSource
    Source of the stopword list used for filtering.
  - spacy_model : str or None
    spaCy model name. Required for the SPACY tokenizer/lemmatizer.
  - nltk_language : str
    Language for NLTK stemmers and stopwords (e.g. "english").
  - lowercase : bool
    Convert all tokens to lowercase before processing.
  - remove_punctuation : bool
    Strip punctuation-only tokens.
  - remove_numbers : bool
    Drop tokens that are purely numeric.
  - min_token_length : int
    Drop tokens shorter than this (after normalisation).
  - max_token_length : int or None
    Drop tokens longer than this. None disables the limit.
  - ngram_range : tuple[int, int]
    Inclusive (min_n, max_n) n-gram range to extract alongside unigrams. (1, 1) disables n-gram extraction.
  - chunk_by : str
    Granularity of output Chunk objects. "document" returns one chunk per input text; "sentence" splits on sentence boundaries first, then processes each sentence as a separate chunk.
  - include_offsets : bool
    Store character offsets in each chunk.
  - build_gensim_corpus : bool
    If True, attach a gensim-compatible (token_id, count) bag-of-words representation to each chunk's metadata (requires Gensim).
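The stemmer/lemmatizer precedence rule above can be illustrated with a minimal, self-contained sketch. The dataclass and enum members below are stand-ins written for this example, not the library's actual classes:

```python
from dataclasses import dataclass
from enum import Enum


class StemmingBackend(Enum):
    NONE = "none"
    PORTER = "porter"  # hypothetical member, for illustration only


class LemmatizationBackend(Enum):
    NONE = "none"
    SPACY = "spacy"  # hypothetical member, for illustration only


@dataclass
class ConfigSketch:
    """Stand-in mirroring WordChunkerConfig's normaliser precedence."""
    stemmer: StemmingBackend = StemmingBackend.NONE
    lemmatizer: LemmatizationBackend = LemmatizationBackend.NONE

    def active_normalizer(self) -> str:
        # Stemmer takes precedence when both are set to non-NONE values.
        if self.stemmer is not StemmingBackend.NONE:
            return "stemmer"
        if self.lemmatizer is not LemmatizationBackend.NONE:
            return "lemmatizer"
        return "none"


cfg = ConfigSketch(stemmer=StemmingBackend.PORTER,
                   lemmatizer=LemmatizationBackend.SPACY)
print(cfg.active_normalizer())  # prints "stemmer": the stemmer wins
```

In practice this means that to use lemmatization you must leave stemmer at its NONE default.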
- lemmatizer: LemmatizationBackend = 'none'
- stemmer: StemmingBackend = 'none'
- stopwords: StopwordSource = 'builtin'
- tokenizer: TokenizerBackend = 'simple'
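The ngram_range behaviour described above can be sketched with a hypothetical helper (extract_ngrams is not part of the library; joining n-gram tokens with a space is an assumption for this example):

```python
def extract_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams for the inclusive (min_n, max_n) range.

    With the default (1, 1) only unigrams come back, i.e. n-gram
    extraction is effectively disabled.
    """
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


extract_ngrams(["quick", "brown", "fox"], (1, 2))
# → ['quick', 'brown', 'fox', 'quick brown', 'brown fox']
```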