WordChunkerConfig

class scikitplot.corpus.WordChunkerConfig(tokenizer=TokenizerBackend.SIMPLE, stemmer=StemmingBackend.NONE, lemmatizer=LemmatizationBackend.NONE, stopwords=StopwordSource.BUILTIN, spacy_model=None, nltk_language='english', lowercase=True, remove_punctuation=True, remove_numbers=False, min_token_length=2, max_token_length=None, ngram_range=(1, 1), chunk_by='document', include_offsets=False, build_gensim_corpus=False)

Configuration for WordChunker.

Parameters:
tokenizer : TokenizerBackend

Word tokenisation strategy.

stemmer : StemmingBackend

Stemming algorithm. Applied after lowercasing, before stopword removal. Mutually exclusive with lemmatizer (stemmer takes precedence when both are not NONE).

lemmatizer : LemmatizationBackend

Lemmatization backend. Applied only when stemmer is NONE.

stopwords : StopwordSource

Source of stopword list used for filtering.

spacy_model : str or None

spaCy model name. Required when the SPACY tokenizer or lemmatizer backend is selected.

nltk_language : str

Language for NLTK stemmers and stopwords (e.g. "english").

lowercase : bool

Convert all tokens to lowercase before processing.

remove_punctuation : bool

Strip punctuation-only tokens.

remove_numbers : bool

Drop tokens that are purely numeric.

min_token_length : int

Drop tokens shorter than this (after normalisation).

max_token_length : int or None

Drop tokens longer than this. None disables the limit.

ngram_range : tuple[int, int]

Inclusive (min_n, max_n) n-gram range to extract alongside unigrams. (1, 1) disables n-gram extraction.

chunk_by : str

Granularity of output Chunk objects. "document" returns one chunk per input text. "sentence" splits on sentence boundaries first, then processes each sentence as a separate chunk.

include_offsets : bool

Store character offsets in each chunk.

build_gensim_corpus : bool

If True, attach a gensim-compatible (token_id, count) BoW representation to each chunk’s metadata (requires Gensim).
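The token-filtering and n-gram options above can be sketched as follows. This is a minimal illustration of the documented semantics, not the library's implementation; the helper names and the exact order of checks are assumptions based on the parameter descriptions.

```python
import string


def filter_tokens(tokens, lowercase=True, remove_punctuation=True,
                  remove_numbers=False, min_token_length=2,
                  max_token_length=None):
    """Apply the boolean and length filters described above."""
    out = []
    for tok in tokens:
        if lowercase:
            tok = tok.lower()
        if remove_punctuation and all(c in string.punctuation for c in tok):
            continue  # drop punctuation-only tokens
        if remove_numbers and tok.isdigit():
            continue  # drop purely numeric tokens
        if len(tok) < min_token_length:
            continue  # shorter than min_token_length
        if max_token_length is not None and len(tok) > max_token_length:
            continue  # longer than max_token_length (None disables this)
        out.append(tok)
    return out


def extract_ngrams(tokens, ngram_range=(1, 1)):
    """Inclusive (min_n, max_n) n-grams; (1, 1) yields only unigrams."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams
```

For example, `filter_tokens(["The", "cat", ",", "42", "sat"], remove_numbers=True)` keeps `["the", "cat", "sat"]`, and `extract_ngrams(["the", "cat", "sat"], (1, 2))` produces the unigrams followed by the bigrams `"the cat"` and `"cat sat"`.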

Attributes:
build_gensim_corpus: bool = False
chunk_by: str = 'document'
include_offsets: bool = False
lemmatizer: LemmatizationBackend = 'none'
lowercase: bool = True
max_token_length: int | None = None
min_token_length: int = 2
ngram_range: tuple[int, int] = (1, 1)
nltk_language: str = 'english'
remove_numbers: bool = False
remove_punctuation: bool = True
spacy_model: str | None = None
stemmer: StemmingBackend = 'none'
stopwords: StopwordSource = 'builtin'
tokenizer: TokenizerBackend = 'simple'
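The (token_id, count) bag-of-words format that build_gensim_corpus attaches can be sketched without Gensim. In the real library the id mapping would come from a gensim `Dictionary`; the `id_map` dict here is a hypothetical stand-in for illustration only.

```python
from collections import Counter


def bow_representation(tokens, id_map):
    """Build a gensim-style bag of words: sorted (token_id, count) pairs.

    `id_map` plays the role of a gensim Dictionary's token2id mapping;
    unseen tokens are assigned fresh integer ids on first sight.
    """
    for tok in tokens:
        if tok not in id_map:
            id_map[tok] = len(id_map)
    counts = Counter(id_map[tok] for tok in tokens)
    return sorted(counts.items())


vocab = {}
bow = bow_representation(["cat", "sat", "cat"], vocab)
# bow is [(0, 2), (1, 1)]: token id 0 ("cat") occurs twice, id 1 ("sat") once
```

Sharing one `id_map` across chunks keeps the token ids consistent corpus-wide, which is what downstream Gensim models (e.g. LDA) expect.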