EnricherConfig

class scikitplot.corpus.EnricherConfig(tokenizer='simple', spacy_model='en_core_web_sm', lemmatizer=None, stemmer=None, stemmer_language='english', keyword_extractor='frequency', max_keywords=20, lowercase_tokens=True, remove_stopwords=True, min_token_length=2, remove_punctuation=True)

Configuration for NLPEnricher.

Parameters:
tokenizer : str

Tokenisation backend: "simple" (regex \w+), "nltk" (nltk.tokenize.word_tokenize), or "spacy" (spaCy tokenizer).

spacy_model : str

spaCy model name, used when tokenizer="spacy" or lemmatizer="spacy".

lemmatizer : str or None

Lemmatisation backend: "spacy", "nltk" (WordNetLemmatizer), or None (skip).

stemmer : str or None

Stemming backend: "porter", "snowball", "lancaster", or None (skip).

stemmer_language : str

Language for the Snowball stemmer.

keyword_extractor : str or None

Keyword extraction backend: "frequency" (top-N by term frequency), "yake", "keybert", or None (skip).

max_keywords : int

Maximum number of keywords to extract per document.

lowercase_tokens : bool

Lowercase all tokens before further processing.

remove_stopwords : bool

Remove stopwords. Uses NLTK’s English stopword list when available, otherwise a small built-in set.

min_token_length : int

Discard tokens shorter than this many characters (measured after lowercasing).

remove_punctuation : bool

Remove tokens that consist entirely of punctuation.
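
The token-level options above can be pictured with a small standard-library sketch. This is not NLPEnricher's implementation: the helper names filter_tokens and frequency_keywords, the FALLBACK_STOPWORDS set, and the order in which the filters run are all assumptions made for illustration. It only mirrors what the parameter descriptions say for tokenizer="simple" and keyword_extractor="frequency".

```python
import re
from collections import Counter

# Hypothetical stand-in for the "small built-in set" of stopwords;
# the real contents are not listed in this reference.
FALLBACK_STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})


def filter_tokens(text, lowercase_tokens=True, remove_stopwords=True,
                  min_token_length=2, remove_punctuation=True):
    """Tokenise with a word-character regex and apply the documented filters.

    The order of the filters below is an assumption, not documented behaviour.
    """
    tokens = re.findall(r"\w+", text)  # the "simple" backend's regex
    if lowercase_tokens:
        tokens = [t.lower() for t in tokens]
    if remove_punctuation:
        # With \w+ only runs of "_" can lack letters and digits; other
        # tokenizers may emit tokens such as "," or "...".
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in FALLBACK_STOPWORDS]
    # Length is checked on the (possibly lowercased) token.
    return [t for t in tokens if len(t) >= min_token_length]


def frequency_keywords(tokens, max_keywords=20):
    """Return the top-N terms ranked by term frequency."""
    return [term for term, _ in Counter(tokens).most_common(max_keywords)]


text = "The cat sat on the mat. The cat likes the mat, and the cat is happy."
tokens = filter_tokens(text)
keywords = frequency_keywords(tokens, max_keywords=2)
print(tokens)    # ['cat', 'sat', 'mat', 'cat', 'likes', 'mat', 'cat', 'happy']
print(keywords)  # ['cat', 'mat']
```

Setting remove_stopwords=False or raising min_token_length changes which tokens reach the keyword step, so the two groups of options interact: the frequency ranking only ever sees tokens that survive the filters.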

Notes

User note: For RAG pipelines, tokenizer="simple" with keyword_extractor="frequency" is usually sufficient. For linguistic research, use tokenizer="spacy" with lemmatizer="spacy" for the most accurate tokens and lemmas.
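
As a minimal sketch of the two profiles described in this note, the keyword arguments below use only parameter names and values from the class signature. The variable names are illustrative, and the import is guarded because scikitplot (and, for the second profile, spaCy with the en_core_web_sm model) may not be installed. Whether the config validates its values at construction time is not stated here, so treat this as an assumption-laden example, not a reference usage.

```python
# Lightweight profile suggested for RAG pipelines.
rag_kwargs = {
    "tokenizer": "simple",
    "keyword_extractor": "frequency",
}

# Heavier profile suggested for linguistic research; spacy_model is the
# documented default and is used by both the tokenizer and the lemmatizer.
research_kwargs = {
    "tokenizer": "spacy",
    "lemmatizer": "spacy",
    "spacy_model": "en_core_web_sm",
}

try:
    # Import path taken from the class signature in this reference.
    from scikitplot.corpus import EnricherConfig
except ImportError:
    EnricherConfig = None  # library not available in this environment

if EnricherConfig is not None:
    rag_config = EnricherConfig(**rag_kwargs)
    research_config = EnricherConfig(**research_kwargs)
```

Parameters left out of either dictionary keep their documented defaults, for example max_keywords=20 and remove_stopwords=True.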

keyword_extractor: str | None = 'frequency'
lemmatizer: str | None = None
lowercase_tokens: bool = True
max_keywords: int = 20
min_token_length: int = 2
remove_punctuation: bool = True
remove_stopwords: bool = True
spacy_model: str = 'en_core_web_sm'
stemmer: str | None = None
stemmer_language: str = 'english'
tokenizer: str = 'simple'