EnricherConfig#
- class scikitplot.corpus.EnricherConfig(tokenizer='simple', spacy_model='en_core_web_sm', lemmatizer=None, stemmer=None, stemmer_language='english', keyword_extractor='frequency', max_keywords=20, lowercase_tokens=True, remove_stopwords=True, min_token_length=2, remove_punctuation=True)[source]#
Configuration for NLPEnricher.

- Parameters:
  - tokenizer : str
    Tokenisation backend: "simple" (regex \w+), "nltk" (nltk.tokenize.word_tokenize), or "spacy" (spaCy tokenizer).
  - spacy_model : str
    spaCy model name, used when tokenizer="spacy" or lemmatizer="spacy".
  - lemmatizer : str or None
    Lemmatisation backend: "spacy", "nltk" (WordNetLemmatizer), or None (skip).
  - stemmer : str or None
    Stemming backend: "porter", "snowball", "lancaster", or None (skip).
  - stemmer_language : str
    Language for the Snowball stemmer.
  - keyword_extractor : str or None
    Keyword extraction backend: "frequency" (top-N by term frequency), "yake", "keybert", or None (skip).
  - max_keywords : int
    Maximum number of keywords to extract per document.
  - lowercase_tokens : bool
    Lowercase all tokens before further processing.
  - remove_stopwords : bool
    Remove stopwords. Uses NLTK's English stopword list when available, otherwise a small built-in set.
  - min_token_length : int
    Discard tokens shorter than this (after lowercasing).
  - remove_punctuation : bool
    Remove tokens that are entirely punctuation.
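The default pipeline implied by these options (regex \w+ tokenisation, lowercasing, stopword removal, a minimum token length, and top-N frequency keywords) can be sketched in plain Python. This is an illustrative stand-in, not scikitplot's implementation; the stopword set below is a placeholder for the "small built-in set" mentioned above:

```python
import re
from collections import Counter

# Placeholder stopword set, standing in for NLTK's English list.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "of", "to", "in", "for", "by"}

def extract_keywords(text, max_keywords=20, lowercase=True,
                     remove_stopwords=True, min_token_length=2):
    """Top-N keywords by term frequency, mirroring the 'simple'/'frequency' defaults."""
    tokens = re.findall(r"\w+", text)            # tokenizer="simple"
    if lowercase:
        tokens = [t.lower() for t in tokens]     # lowercase_tokens=True
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t for t in tokens if len(t) >= min_token_length]
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(max_keywords)]

keywords = extract_keywords(
    "The enricher extracts keywords: keywords are ranked by frequency in the text.",
    max_keywords=3,
)
print(keywords)  # ['keywords', 'enricher', 'extracts']
```

Punctuation removal is implicit here: the \w+ pattern never emits punctuation-only tokens, which is why the "simple" tokenizer pairs naturally with remove_punctuation=True.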
Notes

For RAG pipelines, tokenizer="simple" with keyword_extractor="frequency" is usually sufficient. For linguistic research, use tokenizer="spacy" with lemmatizer="spacy" for best accuracy.
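The two setups recommended in the note might be constructed as follows. This is a sketch assuming EnricherConfig is importable from scikitplot.corpus as documented above, and that the spaCy model named is installed:

```python
from scikitplot.corpus import EnricherConfig

# Lightweight setup for RAG pipelines: regex tokenisation, frequency keywords.
rag_config = EnricherConfig(tokenizer="simple", keyword_extractor="frequency")

# Linguistically accurate setup: spaCy tokenisation and lemmatisation
# (requires the en_core_web_sm model to be downloaded separately).
research_config = EnricherConfig(
    tokenizer="spacy",
    lemmatizer="spacy",
    spacy_model="en_core_web_sm",
)
```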