EnricherConfig#
- class scikitplot.corpus.EnricherConfig(tokenizer='simple', spacy_model='en_core_web_sm', lemmatizer=None, stemmer=None, stemmer_language='english', keyword_extractor='frequency', max_keywords=20, lowercase_tokens=True, remove_stopwords=True, min_token_length=2, remove_punctuation=True)[source]#
Configuration for NLPEnricher.

- Parameters:
  - tokenizer : str
    Tokenisation backend: "simple" (regex \w+), "nltk" (nltk.tokenize.word_tokenize), or "spacy" (spaCy tokenizer).
  - spacy_model : str
    spaCy model name, used when tokenizer="spacy" or lemmatizer="spacy".
  - lemmatizer : str or None
    Lemmatisation backend: "spacy", "nltk" (WordNetLemmatizer), or None (skip).
  - stemmer : str or None
    Stemming backend: "porter", "snowball", "lancaster", or None (skip).
  - stemmer_language : str
    Language for the Snowball stemmer.
  - keyword_extractor : str or None
    Keyword extraction backend: "frequency" (top-N by term frequency), "yake", "keybert", or None (skip).
  - max_keywords : int
    Maximum number of keywords to extract per document.
  - lowercase_tokens : bool
    Lowercase all tokens before further processing.
  - remove_stopwords : bool
    Remove stopwords. Uses NLTK's English stopword list when available, otherwise a small built-in set.
  - min_token_length : int
    Discard tokens shorter than this (after lowercasing).
  - remove_punctuation : bool
    Remove tokens that are entirely punctuation.
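The default pipeline implied by these options (regex \w+ tokenisation, lowercasing, stopword removal, a minimum token length, and top-N frequency keywords) can be sketched in plain Python. This is an illustrative stand-in, not scikitplot's implementation; the stopword set below is a placeholder for the "small built-in set" mentioned above:

```python
import re
from collections import Counter

# Placeholder stopword set, standing in for NLTK's English list.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "of", "to", "in", "for", "by"}

def extract_keywords(text, max_keywords=20, lowercase=True,
                     remove_stopwords=True, min_token_length=2):
    """Top-N keywords by term frequency, mirroring the 'simple'/'frequency' defaults."""
    tokens = re.findall(r"\w+", text)            # tokenizer="simple"
    if lowercase:
        tokens = [t.lower() for t in tokens]     # lowercase_tokens=True
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t for t in tokens if len(t) >= min_token_length]
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(max_keywords)]

keywords = extract_keywords(
    "The enricher extracts keywords: keywords are ranked by frequency in the text.",
    max_keywords=3,
)
print(keywords)  # ['keywords', 'enricher', 'extracts']
```

Punctuation removal is implicit here: the \w+ pattern never emits punctuation-only tokens, which is why the "simple" tokenizer pairs naturally with remove_punctuation=True.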
Notes

For RAG pipelines, tokenizer="simple" with keyword_extractor="frequency" is usually sufficient. For linguistic research, use tokenizer="spacy" with lemmatizer="spacy" for best accuracy.
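The two setups recommended in the note might be constructed as follows. This is a sketch assuming EnricherConfig is importable from scikitplot.corpus as documented above, and that the spaCy model named is installed:

```python
from scikitplot.corpus import EnricherConfig

# Lightweight setup for RAG pipelines: regex tokenisation, frequency keywords.
rag_config = EnricherConfig(tokenizer="simple", keyword_extractor="frequency")

# Linguistically accurate setup: spaCy tokenisation and lemmatisation
# (requires the en_core_web_sm model to be downloaded separately).
research_config = EnricherConfig(
    tokenizer="spacy",
    lemmatizer="spacy",
    spacy_model="en_core_web_sm",
)
```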