EnricherConfig#
- class scikitplot.corpus.EnricherConfig(language=None, tokenizer='simple', custom_tokenizer=None, spacy_model='en_core_web_sm', lemmatizer=None, custom_lemmatizer=None, stemmer=None, custom_stemmer=None, stemmer_language='english', keyword_extractor='frequency', keyword_extractor_kwargs=None, max_keywords=20, save_token_scores=False, lowercase_tokens=True, remove_stopwords=True, extra_stopwords=None, min_token_length=2, remove_punctuation=True, strip_unicode_punctuation=False, pos_tags=False, ner_entities=False, sentence_count=False, char_count=False, type_token_ratio=False)[source]#
Configuration for NLPEnricher.

- Parameters:
  - language : str or list[str] or None, optional
    Language(s) to use for stopword loading and tokenisation. Accepts:
    - None — auto-detect per document from the text content
    - "en" — ISO 639-1 two-letter code, resolved to the NLTK name
    - "english" — NLTK-style full language name
    - ["en", "ar"] — multi-language: union stopwords for both
    Supports 200+ world languages via _language_data. See coerce_language for the full resolution chain, including regional aliases ("chilean_spanish" → "spanish", "new_zealand_english" → "english", etc.).
  - tokenizer : str
    Tokenisation backend:
    - "simple" (default) — regex \w+ (Unicode-aware, no deps)
    - "nltk" — nltk.tokenize.word_tokenize
    - "spacy" — spaCy tokenizer (requires spacy_model)
    - "custom" — use custom_tokenizer
  - custom_tokenizer : callable or TokenizerProtocol or None
    User tokenizer for tokenizer="custom". Accepts any object with a tokenize(text: str) -> list[str] method, or a plain callable. Useful for MeCab (Japanese), jieba (Chinese), camel-tools (Arabic), Stanza (100+ languages), HuggingFace tokenizers, etc. See the sketch after this parameter list.
  - spacy_model : str
    spaCy model name, used for tokenizer="spacy", lemmatizer="spacy", pos_tags=True, or ner_entities=True. Example: "en_core_web_sm".
  - lemmatizer : str or None
    Lemmatisation backend: "spacy", "nltk", "custom", or None (skip).
  - custom_lemmatizer : callable or LemmatizerProtocol or None
    User lemmatizer for lemmatizer="custom". Must have a lemmatize(word: str, pos: str | None = None) -> str method, or be a plain callable.
  - stemmer : str or None
    Stemming backend: "porter", "snowball", "lancaster", "custom", or None (skip).
  - custom_stemmer : callable or StemmerProtocol or None
    User stemmer for stemmer="custom". Must have a stem(word: str) -> str method, or be a plain callable.
  - stemmer_language : str or list[str] or None
    Language(s) for the Snowball stemmer. Accepts the same forms as language. Defaults to "english".
  - keyword_extractor : str or None
    Keyword extraction backend: "frequency", "tfidf", "yake", "keybert", or None (skip).
    - "frequency" — top-N by raw term count (no deps)
    - "tfidf" — top-N by within-document TF-IDF score (no deps)
    - "yake" — unsupervised statistical (requires yake)
    - "keybert" — embedding-based (requires keybert)
  - keyword_extractor_kwargs : dict or None
    Extra kwargs forwarded to the keyword extractor (e.g. YAKE language setting, KeyBERT model name).
  - max_keywords : int
    Maximum number of keywords to extract per document.
  - save_token_scores : bool
    When True and keyword_extractor="tfidf", store per-token TF-IDF scores as a token_scores dict in document metadata.
  - lowercase_tokens : bool
    Lowercase all tokens before further processing.
  - remove_stopwords : bool
    Remove stopwords. Stopword language(s) follow language.
  - extra_stopwords : frozenset[str] or None
    Additional custom stopwords merged with the detected/specified list.
  - min_token_length : int
    Discard tokens shorter than this (after lowercasing).
  - remove_punctuation : bool
    Remove tokens that are entirely ASCII punctuation.
  - strip_unicode_punctuation : bool
    Remove Unicode punctuation characters from token text (a superset of remove_punctuation; handles CJK 。!?, Arabic ،؟, etc.).
  - pos_tags : bool
    When True, populate a pos_tags list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").
  - ner_entities : bool
    When True, populate a ner_entities list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").
  - sentence_count : bool
    When True, compute and store the sentence count in document metadata (uses a multi-script regex, no external deps).
  - char_count : bool
    When True, store the raw character count in document metadata.
  - type_token_ratio : bool
    When True, store lexical diversity (unique/total tokens) in document metadata. Useful for LLM context quality assessment.
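As a minimal sketch of the custom-tokenizer hook referenced above: the toy tokenizer below stands in for MeCab, jieba, or a HuggingFace tokenizer, and only constructor arguments from the signature above are used.

```python
import re

from scikitplot.corpus import EnricherConfig

# A plain callable satisfies tokenizer="custom": text in, list of tokens out.
# Swap in jieba.lcut, a MeCab wrapper, or any object with a
# tokenize(text: str) -> list[str] method for real workloads.
def my_tokenize(text):
    return re.findall(r"\w+", text.lower())

config = EnricherConfig(
    language="en",                 # "en" resolves to the NLTK name "english"
    tokenizer="custom",
    custom_tokenizer=my_tokenize,
    keyword_extractor="tfidf",
    max_keywords=10,
)
```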
Notes
User note:
- For RAG pipelines: tokenizer="simple" + keyword_extractor="tfidf" + no stemmer/lemmatizer is fast and works for all Latin-script languages.
- For multilingual RAG: set language=["en", "ar"] and the enricher will union stopwords for both languages automatically.
- For linguistic research: tokenizer="spacy" + lemmatizer="spacy" + pos_tags=True + ner_entities=True gives the richest output.
- For LLM fine-tuning data: enable sentence_count, char_count, type_token_ratio, and save_token_scores to add quality signals to each document (sketched below).
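As a rough illustration of the recipes above, using only keyword arguments from the documented signature (defaults not shown may differ):

```python
from scikitplot.corpus import EnricherConfig

# Fast RAG preset: simple regex tokenizer, within-document TF-IDF keywords,
# no stemming or lemmatisation.
rag_config = EnricherConfig(
    tokenizer="simple",
    keyword_extractor="tfidf",
    stemmer=None,
    lemmatizer=None,
)

# Multilingual RAG: English and Arabic stopwords are unioned automatically.
multilingual_config = EnricherConfig(
    language=["en", "ar"],
    tokenizer="simple",
    keyword_extractor="tfidf",
)

# LLM fine-tuning data: per-document quality signals.
quality_config = EnricherConfig(
    keyword_extractor="tfidf",
    save_token_scores=True,
    sentence_count=True,
    char_count=True,
    type_token_ratio=True,
)
```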
Developer note: All NLP backends are lazy-loaded and cached on NLPEnricher._* attributes. The class is NOT thread-safe. Use separate instances per thread.
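A rough sketch of the per-thread pattern the developer note implies. The NLPEnricher constructor call and the enrich method below are assumptions made for illustration only; consult the NLPEnricher documentation for the actual API. The config itself can be shared, since it is only read.

```python
import threading

from scikitplot.corpus import EnricherConfig, NLPEnricher  # NLPEnricher import path assumed

config = EnricherConfig(tokenizer="simple", keyword_extractor="frequency")

def worker(texts):
    # One NLPEnricher per thread: the lazily loaded backends cached on
    # NLPEnricher._* attributes are not lock-protected, so instances
    # must not be shared across threads.
    enricher = NLPEnricher(config)   # constructor signature assumed
    for text in texts:
        enricher.enrich(text)        # hypothetical method name, for illustration

chunks = [["first document"], ["second document"]]
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```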
Gallery examples#
corpus WHO European Region local or url per file with examples