EnricherConfig#
- class scikitplot.corpus.EnricherConfig(language=None, tokenizer='simple', custom_tokenizer=None, spacy_model='en_core_web_sm', lemmatizer=None, custom_lemmatizer=None, stemmer=None, custom_stemmer=None, stemmer_language='english', keyword_extractor='frequency', keyword_extractor_kwargs=None, max_keywords=20, save_token_scores=False, lowercase_tokens=True, remove_stopwords=True, extra_stopwords=None, min_token_length=2, remove_punctuation=True, strip_unicode_punctuation=False, pos_tags=False, ner_entities=False, sentence_count=False, char_count=False, type_token_ratio=False)[source]#
Configuration for NLPEnricher.

- Parameters:
  - language : str or list[str] or None, optional
    Language(s) to use for stopword loading and tokenisation. Accepts:
    - None — auto-detect per document from the text content
    - "en" — ISO 639-1 two-letter code, resolved to the NLTK name
    - "english" — NLTK-style full language name
    - ["en", "ar"] — multi-language: union stopwords for both
    Supports 200+ world languages via _language_data. See coerce_language for the full resolution chain, including regional aliases ("chilean_spanish" → "spanish", "new_zealand_english" → "english", etc.).
  - tokenizer : str
    Tokenisation backend:
    - "simple" (default) — regex \w+ (Unicode-aware, no deps)
    - "nltk" — nltk.tokenize.word_tokenize
    - "spacy" — spaCy tokenizer (requires spacy_model)
    - "custom" — use custom_tokenizer
  - custom_tokenizer : callable or TokenizerProtocol or None
    User tokenizer for tokenizer="custom". Accepts any object with a tokenize(text: str) -> list[str] method, or a plain callable. Useful for MeCab (Japanese), jieba (Chinese), camel-tools (Arabic), Stanza (100+ languages), HuggingFace tokenizers, etc. See the sketch after this parameter list.
  - spacy_model : str
    spaCy model name, used for tokenizer="spacy", lemmatizer="spacy", pos_tags=True, or ner_entities=True. Example: "en_core_web_sm".
  - lemmatizer : str or None
    Lemmatisation backend: "spacy", "nltk", "custom", or None (skip).
  - custom_lemmatizer : callable or LemmatizerProtocol or None
    User lemmatizer for lemmatizer="custom". Must have a lemmatize(word: str, pos: str | None = None) -> str method, or be a plain callable.
  - stemmer : str or None
    Stemming backend: "porter", "snowball", "lancaster", "custom", or None (skip).
  - custom_stemmer : callable or StemmerProtocol or None
    User stemmer for stemmer="custom". Must have a stem(word: str) -> str method, or be a plain callable.
  - stemmer_language : str or list[str] or None
    Language(s) for the Snowball stemmer. Accepts the same forms as language. Defaults to "english".
  - keyword_extractor : str or None
    Keyword extraction backend: "frequency", "tfidf", "yake", "keybert", or None (skip).
    - "frequency" — top-N by raw term count (no deps)
    - "tfidf" — top-N by within-document TF-IDF score (no deps)
    - "yake" — unsupervised statistical (requires yake)
    - "keybert" — embedding-based (requires keybert)
  - keyword_extractor_kwargs : dict or None
    Extra kwargs forwarded to the keyword extractor (e.g. YAKE language setting, KeyBERT model name).
  - max_keywords : int
    Maximum number of keywords to extract per document.
  - save_token_scores : bool
    When True and keyword_extractor="tfidf", store per-token TF-IDF scores as a token_scores dict in document metadata.
  - lowercase_tokens : bool
    Lowercase all tokens before further processing.
  - remove_stopwords : bool
    Remove stopwords. Stopword language(s) follow language.
  - extra_stopwords : frozenset[str] or None
    Additional custom stopwords merged with the detected/specified list.
  - min_token_length : int
    Discard tokens shorter than this (after lowercasing).
  - remove_punctuation : bool
    Remove tokens that are entirely ASCII punctuation.
  - strip_unicode_punctuation : bool
    Remove Unicode punctuation characters from token text (a superset of remove_punctuation; handles CJK 。!?, Arabic ،؟, etc.).
  - pos_tags : bool
    When True, populate a pos_tags list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").
  - ner_entities : bool
    When True, populate a ner_entities list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").
  - sentence_count : bool
    When True, compute and store the sentence count in document metadata (uses a multi-script regex, no external deps).
  - char_count : bool
    When True, store the raw character count in document metadata.
  - type_token_ratio : bool
    When True, store lexical diversity (unique/total tokens) in document metadata. Useful for LLM context quality assessment.
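As a minimal sketch of the custom-tokenizer hook referenced above: the toy tokenizer below stands in for MeCab, jieba, or a HuggingFace tokenizer, and only constructor arguments from the signature above are used.

```python
import re

from scikitplot.corpus import EnricherConfig

# A plain callable satisfies tokenizer="custom": text in, list of tokens out.
# Swap in jieba.lcut, a MeCab wrapper, or any object with a
# tokenize(text: str) -> list[str] method for real workloads.
def my_tokenize(text):
    return re.findall(r"\w+", text.lower())

config = EnricherConfig(
    language="en",                 # "en" resolves to the NLTK name "english"
    tokenizer="custom",
    custom_tokenizer=my_tokenize,
    keyword_extractor="tfidf",
    max_keywords=10,
)
```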
Notes
User note:
- For RAG pipelines: tokenizer="simple" + keyword_extractor="tfidf" + no stemmer/lemmatizer is fast and works for all Latin-script languages.
- For multilingual RAG: set language=["en", "ar"] and the enricher will union stopwords for both languages automatically.
- For linguistic research: tokenizer="spacy" + lemmatizer="spacy" + pos_tags=True + ner_entities=True gives the richest output.
- For LLM fine-tuning data: enable sentence_count, char_count, type_token_ratio, and save_token_scores to add quality signals to each document (sketched below).
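As a rough illustration of the recipes above, using only keyword arguments from the documented signature (defaults not shown may differ):

```python
from scikitplot.corpus import EnricherConfig

# Fast RAG preset: simple regex tokenizer, within-document TF-IDF keywords,
# no stemming or lemmatisation.
rag_config = EnricherConfig(
    tokenizer="simple",
    keyword_extractor="tfidf",
    stemmer=None,
    lemmatizer=None,
)

# Multilingual RAG: English and Arabic stopwords are unioned automatically.
multilingual_config = EnricherConfig(
    language=["en", "ar"],
    tokenizer="simple",
    keyword_extractor="tfidf",
)

# LLM fine-tuning data: per-document quality signals.
quality_config = EnricherConfig(
    keyword_extractor="tfidf",
    save_token_scores=True,
    sentence_count=True,
    char_count=True,
    type_token_ratio=True,
)
```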
Developer note: All NLP backends are lazy-loaded and cached on NLPEnricher._* attributes. The class is NOT thread-safe. Use separate instances per thread.
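A rough sketch of the per-thread pattern the developer note implies. The NLPEnricher constructor call and the enrich method below are assumptions made for illustration only; consult the NLPEnricher documentation for the actual API. The config itself can be shared, since it is only read.

```python
import threading

from scikitplot.corpus import EnricherConfig, NLPEnricher  # NLPEnricher import path assumed

config = EnricherConfig(tokenizer="simple", keyword_extractor="frequency")

def worker(texts):
    # One NLPEnricher per thread: the lazily loaded backends cached on
    # NLPEnricher._* attributes are not lock-protected, so instances
    # must not be shared across threads.
    enricher = NLPEnricher(config)   # constructor signature assumed
    for text in texts:
        enricher.enrich(text)        # hypothetical method name, for illustration

chunks = [["first document"], ["second document"]]
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```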
Gallery examples#
corpus WHO European Region local or url per file with examples