EnricherConfig

class scikitplot.corpus.EnricherConfig(language=None, tokenizer='simple', custom_tokenizer=None, spacy_model='en_core_web_sm', lemmatizer=None, custom_lemmatizer=None, stemmer=None, custom_stemmer=None, stemmer_language='english', keyword_extractor='frequency', keyword_extractor_kwargs=None, max_keywords=20, save_token_scores=False, lowercase_tokens=True, remove_stopwords=True, extra_stopwords=None, min_token_length=2, remove_punctuation=True, strip_unicode_punctuation=False, pos_tags=False, ner_entities=False, sentence_count=False, char_count=False, type_token_ratio=False)

Configuration for NLPEnricher.

Parameters:
language : str or list[str] or None, optional

Language(s) to use for stopword loading and tokenisation. Accepts:

  • None — auto-detect per document from the text content

  • "en" — ISO 639-1 two-letter code, resolved to NLTK name

  • "english" — NLTK-style full language name

  • ["en", "ar"] — multi-language: union stopwords for both

Supports 200+ world languages via _language_data. See coerce_language for the full resolution chain, including regional aliases ("chilean_spanish" → "spanish", "new_zealand_english" → "english", etc.).
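
For illustration, the accepted forms side by side (a minimal sketch; the import path follows the signature at the top of this page):

from scikitplot.corpus import EnricherConfig

EnricherConfig(language=None)            # auto-detect per document
EnricherConfig(language="en")            # ISO 639-1 code, resolved to "english"
EnricherConfig(language="english")       # NLTK-style full name
EnricherConfig(language=["en", "ar"])    # stopword union of English and Arabic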

tokenizer : str

Tokenisation backend:

  • "simple" (default) — regex \\w+ (Unicode-aware, no deps)

  • "nltk"nltk.tokenize.word_tokenize

  • "spacy" — spaCy tokenizer (requires spacy_model)

  • "custom" — use custom_tokenizer

custom_tokenizer : callable or TokenizerProtocol or None

User tokenizer for tokenizer="custom". Accepts any object with a tokenize(text: str) -> list[str] method, or a plain callable. Useful for MeCab (Japanese), jieba (Chinese), camel-tools (Arabic), Stanza (100+ languages), HuggingFace tokenizers, etc.
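
A sketch of this hook, wrapping jieba for Chinese; the JiebaTokenizer class below is hypothetical, and only the tokenize(text) -> list[str] shape is taken from this page:

import jieba  # third-party, used here for illustration only

from scikitplot.corpus import EnricherConfig

class JiebaTokenizer:
    # Hypothetical adapter satisfying TokenizerProtocol.
    def tokenize(self, text: str) -> list[str]:
        return [tok for tok in jieba.cut(text) if tok.strip()]

config = EnricherConfig(tokenizer="custom", custom_tokenizer=JiebaTokenizer())

# A plain callable is equally valid:
config = EnricherConfig(tokenizer="custom", custom_tokenizer=str.split)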

spacy_model : str

spaCy model name, used when tokenizer="spacy", lemmatizer="spacy", pos_tags=True, or ner_entities=True. Example: "en_core_web_sm".

lemmatizer : str or None

Lemmatisation backend: "spacy", "nltk", "custom", or None (skip).

custom_lemmatizer : callable or LemmatizerProtocol or None

User lemmatizer for lemmatizer="custom". Must have a lemmatize(word: str, pos: str | None = None) -> str method, or be a plain callable.

stemmer : str or None

Stemming backend: "porter", "snowball", "lancaster", "custom", or None (skip).

custom_stemmer : callable or StemmerProtocol or None

User stemmer for stemmer="custom". Must have a stem(word: str) -> str method, or be a plain callable.
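
A combined sketch of the two custom hooks, wrapping NLTK components; the adapter class is hypothetical, with method signatures copied from the protocol descriptions above:

from nltk.stem import PorterStemmer, WordNetLemmatizer

from scikitplot.corpus import EnricherConfig

class WordNetAdapter:
    # Hypothetical adapter exposing lemmatize(word, pos=None) -> str.
    def __init__(self) -> None:
        self._wnl = WordNetLemmatizer()

    def lemmatize(self, word: str, pos: str | None = None) -> str:
        return self._wnl.lemmatize(word, pos or "n")  # WordNet defaults to noun

config = EnricherConfig(
    lemmatizer="custom",
    custom_lemmatizer=WordNetAdapter(),
    stemmer="custom",
    custom_stemmer=PorterStemmer().stem,  # a bound method works as a plain callable
)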

stemmer_language : str or list[str] or None

Language(s) for the Snowball stemmer. Accepts the same forms as language. Defaults to "english".

keyword_extractor : str or None

Keyword extraction backend: "frequency", "tfidf", "yake", "keybert", or None (skip).

  • "frequency" — top-N by raw term count (no deps)

  • "tfidf" — top-N by within-document TF-IDF score (no deps)

  • "yake" — unsupervised statistical (requires yake)

  • "keybert" — embedding-based (requires keybert)

keyword_extractor_kwargs : dict or None

Extra kwargs forwarded to the keyword extractor (e.g. YAKE language setting, KeyBERT model name).
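
Two hedged examples: the first uses only parameters documented on this page; in the second, the forwarded kwarg names ("lan", "n") are YAKE's own constructor parameters, and it is an assumption that the extractor passes them through unchanged:

from scikitplot.corpus import EnricherConfig

# Dependency-free TF-IDF keywords, capped at 10 per document:
config = EnricherConfig(keyword_extractor="tfidf", max_keywords=10)

# YAKE backend with forwarded extractor settings (requires the yake package):
yake_config = EnricherConfig(
    keyword_extractor="yake",
    keyword_extractor_kwargs={"lan": "en", "n": 2},
)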

max_keywords : int

Maximum number of keywords to extract per document.

save_token_scores : bool

When True and keyword_extractor="tfidf", store per-token TF-IDF scores as a token_scores dict in document metadata.

lowercase_tokens : bool

Lowercase all tokens before further processing.

remove_stopwords : bool

Remove stopwords. Stopword language(s) follow language.

extra_stopwords : frozenset[str] or None

Additional custom stopwords merged with the detected/specified list.

min_token_length : int

Discard tokens shorter than this (after lowercasing).

remove_punctuation : bool

Remove tokens that are entirely ASCII punctuation.

strip_unicode_punctuation : bool

Remove Unicode punctuation characters from token text (superset of remove_punctuation; handles CJK 。!?, Arabic ،؟, etc.).

pos_tags : bool

When True, populate a pos_tags list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").

ner_entities : bool

When True, populate a ner_entities list in document metadata (requires tokenizer="spacy" or lemmatizer="spacy").

sentence_count : bool

When True, compute and store the sentence count in document metadata (uses multi-script regex, no external deps).

char_count : bool

When True, store raw character count in document metadata.

type_token_ratio : bool

When True, store lexical diversity (unique/total tokens) in document metadata. Useful for LLM context quality assessment.
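
The ratio itself reduces to one line, shown here on a toy token list:

tokens = ["the", "cat", "sat", "on", "the", "mat"]
ttr = len(set(tokens)) / len(tokens)  # 5 unique / 6 total = 0.833...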


Notes

User note: recommended configurations for common use cases (written out as code after this list):

  • tokenizer="simple" + keyword_extractor="tfidf" + no stemmer/lemmatizer is fast and works for all Latin-script languages.

  • For multilingual RAG: set language=["en", "ar"] and the enricher will union stopwords for both languages automatically.

  • For linguistic research: tokenizer="spacy" + lemmatizer="spacy" + pos_tags=True + ner_entities=True gives the richest output.

  • For LLM fine-tuning data: enable sentence_count, char_count, type_token_ratio, and save_token_scores to add quality signals to each document.
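
A sketch of those four recipes using only parameters documented on this page (save_token_scores requires the tfidf extractor, so it is paired with one here):

from scikitplot.corpus import EnricherConfig

fast_rag = EnricherConfig(tokenizer="simple", keyword_extractor="tfidf")

multilingual_rag = EnricherConfig(language=["en", "ar"], keyword_extractor="tfidf")

research = EnricherConfig(
    tokenizer="spacy", lemmatizer="spacy", pos_tags=True, ner_entities=True
)

finetune_quality = EnricherConfig(
    keyword_extractor="tfidf",
    save_token_scores=True,
    sentence_count=True,
    char_count=True,
    type_token_ratio=True,
)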

Developer note: All NLP backends are lazy-loaded and cached on NLPEnricher._* attributes. The class is NOT thread-safe. Use separate instances per thread.
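
One way to honour that constraint in threaded code; this is a sketch, and the NLPEnricher(config) constructor call is an assumption, since this page documents only the configuration class:

import threading

from scikitplot.corpus import EnricherConfig, NLPEnricher

_local = threading.local()

def get_enricher(config: EnricherConfig) -> NLPEnricher:
    # Build one enricher per thread lazily; instances are never shared.
    if not hasattr(_local, "enricher"):
        _local.enricher = NLPEnricher(config)  # assumed constructor shape
    return _local.enricher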

char_count: bool = False
custom_lemmatizer: Any = None
custom_stemmer: Any = None
custom_tokenizer: Any = None
extra_stopwords: Any = None
keyword_extractor: str | None = 'frequency'
keyword_extractor_kwargs: Any = None
language: Any = None
lemmatizer: str | None = None
lowercase_tokens: bool = True
max_keywords: int = 20
min_token_length: int = 2
ner_entities: bool = False
pos_tags: bool = False
remove_punctuation: bool = True
remove_stopwords: bool = True
save_token_scores: bool = False
sentence_count: bool = False
spacy_model: str = 'en_core_web_sm'
stemmer: str | None = None
stemmer_language: Any = 'english'
strip_unicode_punctuation: bool = False
tokenizer: str = 'simple'
type_token_ratio: bool = False