NLPEnricher

class scikitplot.corpus.NLPEnricher(config=None)

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Parameters:
config : EnricherConfig or None, optional

Enrichment settings. If None, all defaults are used.

Notes

User note: Insert the enricher after TextNormalizer and before EmbeddingEngine in the pipeline:

source → reader → chunker → filter → normalizer
  → **enricher** → embedder

The enricher reads doc.normalized_text when available, falling back to doc.text. When language=None, the dominant script of each document is detected independently via detect_script.
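The text-selection rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `Doc` is a hypothetical stand-in for CorpusDocument, `select_text` is a hypothetical helper, and treating `None` as "not available" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (only the two text fields)."""

    text: str
    normalized_text: Optional[str] = None


def select_text(doc: Doc) -> str:
    """Prefer normalized_text when it is set; otherwise fall back to text.

    Assumption: "available" means "not None".
    """
    if doc.normalized_text is not None:
        return doc.normalized_text
    return doc.text
```

A document that skipped the normalizer is therefore still enriched, just from its raw text.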

Developer note: All NLP backends are lazy-loaded and cached on self._* attributes. The class is NOT thread-safe. Use separate instances per thread.
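One way to follow the one-instance-per-thread advice is to keep the enricher in thread-local storage. The sketch below is self-contained and uses `StubEnricher`, a hypothetical stand-in for NLPEnricher, so that it runs without the library; `get_enricher`, `enrich_batch`, and `run` are illustrative names as well.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubEnricher:
    """Hypothetical stand-in for NLPEnricher; remembers its creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()

    def enrich_documents(self, documents):
        # Fail loudly if an instance is ever shared across threads.
        if self.owner != threading.get_ident():
            raise RuntimeError("enricher used from a foreign thread")
        return [d.upper() for d in documents]


_local = threading.local()


def get_enricher():
    """Return this thread's enricher, creating it on first use."""
    enricher = getattr(_local, "enricher", None)
    if enricher is None:
        enricher = StubEnricher()
        _local.enricher = enricher
    return enricher


def enrich_batch(batch):
    return get_enricher().enrich_documents(batch)


def run(batches, workers=2):
    """Enrich batches in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich_batch, batches))
```

With the real class, `StubEnricher()` would be replaced by `NLPEnricher(cfg)`. Each worker thread then loads and caches its own backends, which costs memory per thread.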

Examples

>>> cfg = EnricherConfig(
...     language=["en", "ar"],
...     keyword_extractor="tfidf",
...     sentence_count=True,
...     char_count=True,
...     save_token_scores=True,
... )
>>> enricher = NLPEnricher(cfg)
>>> # docs = enricher.enrich_documents([doc1, doc2])

enrich_documents(documents, *, overwrite=False)

Enrich a batch of CorpusDocument instances.

Parameters:
documents : Sequence[CorpusDocument]

Documents to enrich. Original objects are not mutated; new instances are returned via doc.replace().

overwrite : bool, optional

When True, re-enrich documents even if their NLP fields are already set. Defaults to False, which skips already-enriched documents.

Returns:
list[CorpusDocument]

New document instances with NLP and metadata fields populated.

Return type:

list[Any]

Notes

Developer note: Documents are processed sequentially. For large corpora, call this method on smaller batches to bound memory use.
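A batching wrapper along these lines is one possible approach. `iter_batches` and `enrich_in_batches` are hypothetical helpers, not part of the library, and the default batch size is an arbitrary placeholder. The `enrich` argument is any callable that maps a batch of documents to a list of results, such as a bound `enrich_documents` method.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def enrich_in_batches(
    enrich: Callable[[Sequence[T]], List[R]],
    documents: Sequence[T],
    batch_size: int = 1000,
) -> Iterator[R]:
    """Lazily enrich documents one batch at a time, preserving order."""
    for batch in iter_batches(documents, batch_size):
        yield from enrich(batch)
```

Because the wrapper is a generator, a caller such as `for doc in enrich_in_batches(enricher.enrich_documents, docs, 500): ...` can write out each enriched document before the next batch is processed, so only one batch of results is held in memory at a time.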