NLPEnricher

class scikitplot.corpus.NLPEnricher(config=None)

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Parameters:
config : EnricherConfig or None, optional

Enrichment settings. None uses defaults.

See also

scikitplot.corpus._normalizers._text_normalizer.TextNormalizer

Upstream component that prepares normalized_text.

scikitplot.corpus._schema.CorpusDocument

Schema that holds the tokens, lemmas, stems, and keywords fields.

Notes

User note: Place the enricher after TextNormalizer and before EmbeddingEngine in the pipeline:

source → reader → chunker → filter → normalizer
  → **enricher** → embedder

The enricher reads doc.normalized_text when available, falling back to doc.text.
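
The fallback described above can be sketched as follows. This is a minimal illustration, not the library's code: `Doc` and `select_text` are hypothetical names, and it assumes "available" means the field is not None, which the real check may define differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (only the two text fields)."""

    text: str
    normalized_text: Optional[str] = None


def select_text(doc: Doc) -> str:
    """Pick the text an enricher would process (hypothetical helper)."""
    # Prefer the normalizer's output; fall back to the raw text otherwise.
    # Assumption: "available" means "not None".
    normalized = getattr(doc, "normalized_text", None)
    return normalized if normalized is not None else doc.text
```

With this sketch, a document that skipped the normalizer is still processed, just on its raw text.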

Developer note: All NLP backends are lazy-loaded and cached on self._* attributes. The class is NOT thread-safe (shared mutable cache). Use separate instances per thread.
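
One way to follow the one-instance-per-thread advice is a thread-local factory. The sketch below is an assumption about usage, not part of the library: `per_thread` and `FakeEnricher` are hypothetical names, and `FakeEnricher` stands in for NLPEnricher so the example runs without the package installed.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def per_thread(factory: Callable[[], T]) -> Callable[[], T]:
    """Return a getter that lazily builds one object per thread."""
    local = threading.local()

    def get() -> T:
        # Each thread sees its own attribute namespace on `local`,
        # so the instance (and its cache) is never shared.
        if not hasattr(local, "instance"):
            local.instance = factory()
        return local.instance

    return get


class FakeEnricher:
    """Hypothetical stand-in; pass NLPEnricher as the factory in real code."""


get_enricher = per_thread(FakeEnricher)
```

Each worker calls `get_enricher()` instead of sharing a module-level instance, at the cost of loading the NLP backends once per thread.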

Examples

>>> from scikitplot.corpus import NLPEnricher
>>> enricher = NLPEnricher()
>>> # doc = CorpusDocument(text="The quick brown fox.", ...)
>>> # docs = enricher.enrich_documents([doc])
>>> # docs[0].tokens == ["quick", "brown", "fox"]

enrich_documents(documents, *, overwrite=False)

Enrich a batch of CorpusDocument instances.

Parameters:
documents : Sequence[CorpusDocument]

Documents to enrich. Not mutated.

overwrite : bool, optional

Re-enrich even if NLP fields are already populated.

Returns:
list[CorpusDocument]

New instances with NLP fields populated.

Return type:

list[Any]
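
The contract described for enrich_documents (inputs left untouched, new instances returned, populated fields skipped unless overwrite is set) can be illustrated with a frozen dataclass and `dataclasses.replace`. This is a sketch under stated assumptions, not the library's implementation: `Doc` and `enrich_all` are hypothetical, the tokenizer is a toy whitespace split, and whether the real method copies already-enriched documents is not stated in the source.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument."""

    text: str
    tokens: Optional[List[str]] = None


def enrich_all(documents: Sequence[Doc], *, overwrite: bool = False) -> List[Doc]:
    """Return enriched copies; the input documents are never modified."""
    out: List[Doc] = []
    for doc in documents:
        if doc.tokens is not None and not overwrite:
            # Already enriched: keep the existing tokens.
            # Assumption: a copy is returned to keep the "new instances" contract.
            out.append(replace(doc))
            continue
        # Toy tokenizer (lowercase whitespace split), NOT the real NLP backend.
        out.append(replace(doc, tokens=doc.text.lower().split()))
    return out
```

Returning copies keeps callers free to compare documents before and after enrichment, which an in-place update would not allow.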