NLPEnricher
- class scikitplot.corpus.NLPEnricher(config=None)
Pipeline component that populates NLP enrichment fields on
CorpusDocument.
- Parameters:
- config : EnricherConfig or None, optional
Enrichment settings. None uses defaults.
See also
scikitplot.corpus._normalizers._text_normalizer.TextNormalizer
Upstream component that prepares normalized_text.
scikitplot.corpus._schema.CorpusDocument
The tokens, lemmas, stems, keywords fields.
Notes
User note: Insert after TextNormalizer and before EmbeddingEngine in the pipeline:
source → reader → chunker → filter → normalizer → **enricher** → embedder
The enricher reads doc.normalized_text when available, falling back to doc.text.
Developer note: All NLP backends are lazy-loaded and cached on self._* attributes. The class is NOT thread-safe (shared mutable cache). Use separate instances per thread.
Examples
>>> from scikitplot.corpus import NLPEnricher
>>> enricher = NLPEnricher()
>>> # doc = CorpusDocument(text="The quick brown fox.", ...)
>>> # docs = enricher.enrich_documents([doc])
>>> # docs[0].tokens == ["quick", "brown", "fox"]
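The developer note above recommends one instance per thread. A minimal sketch of that pattern follows, using threading.local so each worker thread lazily builds and reuses its own enricher. StubEnricher, get_enricher and enrich_batch are hypothetical names introduced only for illustration; StubEnricher stands in for NLPEnricher so the snippet runs without scikitplot installed, and its whitespace tokenizer is an assumption, not the library's behavior.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubEnricher:
    """Hypothetical stand-in for NLPEnricher (not the real class)."""

    def __init__(self):
        # Mimics a lazily loaded backend cached on a self._* attribute.
        self._tokenizer = None

    def enrich_documents(self, documents):
        if self._tokenizer is None:
            self._tokenizer = str.split  # assumed placeholder backend
        return [self._tokenizer(text) for text in documents]


_local = threading.local()


def get_enricher():
    """Return the enricher owned by the calling thread, creating it on first use."""
    enricher = getattr(_local, "enricher", None)
    if enricher is None:
        enricher = StubEnricher()
        _local.enricher = enricher
    return enricher


def enrich_batch(batch):
    # Each worker thread only ever touches its own instance and cache.
    return get_enricher().enrich_documents(batch)


batches = [["the quick brown fox"], ["jumps over"], ["the lazy dog"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(enrich_batch, batches))
```

With the real class, replacing StubEnricher() with NLPEnricher() inside get_enricher should be the only change needed, assuming construction is cheap enough to repeat once per thread.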
- enrich_documents(documents, *, overwrite=False)
Enrich a batch of CorpusDocument instances.
- Parameters:
- documents : Sequence[CorpusDocument]
Documents to enrich. Not mutated.
- overwrite : bool, optional
Re-enrich even if NLP fields are already populated.
- Returns:
- list[CorpusDocument]
New instances with NLP fields populated.
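The contract described above (inputs are not mutated, new instances are returned, and overwrite controls re-enrichment) can be sketched with a frozen dataclass and dataclasses.replace. This is an illustration under stated assumptions, not the library's implementation: Doc is a hypothetical stand-in for CorpusDocument, the lowercase whitespace tokenizer is a placeholder, and returning a copy for already-enriched documents is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument with only the fields used here."""

    text: str
    normalized_text: Optional[str] = None
    tokens: Optional[List[str]] = None


def enrich_documents(documents: Sequence[Doc], *, overwrite: bool = False) -> List[Doc]:
    enriched = []
    for doc in documents:
        if doc.tokens is not None and not overwrite:
            # Already enriched: return an unchanged copy (assumed behavior).
            enriched.append(replace(doc))
            continue
        # Prefer normalized_text when available, otherwise fall back to text.
        source = doc.normalized_text if doc.normalized_text is not None else doc.text
        # Placeholder tokenizer; the real backends are not modeled here.
        enriched.append(replace(doc, tokens=source.lower().split()))
    return enriched


docs = [
    Doc(text="The Fox", normalized_text="quick fox"),
    Doc(text="Lazy Dog"),
    Doc(text="fresh text", tokens=["stale"]),
]
first_pass = enrich_documents(docs)
second_pass = enrich_documents(docs, overwrite=True)
```

In this sketch first_pass keeps the existing tokens of the third document, while second_pass recomputes them; docs itself is left untouched in both cases.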