NLPEnricher
- class scikitplot.corpus.NLPEnricher(config=None)
Pipeline component that populates NLP enrichment fields on CorpusDocument.
- Parameters:
- config : EnricherConfig or None, optional
Enrichment settings. None uses all defaults.
Notes
User note: Insert after TextNormalizer and before EmbeddingEngine in the pipeline:
source → reader → chunker → filter → normalizer → **enricher** → embedder
The enricher reads doc.normalized_text when available, falling back to doc.text. When language=None, the dominant script of each document is detected independently via detect_script.
Developer note: All NLP backends are lazy-loaded and cached on self._* attributes. The class is NOT thread-safe. Use separate instances per thread.
Examples
>>> cfg = EnricherConfig(
...     language=["en", "ar"],
...     keyword_extractor="tfidf",
...     sentence_count=True,
...     char_count=True,
...     save_token_scores=True,
... )
>>> enricher = NLPEnricher(cfg)
>>> # docs = enricher.enrich_documents([doc1, doc2])
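The developer note above recommends one instance per thread. A minimal sketch of that pattern follows, using `threading.local` to create an instance lazily in each worker thread. `StubEnricher`, `get_enricher`, and `enrich_batch` are hypothetical names introduced for illustration; they are not part of scikitplot, and the stub only stands in for NLPEnricher so the snippet runs on its own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubEnricher:
    """Hypothetical stand-in for NLPEnricher (not part of scikitplot)."""

    def enrich_documents(self, documents):
        # Placeholder "enrichment": upper-case each string.
        return [d.upper() for d in documents]


_local = threading.local()


def get_enricher(factory=StubEnricher):
    # Build the enricher lazily, once per thread, then reuse it.
    if not hasattr(_local, "enricher"):
        _local.enricher = factory()
    return _local.enricher


def enrich_batch(batch):
    # Each worker thread only ever touches its own instance.
    return get_enricher().enrich_documents(batch)


batches = [["a", "b"], ["c"], ["d", "e"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(enrich_batch, batches))
```

With the real class, the factory would presumably be something like `lambda: NLPEnricher(cfg)`, so that each thread also gets its own lazily loaded backends.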
- enrich_documents(documents, *, overwrite=False)
Enrich a batch of CorpusDocument instances.
- Parameters:
- documents : Sequence[CorpusDocument]
Documents to enrich. Original objects are not mutated; new instances are returned via doc.replace().
- overwrite : bool, optional
When True, re-enrich even if NLP fields are already set. Default False (skip already-enriched documents).
- Returns:
- list[CorpusDocument]
New document instances with NLP and metadata fields populated.
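The contract described above (inputs are not mutated, already-enriched documents are skipped unless overwrite=True) can be illustrated with a small sketch. This is not the library's implementation: `Doc`, its `tokens` field, and the `enrich` function are hypothetical stand-ins, and the sketch assumes skipped documents are passed through unchanged in the returned list.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument."""

    text: str
    normalized_text: Optional[str] = None
    tokens: Optional[Tuple[str, ...]] = None  # hypothetical NLP field


def enrich(documents: Sequence[Doc], *, overwrite: bool = False) -> List[Doc]:
    out = []
    for doc in documents:
        if doc.tokens is not None and not overwrite:
            # Already enriched: pass through unchanged (assumed behavior).
            out.append(doc)
            continue
        # Prefer normalized text, falling back to the raw text.
        source = doc.normalized_text if doc.normalized_text is not None else doc.text
        # replace() builds a new instance; the input object is left untouched.
        out.append(replace(doc, tokens=tuple(source.split())))
    return out


raw = Doc(text="Hello World", normalized_text="hello world")
done = Doc(text="x y", tokens=("old",))
first_pass = enrich([raw, done])
forced = enrich([done], overwrite=True)
```

Here `raw` keeps `tokens=None` after the call, `first_pass[1]` is the same object as `done`, and only the `overwrite=True` call recomputes its tokens.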
Notes
Developer note: Documents are processed sequentially. For large corpora, call in batches to control memory.
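One way to follow the batching advice above is a small generator that slices the corpus into fixed-size chunks and feeds each chunk to the enricher. The helpers `iter_batches` and `enrich_in_batches` are hypothetical names, not part of scikitplot, and the default batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def enrich_in_batches(
    items: Iterable[T],
    enrich: Callable[[List[T]], List[T]],
    batch_size: int = 1000,
) -> Iterator[T]:
    """Apply enrich to one batch at a time and stream the results."""
    for batch in iter_batches(items, batch_size):
        yield from enrich(batch)


# Stand-in enrich function so the sketch runs without scikitplot.
demo = list(enrich_in_batches(["a", "b", "c"], lambda b: [s.upper() for s in b], batch_size=2))
```

With a real instance, `enricher.enrich_documents` could be passed as the `enrich` argument, so only one batch of enriched documents is held in memory at a time while results are consumed.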
Gallery examples
corpus WHO European Region local or url per file with examples