NLPEnricher

class scikitplot.corpus.NLPEnricher(config=None)

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Parameters:

config (EnricherConfig or None, optional): Enrichment settings. None uses all defaults.

Notes

User note: Place the enricher after TextNormalizer and before EmbeddingEngine in the pipeline:

source → reader → chunker → filter → normalizer
  → **enricher** → embedder

The enricher reads doc.normalized_text when available, falling back to doc.text. When language=None, the dominant script of each document is detected independently via detect_script.
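The text-selection rule above can be sketched as follows. This is a minimal illustration, not the library's code: `Doc` is a hypothetical stand-in for CorpusDocument that carries only the two text fields, `select_text` is an invented helper name, and treating "available" as "not None" is an assumption. Script detection via detect_script is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (only the two text fields)."""

    text: str
    normalized_text: Optional[str] = None


def select_text(doc: Doc) -> str:
    """Prefer normalized_text when it is set, otherwise fall back to text.

    Assumption: "available" means "is not None".
    """
    if doc.normalized_text is not None:
        return doc.normalized_text
    return doc.text


raw_only = Doc(text="Hello,   World!")
normalized = Doc(text="Hello,   World!", normalized_text="hello, world!")
```

Here `select_text(raw_only)` returns the raw text, while `select_text(normalized)` returns the normalized form.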

Developer note: All NLP backends are lazy-loaded and cached on self._* attributes. The class is NOT thread-safe. Use separate instances per thread.
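One way to follow the "separate instances per thread" advice is to keep one enricher per worker thread in `threading.local`. The sketch below uses `StubEnricher`, a hypothetical stand-in that mimics a lazily cached backend, so it runs without the library installed. In real code the factory would presumably construct an NLPEnricher instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubEnricher:
    """Hypothetical stand-in for NLPEnricher; caches a backend lazily."""

    def __init__(self):
        self._backend = None  # loaded on first use, like a lazy NLP backend

    def enrich_documents(self, documents):
        if self._backend is None:
            self._backend = str.upper  # placeholder for an expensive model load
        return [self._backend(d) for d in documents]


_local = threading.local()


def get_enricher() -> StubEnricher:
    """Return the enricher owned by the calling thread, creating it once."""
    if not hasattr(_local, "enricher"):
        _local.enricher = StubEnricher()
    return _local.enricher


def enrich_batch(batch):
    # Each worker thread only ever touches its own enricher instance.
    return get_enricher().enrich_documents(batch)


batches = [["a", "b"], ["c"], ["d", "e"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(enrich_batch, batches))
```

Because the cached state lives on each thread's own instance, no locking is needed around the lazy load. The cost is that every thread loads its backends once.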

Examples

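>>> # Import path for EnricherConfig is assumed to match NLPEnricher's.
>>> from scikitplot.corpus import EnricherConfig, NLPEnricher
>>> cfg = EnricherConfig(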
...     language=["en", "ar"],
...     keyword_extractor="tfidf",
...     sentence_count=True,
...     char_count=True,
...     save_token_scores=True,
... )
>>> enricher = NLPEnricher(cfg)
>>> # docs = enricher.enrich_documents([doc1, doc2])

enrich_documents(documents, *, overwrite=False)

Enrich a batch of CorpusDocument instances.

Parameters:

documents (Sequence[CorpusDocument]): Documents to enrich. Original objects are not mutated; new instances are returned via doc.replace().
overwrite (bool, optional): When True, re-enrich even if NLP fields are already set. Default False (skip already-enriched documents).

Returns:

list[CorpusDocument]: New document instances with NLP and metadata fields populated.

Parameters:

documents (Sequence[Any])
overwrite (bool)

Return type:

list[Any]
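The non-mutating and overwrite semantics described above can be illustrated with a small stand-in. This is a sketch under stated assumptions: `Doc` is a hypothetical frozen dataclass rather than CorpusDocument, `tokens` is an invented NLP field, `dataclasses.replace` plays the role of doc.replace(), and returning already-enriched documents unchanged when overwrite is False is an assumption about the skip behavior.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument with one NLP field."""

    text: str
    tokens: Optional[tuple] = None


def enrich_documents(documents: Sequence[Doc], *, overwrite: bool = False) -> list:
    """Return new documents; skip ones already enriched unless overwrite=True."""
    out = []
    for doc in documents:
        if doc.tokens is not None and not overwrite:
            out.append(doc)  # already enriched: passed through unchanged
            continue
        # replace() builds a new instance, so the input object is untouched.
        out.append(replace(doc, tokens=tuple(doc.text.split())))
    return out


fresh = Doc("a b c")
stale = Doc("x y", tokens=("old",))
skipped = enrich_documents([fresh, stale])
redone = enrich_documents([fresh, stale], overwrite=True)
```

In this sketch `fresh` itself keeps `tokens=None` after both calls, `skipped` preserves the existing tokens of `stale`, and `redone` recomputes them.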

Notes

Developer note: Documents are processed sequentially. For large corpora, call in batches to control memory.
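A minimal sketch of batched calls, assuming only that the enricher exposes enrich_documents(documents) as documented above. The helper names `batched` and `enrich_in_batches` and the default batch size of 256 are illustrative choices, not part of the library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def enrich_in_batches(enricher, documents, size=256):
    """Yield enriched documents one batch at a time.

    `enricher` is any object with an enrich_documents(list) method;
    only one batch of inputs and outputs is held in memory at once.
    """
    for batch in batched(documents, size):
        yield from enricher.enrich_documents(batch)
```

Because `enrich_in_batches` is a generator, the caller can stream results to disk or to the embedder without materializing the whole corpus.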

Gallery examples

corpus A Tale of Two Cities .mp3 with examples

corpus WHO European Region local or url per file with examples