TextNormalizer

class scikitplot.corpus.TextNormalizer(config=None)

Pipeline component that populates normalized_text on CorpusDocument instances.

Parameters:
config : NormalizerConfig or None, optional

Normalisation settings. None uses defaults.

See also

scikitplot.corpus._schema.CorpusDocument

Schema class whose normalized_text field this component populates.

scikitplot.corpus._enrichers._nlp_enricher.NLPEnricher

Downstream component that tokenises normalized_text.

Notes

User note: Insert this component between the filter and embedding stages:

source → reader → chunker → filter → **normalizer** → embedder

If normalized_text is already set on a document (e.g., by a reader that does its own cleaning), this component skips it unless overwrite=True is passed to normalize_documents.

Developer note: This class is stateless and thread-safe. All mutable state lives in the documents being processed.
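
To illustrate why a stateless component can be shared across threads, here is a minimal sketch. StatelessNormalizer is a hypothetical stand-in, not the library class, and its whitespace-collapsing behaviour is an assumption chosen only to keep the example short.

```python
import re
from concurrent.futures import ThreadPoolExecutor


class StatelessNormalizer:
    """Hypothetical stand-in: read-only config, no per-call state."""

    def __init__(self, collapse_whitespace: bool = True) -> None:
        # Set once at construction and never mutated afterwards.
        self._collapse = collapse_whitespace

    def normalize(self, text: str) -> str:
        # Touches only local variables and a read-only attribute,
        # so concurrent calls cannot interfere with one another.
        if not self._collapse:
            return text
        return re.sub(r"\s+", " ", text).strip()


normalizer = StatelessNormalizer()
texts = ["a   b", " c\td ", "e\n\nf"]

# One shared instance serves every worker thread; map() keeps input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(normalizer.normalize, texts))

print(results)
```

Because nothing on the instance changes after construction, no locking is needed around the shared object.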

Examples

>>> from scikitplot.corpus._normalizers._text_normalizer import (
...     TextNormalizer,
... )
>>> normalizer = TextNormalizer()
>>> # doc = CorpusDocument(text="The  first  compu-\nter.", ...)
>>> # docs = normalizer.normalize_documents([doc])
>>> # docs[0].normalized_text == "The first computer."

normalize(text)

Normalise a single string using only the steps in config.steps.

Unlike normalize_text, this method:

  • Applies steps selectively: only those listed in self.config.steps are executed, in that order.

  • Returns "" for empty input rather than None.

  • Applies no min-length guard, so it never returns None; callers that need that guard should use normalize_text directly.
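
The behaviour listed above can be sketched as a small step registry. This is an illustration under stated assumptions, not the library's implementation: STEP_FUNCS and apply_steps are hypothetical names, and only the step names "unicode" and "whitespace" are taken from the examples on this page. The NFKC form and the whitespace-collapsing regex are guesses at what those steps do.

```python
import re
import unicodedata
from typing import Callable, Dict, Sequence

# Hypothetical registry. The step names come from this page's examples;
# the functions behind them are assumptions.
STEP_FUNCS: Dict[str, Callable[[str], str]] = {
    "unicode": lambda s: unicodedata.normalize("NFKC", s),
    "whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


def apply_steps(text: str, steps: Sequence[str]) -> str:
    """Apply only the named steps, in the given order; never return None."""
    if not text:
        return ""
    for name in steps:
        text = STEP_FUNCS[name](text)
    return text


# NFKC expands the U+FB01 ligature into the two letters "f" and "i".
print("\ufb01" in apply_steps("fi\ufb01rst", ["unicode"]))
print(apply_steps("Hello   world", ["whitespace"]))
print(repr(apply_steps("", ["unicode", "whitespace"])))
```

Steps that are not listed are simply not run, so `apply_steps("Hello   world", ["unicode"])` leaves the run of spaces untouched.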

Parameters:
text : str

Raw text to normalise.

Returns:
str

Normalised text, or "" if text is empty or becomes empty after normalisation.

Return type:

str

Examples

>>> n = TextNormalizer(NormalizerConfig(steps=["unicode"]))
>>> "\ufb01" not in n.normalize("fi\ufb01rst")
True
>>> n2 = TextNormalizer(NormalizerConfig(steps=["whitespace"]))
>>> "   " not in n2.normalize("Hello   world")
True
>>> TextNormalizer(NormalizerConfig()).normalize("")
''

normalize_documents(documents, *, overwrite=False)

Normalise text for a batch of CorpusDocument instances.

Parameters:
documents : Sequence[CorpusDocument]

Documents to normalise. The inputs are not mutated; new instances are returned via doc.replace().

overwrite : bool, optional

If True, re-normalise even if normalized_text is already set. Default False.

Returns:
list[CorpusDocument]

New document instances with normalized_text populated.
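
The skip-unless-overwrite behaviour and the non-mutating return can be sketched with a frozen dataclass. Doc and normalize_batch are hypothetical stand-ins for CorpusDocument and normalize_documents. Treating "already set" as "not None", and passing skipped documents through unchanged, are both assumptions of this sketch.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument with only two fields."""

    text: str
    normalized_text: Optional[str] = None


def normalize_batch(
    docs: Sequence[Doc],
    normalize: Callable[[str], str],
    *,
    overwrite: bool = False,
) -> List[Doc]:
    out: List[Doc] = []
    for doc in docs:
        if doc.normalized_text is not None and not overwrite:
            # Already normalised upstream: pass through unchanged.
            out.append(doc)
        else:
            # replace() builds a new instance; the input is left as it was.
            out.append(replace(doc, normalized_text=normalize(doc.text)))
    return out


def collapse(s: str) -> str:
    # Stand-in normalisation: collapse runs of whitespace.
    return " ".join(s.split())


docs = [Doc("The  first"), Doc("Some  text", normalized_text="preset")]
kept = normalize_batch(docs, collapse)
redone = normalize_batch(docs, collapse, overwrite=True)

print(kept[1].normalized_text)    # the preset value survives
print(redone[1].normalized_text)  # recomputed because overwrite=True
```

The originals in `docs` are unchanged after both calls, which is what lets callers keep the raw and normalised batches side by side.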

Return type:

list[Any]