TextNormalizer

class scikitplot.corpus.TextNormalizer(config=None)

Pipeline component that populates normalized_text on CorpusDocument instances.

Parameters:

config (NormalizerConfig or None, optional): Normalisation settings. If None, the defaults are used.

See also

scikitplot.corpus._schema.CorpusDocument: The document class whose normalized_text field this component populates.
scikitplot.corpus._enrichers._nlp_enricher.NLPEnricher: Downstream component that tokenises normalized_text.

Notes

User note: Insert this component between the filter and embedding stages:

source → reader → chunker → filter → **normalizer** → embedder

If normalized_text is already set on a document (e.g., by a reader that does its own cleaning), this component leaves that document's value in place unless overwrite=True is passed to normalize_documents.

Developer note: This class is stateless and thread-safe. All mutable state lives in the documents being processed.
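
The statelessness claim can be illustrated with a minimal sketch. Doc and StatelessNormalizer below are hypothetical stand-ins, not the real CorpusDocument or TextNormalizer, and the cleaning logic (NFKC plus whitespace collapsing) is assumed purely for illustration. The point is only that one shared instance with no mutable attributes can serve several threads at once:

```python
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (not the real class)."""
    text: str
    normalized_text: Optional[str] = None


class StatelessNormalizer:
    """Hypothetical stand-in: holds no mutable attributes of its own."""

    def normalize_documents(self, documents: Sequence[Doc]) -> List[Doc]:
        # Each call touches only its arguments and local variables,
        # so concurrent calls cannot interfere with one another.
        out = []
        for doc in documents:
            cleaned = " ".join(unicodedata.normalize("NFKC", doc.text).split())
            out.append(replace(doc, normalized_text=cleaned))
        return out


normalizer = StatelessNormalizer()
batches = [
    [Doc("The  \ufb01rst  batch")],
    [Doc("A   second   batch")],
]

# One shared instance, two worker threads, one batch per task.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(normalizer.normalize_documents, batches))

print([doc.normalized_text for batch in results for doc in batch])
```

Because results are returned as new objects rather than written into shared state, no locking is needed in this sketch.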

Examples

>>> from scikitplot.corpus._normalizers._text_normalizer import (
...     TextNormalizer,
... )
>>> normalizer = TextNormalizer()
>>> # doc = CorpusDocument(text="The  ﬁrst  compu-\nter.", ...)
>>> # docs = normalizer.normalize_documents([doc])
>>> # docs[0].normalized_text == "The first computer."

normalize(text)

Normalise a single string using only the steps in config.steps.

Unlike normalize_text, this method:

Applies steps selectively: only those listed in self.config.steps are executed, in that order.
Returns "" for empty input rather than None.
Never returns None. Callers that need the min-length guard should use normalize_text directly.

Parameters:

text (str): Raw text to normalise.

Returns:

str: Normalised text, or "" if text is empty or becomes empty after normalisation.

Examples

>>> n = TextNormalizer(NormalizerConfig(steps=["unicode"]))
>>> "\ufb01" not in n.normalize("fi\ufb01rst")
True
>>> n2 = TextNormalizer(NormalizerConfig(steps=["whitespace"]))
>>> "   " not in n2.normalize("Hello   world")
True
>>> TextNormalizer(NormalizerConfig()).normalize("")
''
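
The selective, ordered application of steps described for normalize can be sketched as follows. This is not the library's implementation: STEP_FUNCS and apply_steps are hypothetical names, and only the step names "unicode" and "whitespace" come from the examples above. What each step actually does inside scikitplot is assumed here (NFKC normalisation and whitespace collapsing, respectively).

```python
import re
import unicodedata
from typing import Callable, Dict, Sequence

# Hypothetical step registry. The step names match the examples above;
# the implementations are illustrative assumptions.
STEP_FUNCS: Dict[str, Callable[[str], str]] = {
    "unicode": lambda s: unicodedata.normalize("NFKC", s),
    "whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


def apply_steps(text: str, steps: Sequence[str]) -> str:
    """Run only the listed steps, in the listed order; never return None."""
    if not text:
        return ""
    for name in steps:
        text = STEP_FUNCS[name](text)
    return text


sample = "\ufb01rst   draft"  # starts with the "fi" ligature, has 3 spaces

ligature_only = apply_steps(sample, ["unicode"])      # spaces untouched
spaces_only = apply_steps(sample, ["whitespace"])     # ligature untouched
both = apply_steps(sample, ["unicode", "whitespace"])

print(repr(ligature_only), repr(spaces_only), repr(both))
```

A step that is not listed is simply never called, which is why the "unicode"-only run leaves the run of spaces intact and the "whitespace"-only run leaves the ligature intact.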

normalize_documents(documents, *, overwrite=False)

Normalise text for a batch of CorpusDocument instances.

Parameters:

documents (Sequence[CorpusDocument]): Documents to normalise. The inputs are not mutated; new instances are returned via doc.replace().
overwrite (bool, optional): If True, re-normalise even if normalized_text is already set. Default False.

Returns:

list[CorpusDocument]: New document instances with normalized_text populated.

Parameters:

documents (Sequence[Any])
overwrite (bool)

Return type:

list[Any]
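
The non-mutating, skip-unless-overwrite behaviour of normalize_documents can be sketched with stand-in types. Doc and normalize_documents_sketch are hypothetical; the real CorpusDocument exposes its own replace() method, for which dataclasses.replace substitutes here. Two details are assumptions rather than documented behaviour: "already set" is taken to mean "not None", and skipped documents are passed through as the same object.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument."""
    text: str
    normalized_text: Optional[str] = None


def normalize_documents_sketch(
    documents: Sequence[Doc], *, overwrite: bool = False
) -> List[Doc]:
    out = []
    for doc in documents:
        # Assumption: "already set" means normalized_text is not None.
        if doc.normalized_text is not None and not overwrite:
            out.append(doc)  # keep the upstream value
            continue
        cleaned = " ".join(doc.text.split())  # illustrative cleaning only
        out.append(replace(doc, normalized_text=cleaned))  # new instance
    return out


raw = Doc(text="Hello   world")
pre = Doc(text="Hello   world", normalized_text="HELLO WORLD")

kept = normalize_documents_sketch([raw, pre])
redone = normalize_documents_sketch([raw, pre], overwrite=True)

print(kept[1].normalized_text, "|", redone[1].normalized_text)
```

With the default overwrite=False, the value set upstream on pre survives; with overwrite=True it is recomputed. In both calls the input objects raw and pre are left unchanged.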