TextNormalizer
- class scikitplot.corpus.TextNormalizer(config=None)
Pipeline component that populates normalized_text on CorpusDocument instances.
- Parameters:
- config : NormalizerConfig or None, optional
Normalisation settings. None uses defaults.
See also
scikitplot.corpus._schema.CorpusDocument
The normalised normalized_text field.
scikitplot.corpus._enrichers._nlp_enricher.NLPEnricher
Downstream component that tokenises normalized_text.
Notes
User note: Insert this component between the filter and embedding stages:
source → reader → chunker → filter → **normalizer** → embedder
If normalized_text is already set on a document (e.g., by a reader that does its own cleaning), this component skips it unless overwrite=True is passed to normalize_documents.
Developer note: This class is stateless and thread-safe. All mutable state lives in the documents being processed.
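The stage order in the user note can be illustrated with a minimal sketch. Every function below is a hypothetical stand-in written for this illustration, not a scikitplot API; the point is only that normalisation runs after filtering and before embedding.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in stages; none of these are scikitplot APIs.
Stage = Callable[[List[str]], List[str]]


def keep_non_empty(chunks: List[str]) -> List[str]:
    """Filter stage: drop chunks that are blank."""
    return [c for c in chunks if c.strip()]


def collapse_whitespace(chunks: List[str]) -> List[str]:
    """Normalizer stage: collapse whitespace runs to single spaces."""
    return [" ".join(c.split()) for c in chunks]


def fake_embed(chunks: List[str]) -> List[str]:
    """Embedder stand-in: tags each chunk with its length, not a real vector."""
    return [f"{len(c)}:{c}" for c in chunks]


def run_pipeline(chunks: List[str], stages: Iterable[Stage]) -> List[str]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        chunks = stage(chunks)
    return chunks


# filter -> normalizer -> embedder, matching the order in the note
result = run_pipeline(
    ["Hello   world", "   ", "a\tb"],
    [keep_non_empty, collapse_whitespace, fake_embed],
)
```

Placing the normalizer after the filter means no work is spent cleaning chunks that are discarded anyway, and the embedder stand-in only ever sees cleaned text.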
Examples
>>> from scikitplot.corpus._normalizers._text_normalizer import (
...     TextNormalizer,
... )
>>> normalizer = TextNormalizer()
>>> # doc = CorpusDocument(text="The first compu-\\nter.", ...)
>>> # docs = normalizer.normalize_documents([doc])
>>> # docs[0].normalized_text == "The first computer."
- normalize(text)
Normalise a single string using only the steps in config.steps.
Unlike normalize_text, this method:
- Applies steps selectively: only those listed in self.config.steps are executed, in that order.
- Returns "" for empty input rather than None.
- Never returns None; callers that need the min-length guard should use normalize_text directly.
- Parameters:
- text : str
Raw text to normalise.
- Returns:
- str
Normalised text, or "" if text is empty or becomes empty after normalisation.
Examples
>>> n = TextNormalizer(NormalizerConfig(steps=["unicode"]))
>>> "\\ufb01" not in n.normalize("fi\\ufb01rst")
True
>>> n2 = TextNormalizer(NormalizerConfig(steps=["whitespace"]))
>>> "  " not in n2.normalize("Hello    world")
True
>>> TextNormalizer(NormalizerConfig()).normalize("")
''
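As a rough sketch of what "only the listed steps, in that order" could look like, the snippet below uses a small step registry. The registry, the function name apply_steps, and both step implementations (NFKC folding for "unicode", run collapsing for "whitespace") are assumptions made for illustration; the library's actual steps may behave differently.

```python
import re
import unicodedata
from typing import Callable, Dict, List

# Hypothetical registry: the step names come from the documented examples,
# but these implementations are assumptions, not the library's code.
STEP_FUNCS: Dict[str, Callable[[str], str]] = {
    "unicode": lambda s: unicodedata.normalize("NFKC", s),
    "whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


def apply_steps(text: str, steps: List[str]) -> str:
    """Run only the named steps, in the order given; never return None."""
    if not text:
        return ""
    for name in steps:
        text = STEP_FUNCS[name](text)
    return text


ligature_fixed = apply_steps("\ufb01rst", ["unicode"])       # U+FB01 ligature folded by NFKC
spaces_fixed = apply_steps("Hello    world", ["whitespace"])  # runs collapsed to one space
empty = apply_steps("", ["unicode", "whitespace"])            # empty in, empty out
blank = apply_steps("   ", ["whitespace"])                    # becomes empty after the step
```

Because the loop follows the caller's list, a step that is not listed is never run, and reordering the list changes the order of application.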
- normalize_documents(documents, *, overwrite=False)
Normalise text for a batch of CorpusDocument instances.
- Parameters:
- documents : Sequence[CorpusDocument]
Documents to normalise. Not mutated; new instances are returned via doc.replace().
- overwrite : bool, optional
If True, re-normalise even if normalized_text is already set. Default False.
- Returns:
- list[CorpusDocument]
New document instances with normalized_text populated.
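The non-mutating, skip-unless-overwrite behaviour described for normalize_documents can be sketched with a frozen dataclass. Doc and normalize_batch are hypothetical stand-ins, dataclasses.replace stands in for the doc.replace() method mentioned above, and treating "already set" as "not None" is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument with only two fields."""

    text: str
    normalized_text: Optional[str] = None


def normalize_batch(docs: Sequence[Doc], *, overwrite: bool = False) -> List[Doc]:
    """Return a new list of documents; the inputs are never modified."""
    out: List[Doc] = []
    for doc in docs:
        # Assumption: "already set" means the field is not None.
        if doc.normalized_text is not None and not overwrite:
            out.append(doc)  # pass through unchanged
            continue
        cleaned = " ".join(doc.text.split())
        out.append(replace(doc, normalized_text=cleaned))  # new instance
    return out


raw = Doc(text="Hello   world")
preset = Doc(text="Hello   world", normalized_text="custom")

kept = normalize_batch([raw, preset])                    # preset is skipped
forced = normalize_batch([raw, preset], overwrite=True)  # preset is redone
```

Returning fresh instances keeps the inputs safe to reuse or share across threads, which fits the stateless design described in the developer note.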