CustomNormalizer

class scikitplot.corpus.CustomNormalizer(fn, *, name=None, text_mode=False)

Wrap any callable as a NormalizerBase.

Parameters:
fn : callable

Normalizer callable. One of two signatures accepted:

(doc: CorpusDocument) -> CorpusDocument

Full document transform — the callable controls exactly which fields change via doc.replace().

(text: str) -> str

Pure text transform — the module wraps the result in doc.replace(normalized_text=result) automatically. Detected by inspecting whether the return value is a str.

name : str, optional

Human-readable label used in __repr__.

text_mode : bool, optional

When True, treat fn as a str -> str transform and wrap its result automatically. When False (default), treat fn as a full CorpusDocument -> CorpusDocument transform. Pass True for simple string-level operations (regex substitution, lowercasing, etc.) to avoid writing the doc.replace() boilerplate.

Raises:
TypeError

If fn is not callable.

Parameters:
  • fn (Callable[..., Any])

  • name (str | None)

  • text_mode (bool)

See also

scikitplot.corpus._normalizers.NormalizationPipeline

Chain normalizers.

scikitplot.corpus._normalizers.NormalizerBase

Abstract base class.

Notes

User note: Combine with NormalizationPipeline to slot a custom step anywhere in the normalisation sequence.

Examples

Strip citation markers [1], [2] from academic text:

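import re

from scikitplot.corpus import CustomNormalizer

def strip_citations(text: str) -> str:
    # Remove bracketed numeric markers such as [1] or [23].
    return re.sub(r"\[\d+\]", "", text)

norm = CustomNormalizer(strip_citations, text_mode=True)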
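
To make the text_mode=True behaviour concrete, here is a minimal, self-contained sketch of the wrapping described above. It does not use the library: StandInDoc and wrap_text_fn are hypothetical names introduced only for illustration, and the choice to read doc.normalized_text before falling back to doc.text is an assumption, not confirmed behaviour of CustomNormalizer.

```python
import re
from dataclasses import dataclass, replace as dc_replace
from typing import Callable, Optional


@dataclass(frozen=True)
class StandInDoc:
    """Hypothetical stand-in for CorpusDocument (not the real class)."""

    text: str
    normalized_text: Optional[str] = None

    def replace(self, **changes) -> "StandInDoc":
        # Return a modified copy, mirroring the doc.replace() usage above.
        return dc_replace(self, **changes)


def wrap_text_fn(fn: Callable[[str], str]) -> Callable[[StandInDoc], StandInDoc]:
    """Lift a str -> str function into a doc -> doc function."""

    def doc_fn(doc: StandInDoc) -> StandInDoc:
        # Assumption: prefer already-normalized text, else use the raw text.
        source = doc.normalized_text or doc.text
        return doc.replace(normalized_text=fn(source))

    return doc_fn


def strip_citations(text: str) -> str:
    # Remove bracketed numeric markers such as [1] or [23].
    return re.sub(r"\[\d+\]", "", text)


doc = StandInDoc(text="Deep nets work [1] well [23].")
result = wrap_text_fn(strip_citations)(doc)
print(repr(result.normalized_text))
```

The original document is left unchanged; only the returned copy carries the new normalized_text, which is the same contract a full document transform would follow when it calls doc.replace() itself.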

Full document transform (language detection side-channel):

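# detect is assumed to be any str -> language-code callable
# (for example langdetect.detect), imported or defined by the caller.
def tag_language(doc):
    lang = detect(doc.normalized_text or doc.text)
    return doc.replace(language=lang)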

norm = CustomNormalizer(tag_language)

normalize_doc(doc)

Apply the user-supplied callable to doc.

Parameters:
doc : CorpusDocument

The document to normalize.

Returns:
CorpusDocument

Modified document.

Raises:
RuntimeError

If the callable raises an unexpected exception.

Parameters:

doc (CorpusDocument)

Return type:

CorpusDocument
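
The RuntimeError contract above can be illustrated with a short sketch of the usual wrap-and-re-raise pattern. This is not the library's implementation: call_wrapped and broken are hypothetical helpers, and the error message format is an assumption. The point is only that a caller can catch RuntimeError and still reach the original failure through exception chaining.

```python
from typing import Any, Callable


def call_wrapped(fn: Callable[[Any], Any], doc: Any, name: str = "custom") -> Any:
    """Call fn(doc); re-raise any failure as RuntimeError (stand-in sketch)."""
    try:
        return fn(doc)
    except Exception as exc:
        # Chain the original exception so the traceback keeps the root cause.
        raise RuntimeError(f"normalizer {name!r} failed: {exc}") from exc


def broken(doc: Any) -> Any:
    # Hypothetical user callable that always fails.
    raise ValueError("bad input")


caught = None
try:
    call_wrapped(broken, doc={"text": "hello"}, name="broken")
except RuntimeError as err:
    caught = err
    # The original ValueError is available as err.__cause__.
    print(type(err.__cause__).__name__)
```

When calling normalize_doc in a batch job, catching RuntimeError per document in this way lets one bad input be logged and skipped without aborting the whole run.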