CustomNormalizer

class scikitplot.corpus.CustomNormalizer(fn, *, name=None, text_mode=False)

Wrap any callable as a NormalizerBase.

Parameters:

fn : callable

Normalizer callable. One of two signatures accepted:

(doc: CorpusDocument) -> CorpusDocument: Full document transform — the callable controls exactly which fields change via doc.replace().
(text: str) -> str: Pure text transform — the module wraps the result in doc.replace(normalized_text=result) automatically. Detected by inspecting whether the return value is a str.

name : str, optional

Human-readable label used in __repr__.

text_mode : bool, optional

When True, treat fn as a str → str transform and wrap automatically. When False (default), treat fn as a full CorpusDocument → CorpusDocument transform. Pass True for simple string-level operations (regex substitution, lowercasing, etc.) without writing the doc.replace() boilerplate.

Raises:

TypeError: If fn is not callable.

Parameters:

fn (Callable[..., Any])
name (str | None)
text_mode (bool)

See also

scikitplot.corpus._normalizers.NormalizationPipeline: Chain normalizers.
scikitplot.corpus._normalizers.NormalizerBase: Abstract base class.

Notes

User note: Combine with NormalizationPipeline to slot a custom step anywhere in the normalisation sequence.
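
The chaining idea in the note above can be illustrated without the library. The following is a minimal sketch, not the NormalizationPipeline API: `Doc`, `current_text`, `run_steps`, and the three step functions are hypothetical stand-ins, and the only assumption carried over from this page is that a document exposes `text`, `normalized_text`, and `replace()`.

```python
import dataclasses
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (not the real class)."""
    text: str
    normalized_text: Optional[str] = None

    def replace(self, **changes) -> "Doc":
        # Return a modified copy; the original stays untouched.
        return dataclasses.replace(self, **changes)


def current_text(doc: Doc) -> str:
    # Use the normalized text once an earlier step has produced it.
    return doc.text if doc.normalized_text is None else doc.normalized_text


def strip_edges(doc: Doc) -> Doc:
    return doc.replace(normalized_text=current_text(doc).strip())


def collapse_spaces(doc: Doc) -> Doc:
    # The "custom" step, slotted between two ordinary ones.
    return doc.replace(normalized_text=" ".join(current_text(doc).split()))


def lowercase(doc: Doc) -> Doc:
    return doc.replace(normalized_text=current_text(doc).lower())


def run_steps(doc: Doc, steps: Sequence[Callable[[Doc], Doc]]) -> Doc:
    # Apply each step in order, feeding each result into the next step.
    for step in steps:
        doc = step(doc)
    return doc


result = run_steps(Doc(text="  Hello   WORLD  "),
                   [strip_edges, collapse_spaces, lowercase])
print(result.normalized_text)  # hello world
```

Because every step has the same document-in, document-out shape, a custom step can be placed at any position in the list without changing the others.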

Examples

Strip citation markers [1], [2] from academic text:

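import re

from scikitplot.corpus import CustomNormalizer

def strip_citations(text: str) -> str:
    # Remove bracketed numeric citation markers such as [1] or [12].
    return re.sub(r"\[\d+\]", "", text)

norm = CustomNormalizer(strip_citations, text_mode=True)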

Full document transform (language detection side-channel):

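# Assumption: detect() comes from the third-party langdetect package;
# any function mapping text to a language code would work here.
from langdetect import detect

def tag_language(doc):
    # Prefer already-normalized text; fall back to the raw text.
    lang = detect(doc.normalized_text or doc.text)
    return doc.replace(language=lang)

norm = CustomNormalizer(tag_language)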
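
To make the difference between the two modes concrete, here is a minimal sketch of what wrapping a str → str function might look like. It does not reproduce the library's implementation: `Doc` and `wrap_text_fn` are hypothetical names, and reading from `normalized_text` before falling back to `text` is an assumption.

```python
import dataclasses
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (not the real class)."""
    text: str
    normalized_text: Optional[str] = None

    def replace(self, **changes) -> "Doc":
        # Return a modified copy; the original stays untouched.
        return dataclasses.replace(self, **changes)


def wrap_text_fn(fn: Callable[[str], str]) -> Callable[[Doc], Doc]:
    """Lift a str -> str function into a Doc -> Doc function."""
    def doc_fn(doc: Doc) -> Doc:
        # Assumption: operate on normalized_text when set, else on text.
        current = doc.text if doc.normalized_text is None else doc.normalized_text
        return doc.replace(normalized_text=fn(current))
    return doc_fn


strip = wrap_text_fn(lambda s: re.sub(r"\[\d+\]", "", s))
out = strip(Doc(text="Results hold [1][2]."))
print(out.normalized_text)  # Results hold .
print(out.text)             # Results hold [1][2].
```

The wrapper only ever writes `normalized_text`, which is why a full document transform is still needed for steps that set other fields, such as the language tag above.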

normalize_doc(doc)

Apply the user-supplied callable to doc.

Parameters:

doc : CorpusDocument

The document to transform.

Returns:

CorpusDocument: Modified document.

Raises:

RuntimeError: If the callable raises an unexpected exception.

Parameters:

doc (CorpusDocument)

Return type:

CorpusDocument
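
The two error contracts on this page (TypeError for a non-callable fn, RuntimeError when the callable fails) follow a common wrapping pattern. The sketch below shows that pattern under stated assumptions: `GuardedCallable` is a hypothetical class, and the message format and use of exception chaining are illustrative rather than taken from the library.

```python
from typing import Any, Callable, Optional


class GuardedCallable:
    """Hypothetical wrapper: validate at construction, add context on failure."""

    def __init__(self, fn: Callable[[Any], Any], *, name: Optional[str] = None) -> None:
        if not callable(fn):
            raise TypeError(f"fn must be callable, got {type(fn).__name__}")
        self._fn = fn
        self._name = name or getattr(fn, "__name__", "custom")

    def __call__(self, doc: Any) -> Any:
        try:
            return self._fn(doc)
        except Exception as exc:
            # Chain the original error so the traceback keeps the root cause.
            raise RuntimeError(f"normalizer {self._name!r} failed: {exc}") from exc


def broken(doc):
    raise ValueError("bad input")


caught = None
try:
    GuardedCallable(broken)({"text": "x"})
except RuntimeError as err:
    caught = err

print(type(caught.__cause__).__name__)  # ValueError
```

Callers can then catch a single exception type around the whole normalization step and still inspect `__cause__` when they need the underlying error.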