NormalizerConfig

class scikitplot.corpus.NormalizerConfig(unicode_form='NFKC', expand_ligatures=True, fix_hyphenation=True, collapse_whitespace=True, strip_control_chars=True, lowercase=False, min_length=1, custom_pipeline=<factory>, steps=None)

Configuration for TextNormalizer.

Parameters:
unicode_formstr

Unicode normalisation form: "NFKC" (default; compatibility decomposition followed by canonical composition), "NFC", "NFD", or "NFKD". Set to "" to disable.

expand_ligaturesbool

Replace common ligatures (ﬁ → fi, ﬂ → fl, ﬀ → ff, etc.).

fix_hyphenationbool

Re-join words split across line breaks by a trailing hyphen (e.g., "compu-\nter" → "computer").

collapse_whitespacebool

Replace runs of whitespace (including \t, \r, \n) with a single space, then strip leading and trailing whitespace.

strip_control_charsbool

Remove characters in Unicode categories Cc (control) and Cf (format), except \n and \t. This removes zero-width joiners, byte-order marks (BOM), soft hyphens, and similar invisible characters.

lowercasebool

Convert text to lowercase. Defaults to False, because casing often matters for named-entity recognition in RAG contexts.

min_lengthint

Minimum length in characters. If the normalised text is shorter than this, normalized_text is set to None so that the embedding engine falls back to the raw text.

custom_pipelinetuple of Callable[[str], str]

Additional user-supplied str → str transforms applied after all built-in steps. Order is preserved.

stepslist of str or None, optional

Ordered list of step names to apply. None (default) derives the list automatically from the boolean flags above. Pass an explicit list to run only a named subset, e.g. steps=["unicode", "whitespace"]. Valid names: "unicode", "ligatures", "control_chars", "hyphenation", "whitespace", "lowercase", "custom". Used by TextNormalizer.normalize.
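
The way the flags and steps interact can be sketched with the standard library alone. This is an illustrative approximation, not the library's implementation: normalize_sketch, LIGATURES, and the helper functions are hypothetical names, the regular expressions are guesses at reasonable behaviour, and the assumption that an explicit steps list runs regardless of the boolean flags is not confirmed by the documentation above.

```python
import re
import unicodedata
from typing import Callable, Optional, Sequence

# Hypothetical ligature table; the library's actual table may differ.
LIGATURES = str.maketrans({
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
})


def strip_control(text: str) -> str:
    # Drop category Cc/Cf characters, keeping newline and tab.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def rejoin_hyphenated(text: str) -> str:
    # "compu-\nter" -> "computer" (assumed pattern).
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def squeeze_whitespace(text: str) -> str:
    # Collapse whitespace runs to one space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def normalize_sketch(
    text: str,
    *,
    unicode_form: str = "NFKC",
    expand_ligatures: bool = True,
    fix_hyphenation: bool = True,
    collapse_whitespace: bool = True,
    strip_control_chars: bool = True,
    lowercase: bool = False,
    min_length: int = 1,
    custom_pipeline: Sequence[Callable[[str], str]] = (),
    steps: Optional[Sequence[str]] = None,
) -> Optional[str]:
    def run_custom(s: str) -> str:
        # Apply user transforms in the order given.
        for fn in custom_pipeline:
            s = fn(s)
        return s

    # Step name -> (enabled by flag, transform), in documented pipeline order.
    available = {
        "unicode": (
            bool(unicode_form),
            lambda s: unicodedata.normalize(unicode_form, s) if unicode_form else s,
        ),
        "ligatures": (expand_ligatures, lambda s: s.translate(LIGATURES)),
        "control_chars": (strip_control_chars, strip_control),
        "hyphenation": (fix_hyphenation, rejoin_hyphenated),
        "whitespace": (collapse_whitespace, squeeze_whitespace),
        "lowercase": (lowercase, str.lower),
        "custom": (bool(custom_pipeline), run_custom),
    }
    if steps is None:
        # Derive the step list from the boolean flags.
        steps = [name for name, (enabled, _) in available.items() if enabled]
    for name in steps:
        # Assumption: an explicit list runs regardless of the flags.
        text = available[name][1](text)
    # Too-short results become None so a caller can fall back to raw text.
    return text if len(text) >= min_length else None


print(normalize_sketch("compu-\nter  \ufb01le\u200d"))        # computer file
print(normalize_sketch("A\tB\n", steps=["whitespace"]))       # A B
print(normalize_sketch("   "))                                # None
```

Note that under NFKC the common Latin ligatures are already decomposed by the Unicode step, so in this sketch the ligature table only has a visible effect when unicode_form is "" or a non-compatibility form such as "NFC".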

Notes

User note: The default configuration is designed for English-language PDF and OCR sources. For CJK text, set fix_hyphenation=False (CJK text is not hyphenated across line breaks) and keep unicode_form="NFKC", which folds full-width characters to their standard forms.
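
The effect of NFKC on full-width characters can be checked directly with Python's unicodedata module. This small illustration is independent of scikitplot, and the sample string is made up for the example.

```python
import unicodedata

# Full-width "ABC", an ideographic space (U+3000), and full-width "123".
full_width = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13"

folded = unicodedata.normalize("NFKC", full_width)
print(folded)  # ABC 123

# NFC applies only canonical mappings, so full-width forms are left unchanged.
unchanged = unicodedata.normalize("NFC", full_width)
print(unchanged == full_width)  # True
```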

Developer note: steps is excluded from __hash__ and __eq__ so that two configs with identical boolean flags but different steps lists are treated as equal for caching. If you need strict equality on steps, compare the lists explicitly.
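
One common way to get this behaviour is a dataclass field declared with compare=False. The sketch below is an assumption about the mechanism, not a copy of the library's source; ConfigSketch is a hypothetical stand-in with only two of the real flags.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ConfigSketch:
    lowercase: bool = False
    collapse_whitespace: bool = True
    # compare=False drops the field from the generated __eq__ and __hash__.
    steps: Optional[List[str]] = field(default=None, compare=False)


a = ConfigSketch(steps=["unicode", "whitespace"])
b = ConfigSketch(steps=None)

print(a == b)              # True
print(hash(a) == hash(b))  # True

# Strict comparison has to look at steps explicitly.
print(a == b and a.steps == b.steps)  # False
```

Because both objects hash and compare equal, a cache keyed on the config would return the same entry for either one, which is the trade-off the developer note describes.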

collapse_whitespace: bool = True
custom_pipeline: tuple[Callable[[str], str], ...]
expand_ligatures: bool = True
fix_hyphenation: bool = True
lowercase: bool = False
min_length: int = 1
steps: list[str] | None = None

Ordered list of normalisation step names to apply.

When None (default), the list is derived automatically from the boolean flags above in pipeline order: "unicode", "ligatures", "control_chars", "hyphenation", "whitespace", "lowercase", "custom". Pass an explicit list to run only a named subset, e.g. steps=["unicode", "whitespace"].

strip_control_chars: bool = True
unicode_form: str = 'NFKC'