NormalizerConfig
- class scikitplot.corpus.NormalizerConfig(unicode_form='NFKC', expand_ligatures=True, fix_hyphenation=True, collapse_whitespace=True, strip_control_chars=True, lowercase=False, min_length=1, custom_pipeline=<factory>, steps=None)
Configuration for TextNormalizer.
- Parameters:
- unicode_form : str
Unicode normalisation form: "NFKC" (default; compatibility decomposition followed by canonical composition), "NFC", "NFD", or "NFKD". Set to "" to disable.
- expand_ligatures : bool
Replace common ligatures (ﬁ → fi, ﬂ → fl, ﬀ → ff, etc.).
- fix_hyphenation : bool
Re-join words split across line breaks by a trailing hyphen (e.g., "compu-\nter" → "computer").
- collapse_whitespace : bool
Replace runs of whitespace (including \t, \r, \n) with a single space, then strip leading and trailing whitespace.
- strip_control_chars : bool
Remove characters in Unicode categories Cc and Cf, except \n and \t. This removes zero-width joiners, BOMs, soft hyphens, etc.
- lowercase : bool
Convert text to lowercase. Defaults to False, because casing often matters for named-entity recognition in RAG contexts.
- min_length : int
If the normalised text is shorter than this (in characters), set normalized_text = None so the embedding engine falls back to the raw text.
- custom_pipeline : tuple of Callable[[str], str]
Additional user-supplied str → str transforms applied after all built-in steps. Order is preserved.
- steps : list of str or None, optional
Ordered list of step names to apply. None (default) derives the list automatically from the boolean flags above. Pass an explicit list to run only a named subset, e.g. steps=["unicode", "whitespace"]. Valid names: "unicode", "ligatures", "control_chars", "hyphenation", "whitespace", "lowercase", "custom". Used by TextNormalizer.normalize.
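The built-in steps described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: sketch_normalize and LIGATURES are hypothetical names, the ligature table is an assumed subset, and the regular expressions are one plausible reading of the parameter descriptions.

```python
import re
import unicodedata

# Hypothetical ligature table; the library's actual table may differ.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def sketch_normalize(text, min_length=1, lowercase=False):
    """Apply stand-ins for the built-in steps in the documented order."""
    # unicode: NFKC folds compatibility characters such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    # ligatures: explicit replacement (NFKC already folds these code points).
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # control_chars: drop categories Cc and Cf, keeping \n and \t.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # hyphenation: re-join a word split by a trailing hyphen and a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # whitespace: collapse runs to a single space, then strip both ends.
    text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    # min_length: None signals "fall back to the raw text".
    return text if len(text) >= min_length else None


print(sketch_normalize("compu-\nter"))                       # computer
print(sketch_normalize("\ufb01le\u200b  name\t\r\nhere"))    # file name here
print(sketch_normalize("   "))                               # None
```

In this sketch, control characters are removed before hyphenation is repaired, so a Windows-style "-\r\n" break is reduced to "-\n" and still re-joined.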
Notes
User note: The default configuration is designed for English-language PDF and OCR sources. For CJK text, set fix_hyphenation=False (CJK does not hyphenate) and keep unicode_form="NFKC", the default, which normalises full-width characters.
Developer note: steps is excluded from __hash__ and __eq__ so that two configs with identical boolean flags but different steps lists are treated as equal for caching. If you need strict equality on steps, compare the lists explicitly.
- steps: list[str] | None = None
Ordered list of normalisation step names to apply.
When None (default), the list is derived automatically from the boolean flags above in pipeline order: "unicode", "ligatures", "control_chars", "hyphenation", "whitespace", "lowercase", "custom". Pass an explicit list to run only a named subset, e.g. steps=["unicode", "whitespace"].
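The two behaviours of steps described on this page (derivation from the flags, and exclusion from equality and hashing) can be illustrated with a plain dataclass. This is a hedged sketch only: SketchConfig and effective_steps are hypothetical stand-ins, not the library's classes. The rule that "custom" is enabled only when custom_pipeline is non-empty is an assumption, as is treating an empty unicode_form as disabling the "unicode" step.

```python
from dataclasses import dataclass, field
from typing import Optional

# Documented pipeline order of step names.
PIPELINE_ORDER = (
    "unicode", "ligatures", "control_chars", "hyphenation",
    "whitespace", "lowercase", "custom",
)


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for NormalizerConfig (subset of its fields)."""

    unicode_form: str = "NFKC"
    expand_ligatures: bool = True
    strip_control_chars: bool = True
    fix_hyphenation: bool = True
    collapse_whitespace: bool = True
    lowercase: bool = False
    custom_pipeline: tuple = ()
    # compare=False keeps steps out of __eq__ and the generated __hash__.
    steps: Optional[list] = field(default=None, compare=False)


def effective_steps(cfg):
    """Return cfg.steps if given, else derive the list from the flags."""
    if cfg.steps is not None:
        return list(cfg.steps)
    enabled = {
        "unicode": bool(cfg.unicode_form),          # assumption: "" disables
        "ligatures": cfg.expand_ligatures,
        "control_chars": cfg.strip_control_chars,
        "hyphenation": cfg.fix_hyphenation,
        "whitespace": cfg.collapse_whitespace,
        "lowercase": cfg.lowercase,
        "custom": bool(cfg.custom_pipeline),        # assumption
    }
    return [name for name in PIPELINE_ORDER if enabled[name]]


a = SketchConfig()
b = SketchConfig(steps=["unicode", "whitespace"])
print(effective_steps(a))  # ['unicode', 'ligatures', 'control_chars', 'hyphenation', 'whitespace']
print(effective_steps(b))  # ['unicode', 'whitespace']
print(a == b)              # True: steps is ignored by __eq__
```

Because the two objects compare equal and hash identically, a cache keyed on the config would treat them as the same entry. Code that must distinguish them should compare a.steps and b.steps directly.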