UnicodeNormalizer
- class scikitplot.corpus.UnicodeNormalizer(form='NFC')
Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).
- Parameters:
- form {"NFC", "NFD", "NFKC", "NFKD"}, optional
Unicode normalisation form.
NFKC is recommended for NLP (it decomposes ligatures and expands compatibility characters). Default: "NFC".
- Parameters:
form (str)
Examples
>>> from scikitplot.corpus._normalizers import UnicodeNormalizer
>>> norm = UnicodeNormalizer(form="NFKC")
>>> doc = CorpusDocument.create("f.txt", 0, "\ufb01le")  # U+FB01, the "fi" ligature
>>> norm.normalize_doc(doc).normalized_text
'file'
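The difference between the four forms can be sketched with Python's standard `unicodedata` module alone. This is an illustration of the Unicode normalisation forms themselves, not of this class's internals: that `UnicodeNormalizer` behaves the same way as `unicodedata.normalize` is an assumption, and the variable names below are hypothetical.

```python
import unicodedata

# "file" written with the U+FB01 "fi" ligature (a compatibility character).
ligature_text = "\ufb01le"

# Canonical forms (NFC, NFD) leave compatibility characters untouched.
nfc = unicodedata.normalize("NFC", ligature_text)

# Compatibility forms (NFKC, NFKD) expand the ligature into plain "f" + "i".
nfkc = unicodedata.normalize("NFKC", ligature_text)

# Canonical composition vs. decomposition of an accented letter:
# "e" followed by U+0301 (combining acute accent).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)   # single code point U+00E9
round_trip = unicodedata.normalize("NFD", composed)   # back to two code points

print(repr(nfc), len(nfc))
print(repr(nfkc), len(nfkc))
print(len(decomposed), len(composed), len(round_trip))
```

With the default form="NFC" the ligature would survive normalisation, which is why the example above passes form="NFKC".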