UnicodeNormalizer

class scikitplot.corpus.UnicodeNormalizer(form='NFC')

Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).

Parameters:
form : {"NFC", "NFD", "NFKC", "NFKD"}, optional

Unicode normalisation form. NFKC is recommended for NLP because it replaces compatibility characters, such as ligatures, with their plain equivalents before composing the result. Default: "NFC".

Examples

>>> from scikitplot.corpus import UnicodeNormalizer
>>> from scikitplot.corpus import CorpusDocument  # import path assumed; adjust if it differs
>>> norm = UnicodeNormalizer(form="NFKC")
>>> doc = CorpusDocument.create("f.txt", 0, "\ufb01le")  # U+FB01 "fi" ligature followed by "le"
>>> norm.normalize_doc(doc).normalized_text
'file'
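
To see how the four forms differ without any scikitplot objects, the following minimal sketch uses only the standard-library `unicodedata` module. The helper name `normalize_text` is hypothetical and is not part of the scikitplot API; whether `UnicodeNormalizer` delegates to `unicodedata.normalize` internally is an assumption, not something this page confirms.

```python
import unicodedata

# The four normalisation forms named in the `form` parameter.
VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_text(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalise ``text`` with the given Unicode form."""
    if form not in VALID_FORMS:
        raise ValueError(f"form must be one of {sorted(VALID_FORMS)}, got {form!r}")
    return unicodedata.normalize(form, text)


ligature = "\ufb01le"   # U+FB01 "fi" ligature followed by "le"
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Canonical forms (NFC/NFD) leave compatibility characters such as ligatures alone.
nfc_ligature = normalize_text(ligature, "NFC")
# Compatibility forms (NFKC/NFKD) expand the ligature into plain "f" + "i".
nfkc_ligature = normalize_text(ligature, "NFKC")

# NFC composes base + combining mark into one code point; NFD splits it again.
composed = normalize_text(decomposed, "NFC")
redecomposed = normalize_text(composed, "NFD")

print(repr(nfc_ligature), len(nfc_ligature))
print(repr(nfkc_ligature), len(nfkc_ligature))
print(len(decomposed), len(composed), len(redecomposed))
```

The canonical forms preserve the ligature as a single code point, which is why the example above passes `form="NFKC"` to obtain `'file'`.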

normalize_doc(doc)

Apply Unicode normalisation to the document text.

Parameters:
doc : CorpusDocument
Returns:
CorpusDocument