DedupLinesNormalizer

class scikitplot.corpus.DedupLinesNormalizer(ignore_whitespace=True, min_line_length=0)

Remove exact duplicate lines while preserving first-occurrence order.

Useful for de-noising OCR output and web-scraped text, which often contain repeated navigation bars, headers, or footers.

Parameters:

ignore_whitespace : bool, optional
    When True, lines are compared after stripping; the original (un-stripped) line is preserved in the output. Default: True.
min_line_length : int, optional
    Lines shorter than this (after stripping) are always kept, even if they are duplicates. This prevents discarding single-character structural lines. Default: 0.

Examples

>>> norm = DedupLinesNormalizer()
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.\nHello.\nWorld.")
>>> norm.normalize_doc(doc).normalized_text
'Hello.\nWorld.'
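The interaction between the two parameters can be easier to follow in plain Python. The function below is a minimal sketch of the documented behaviour, not the library's implementation: the name `dedup_lines` is hypothetical, and splitting on `"\n"` and treating blank lines like any other line are assumptions that the reference above does not confirm.

```python
def dedup_lines(text, ignore_whitespace=True, min_line_length=0):
    """Hypothetical stand-alone sketch of first-occurrence line dedup.

    Not taken from scikitplot; it only mirrors the documented parameters.
    """
    seen = set()   # comparison keys of lines already emitted
    kept = []      # output lines, in first-occurrence order
    for line in text.split("\n"):  # assumption: lines are "\n"-separated
        stripped = line.strip()
        # Lines shorter than min_line_length (after stripping) are always kept.
        if len(stripped) < min_line_length:
            kept.append(line)
            continue
        # Compare on the stripped form when ignore_whitespace is True,
        # but always emit the original, un-stripped line.
        key = stripped if ignore_whitespace else line
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)
```

Under these assumptions, `dedup_lines("a\n a \nb")` drops the padded second line, while passing `ignore_whitespace=False` keeps it because the raw strings differ. With `min_line_length=2`, a repeated one-character separator such as `-` survives every time it appears, whereas longer repeated lines are still removed.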

normalize_doc(doc)

Remove duplicate lines from the document text.

Parameters:

doc : CorpusDocument

Returns:

CorpusDocument
