DedupLinesNormalizer
- class scikitplot.corpus.DedupLinesNormalizer(ignore_whitespace=True, min_line_length=0)
Remove exact duplicate lines while preserving first-occurrence order.
Useful for de-noising OCR output and web-scraped text which often contains repeated navigation bars, headers, or footers.
- Parameters:
  - ignore_whitespace : bool, optional
    When True, lines are compared after stripping; the original (un-stripped) line is preserved in the output. Default: True.
  - min_line_length : int, optional
    Lines shorter than this (after stripping) are always kept even if they are duplicates. Prevents discarding single-character structural lines. Default: 0.
Examples
>>> norm = DedupLinesNormalizer()
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.\nHello.\nWorld.")
>>> norm.normalize_doc(doc).normalized_text
'Hello.\nWorld.'
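
The documented semantics of the two parameters can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `dedup_lines` is hypothetical, and splitting on `"\n"` only is an assumption made for the illustration. It shows how `ignore_whitespace` affects the comparison key while the original line is kept, and how `min_line_length` exempts short lines from deduplication.

```python
def dedup_lines(text, ignore_whitespace=True, min_line_length=0):
    """Illustrative sketch only (hypothetical helper, not scikitplot code).

    Drops exact duplicate lines, keeping the first occurrence in order.
    """
    seen = set()
    kept = []
    # Assumption: lines are separated by "\n" only.
    for line in text.split("\n"):
        stripped = line.strip()
        # Lines shorter than the threshold (after stripping) are always kept.
        if len(stripped) < min_line_length:
            kept.append(line)
            continue
        # Compare on the stripped form if requested, but keep the original line.
        key = stripped if ignore_whitespace else line
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


print(repr(dedup_lines("Hello.\nHello.\nWorld.")))
print(repr(dedup_lines("-\n-\nfoo\nfoo", min_line_length=2)))
```

In the second call, the two `-` lines are shorter than `min_line_length` and therefore both survive, while the repeated `foo` is dropped.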