DedupLinesNormalizer

class scikitplot.corpus.DedupLinesNormalizer(ignore_whitespace=True, min_line_length=0)

Remove exact duplicate lines while preserving first-occurrence order.

Useful for de-noising OCR output and web-scraped text which often contains repeated navigation bars, headers, or footers.

Parameters:
ignore_whitespace : bool, optional

When True, lines are compared after stripping; the original (un-stripped) line is preserved in the output. Default: True.

min_line_length : int, optional

Lines whose stripped length is below this threshold are always kept, even when they are duplicates. This prevents discarding short structural lines such as single-character separators. Default: 0 (no line is exempt).
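
The interaction of the two parameters can be illustrated with a minimal standalone sketch. This is not the library's implementation: the function name `dedup_lines` is hypothetical, and it assumes that the text is split on "\n", that "stripping" means `str.strip()`, and that lines below `min_line_length` bypass the duplicate check entirely.

```python
def dedup_lines(text, ignore_whitespace=True, min_line_length=0):
    """Hypothetical sketch of first-occurrence line de-duplication."""
    seen = set()   # comparison keys of lines already emitted
    kept = []      # output lines, in original order
    for line in text.split("\n"):  # assumption: lines are separated by "\n"
        stripped = line.strip()
        # Short lines are always kept and never checked against `seen`.
        if len(stripped) < min_line_length:
            kept.append(line)
            continue
        # Compare on the stripped form, but emit the original line.
        key = stripped if ignore_whitespace else line
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


print(repr(dedup_lines("Hello.\nHello.\nWorld.")))
# 'Hello.\nWorld.'
print(repr(dedup_lines("  nav  \nnav\nbody")))
# '  nav  \nbody'
print(repr(dedup_lines("-\ntext\n-\ntext", min_line_length=2)))
# '-\ntext\n-'
```

In the second call the un-stripped first occurrence `'  nav  '` survives, matching the description of `ignore_whitespace`. In the third call the repeated `'-'` separator is kept because its stripped length is below `min_line_length`.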


Examples

>>> norm = DedupLinesNormalizer()
>>> doc = CorpusDocument.create("f.txt", 0, "Hello.\nHello.\nWorld.")
>>> norm.normalize_doc(doc).normalized_text
'Hello.\nWorld.'
normalize_doc(doc)

Remove duplicate lines from the document text.

Parameters:
doc : CorpusDocument
Returns:
CorpusDocument