DefaultFilter

class scikitplot.corpus.DefaultFilter(min_words=3, min_chars=10)

Standard noise filter ported and improved from remarx’s include_sentence.

Rejects a document when any of the following is true:

  1. The text contains no Unicode letter characters (punctuation/digit-only).

  2. The whitespace-delimited token count is less than min_words.

  3. The character count (after stripping) is less than min_chars.
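
The three checks above can be sketched on plain strings as follows. This is a minimal illustration, not the library's implementation: the function name passes_default_filter is hypothetical, and the character count follows rule 3 as written (length of the stripped text), which is an assumption about the exact counting method.

```python
import re

# Matches one Unicode letter: a word character that is neither a digit nor "_".
_LETTER = re.compile(r"[^\W\d_]")


def passes_default_filter(text, min_words=3, min_chars=10):
    """Hypothetical stand-in for the three rejection rules (not library code)."""
    stripped = text.strip()
    if _LETTER.search(stripped) is None:   # rule 1: no letter anywhere
        return False
    if len(stripped.split()) < min_words:  # rule 2: too few whitespace-delimited tokens
        return False
    if len(stripped) < min_chars:          # rule 3: too short (assumed: stripped length)
        return False
    return True


print(passes_default_filter("Hello world test."))  # True
print(passes_default_filter("12 34 56 78 90"))     # False (no letters)
print(passes_default_filter("Hi there"))           # False (2 tokens < 3)
print(passes_default_filter("a b c"))              # False (5 chars < 10)
```

Because the rules are joined by "any", the order of the checks does not change the result, only which rule rejects first.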

Parameters:
min_words : int, optional

Minimum number of whitespace-delimited tokens. Default: 3.

min_chars : int, optional

Minimum number of non-whitespace characters. Default: 10.

Notes

The letter check uses re.compile(r'[^\W\d_]', re.UNICODE), which matches any Unicode letter (including accented and non-Latin characters) while excluding digits and the underscore. This is more robust than remarx’s original ^[\W\d]+$, which could let some Unicode inputs through.
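
As a rough illustration of the two patterns, the snippet below applies each to a few sample strings. The helper names has_letter and legacy_is_noise are hypothetical, and how remarx actually applies its pattern is an assumption; only the two regular expressions come from the note above.

```python
import re

letter = re.compile(r"[^\W\d_]")          # letter pattern quoted in the note
legacy_noise = re.compile(r"^[\W\d]+$")   # legacy pattern quoted in the note


def has_letter(text):
    """True if the text contains at least one Unicode letter."""
    return letter.search(text) is not None


def legacy_is_noise(text):
    """True if the legacy pattern classifies the whole text as noise."""
    return legacy_noise.match(text) is not None


print(has_letter("café"), legacy_is_noise("café"))          # True False
print(has_letter("日本語"), legacy_is_noise("日本語"))        # True False
print(has_letter("-- 42 --"), legacy_is_noise("-- 42 --"))  # False True
# "_" is a word character but not a digit, so the legacy class does not cover it:
print(has_letter("123_456"), legacy_is_noise("123_456"))    # False False
```

The last line shows one input on which the two checks disagree: it contains no letter, yet the legacy pattern does not flag it as noise.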

Examples

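>>> # CorpusDocument is assumed to be importable from scikitplot.corpus
>>> from scikitplot.corpus import CorpusDocument, DefaultFilter
>>> f = DefaultFilter(min_words=3, min_chars=10)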
>>> doc_ok = CorpusDocument.create("f.txt", 0, "Hello world test.")
>>> doc_noise = CorpusDocument.create("f.txt", 1, "p. 56, 57.")
>>> f.include(doc_ok)
True
>>> f.include(doc_noise)
False

include(doc)

Return True if doc passes all noise checks.

Parameters:
doc : CorpusDocument

Document to evaluate.

Returns:
bool

True to include; False to discard.

Notes

Character count is measured on the stripped text to avoid counting surrounding whitespace as content.
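
A small sketch of why stripping matters, using plain strings and a hypothetical min_chars of 10: padding inflates the raw length past the threshold, while the stripped length does not reach it.

```python
text = "   too short   "
min_chars = 10  # hypothetical threshold, matching the class default

raw_length = len(text)               # counts the surrounding spaces
stripped_length = len(text.strip())  # counts only "too short"

print(raw_length >= min_chars)       # True: padding alone clears the threshold
print(stripped_length >= min_chars)  # False: the actual content is too short
```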

min_chars: int
min_words: int