DefaultFilter

class scikitplot.corpus.DefaultFilter(min_words=3, min_chars=10)

Standard noise filter ported and improved from remarx’s include_sentence.

Rejects a document when any of the following is true:

The text contains no Unicode letter characters (punctuation/digit-only).
The whitespace-delimited token count is less than min_words.
The character count (after stripping) is less than min_chars.
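
The three rejection rules above can be sketched as a standalone function. This is an illustration only, not the library's implementation: the name passes_noise_checks is hypothetical, and it assumes the character count is the length of the stripped text, as the list states. Inputs close to the thresholds may be classified differently by the library itself.

```python
import re

# Letter check as described in the Notes: any word character that is
# neither a digit nor an underscore.
_HAS_LETTER = re.compile(r"[^\W\d_]")


def passes_noise_checks(text, min_words=3, min_chars=10):
    """Hypothetical stand-in mirroring the three documented rules."""
    stripped = text.strip()
    # Rule 1: reject punctuation/digit-only text.
    if _HAS_LETTER.search(stripped) is None:
        return False
    # Rule 2: reject text with too few whitespace-delimited tokens.
    if len(stripped.split()) < min_words:
        return False
    # Rule 3: reject text that is too short (assumed: stripped length).
    if len(stripped) < min_chars:
        return False
    return True


print(passes_noise_checks("Hello world test."))  # enough letters, words, chars
print(passes_noise_checks("12, 34. 56"))         # no letters
print(passes_noise_checks("Hi there"))           # only two tokens
print(passes_noise_checks("a b c"))              # three tokens, too few chars
```

A text must clear all three checks to be kept; failing any one is enough to reject it.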

Parameters:

min_words (int, optional): Minimum number of whitespace-delimited tokens. Default: 3.
min_chars (int, optional): Minimum number of non-whitespace characters. Default: 10.

Notes

The letter check uses re.compile(r'[^\W\d_]', re.UNICODE), which matches any Unicode letter (including accented and non-Latin characters) while excluding digits and the underscore. This is more robust than remarx’s original ^[\W\d]+$, which could pass on some Unicode input.
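
The behaviour of the two patterns can be checked directly with the standard re module. The sketch below is illustrative: the variable names are hypothetical, and the underscore case is just one difference that is easy to verify, not a claim about which inputs motivated the change.

```python
import re

# Pattern quoted in the note above. re.UNICODE is already the default
# for str patterns in Python 3; it is kept here to mirror the note.
has_letter = re.compile(r"[^\W\d_]", re.UNICODE)

# remarx's original pattern as quoted above: matches text made up
# entirely of non-word characters and digits.
old_noise = re.compile(r"^[\W\d]+$")

# Accented and non-Latin letters count as letters.
print(bool(has_letter.search("é")))
print(bool(has_letter.search("日本語")))

# Digits, underscores, spaces and punctuation do not.
print(bool(has_letter.search("123_ 45")))

# Underscore-only text: the old pattern does not flag it as noise,
# because "_" is a word character and not a digit. The letter check
# finds no letter in it and so rejects it.
print(old_noise.match("___") is None)
print(has_letter.search("___") is None)
```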

Examples

>>> f = DefaultFilter(min_words=3, min_chars=10)
>>> doc_ok = CorpusDocument.create("f.txt", 0, "Hello world test.")
>>> doc_noise = CorpusDocument.create("f.txt", 1, "p. 56, 57.")
>>> f.include(doc_ok)
True
>>> f.include(doc_noise)
False

include(doc)

Return True if doc passes all noise checks.

Parameters:

doc (CorpusDocument): Document to evaluate.

Returns:

bool: True to include; False to discard.

Return type:

bool

Notes

Character count is measured on the stripped text to avoid counting surrounding whitespace as content.
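
A short plain-Python illustration of why the stripped text is measured (the variable names are hypothetical and the library is not called here): padding can push the raw length past min_chars even when the actual content is short.

```python
min_chars = 10

# Three spaces of padding on each side of a 7-character string.
padded = "   ok then   "

raw_length = len(padded)               # counts the padding
content_length = len(padded.strip())   # counts only the content

# The raw length would clear the threshold; the stripped length does not.
print(raw_length >= min_chars, content_length >= min_chars)
```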

min_chars: int

min_words: int