DefaultFilter
- class scikitplot.corpus.DefaultFilter(min_words=3, min_chars=10)
Standard noise filter ported and improved from remarx’s include_sentence.
Rejects a document when any of the following is true:
- The text contains no Unicode letter characters (punctuation/digit-only).
- The whitespace-delimited token count is less than min_words.
- The character count (after stripping) is less than min_chars.
- Parameters:
- min_words : int, optional
Minimum number of whitespace-delimited tokens. Default: 3.
- min_chars : int, optional
Minimum number of non-whitespace characters. Default: 10.
Notes
The letter check uses re.compile(r'[^\W\d_]', re.UNICODE), which matches any Unicode letter (including accented and non-Latin characters) while excluding digits and the underscore. This is more robust than remarx’s original ^[\W\d]+$, which could pass on some Unicode inputs.
Examples
>>> f = DefaultFilter(min_words=3, min_chars=10)
>>> doc_ok = CorpusDocument.create("f.txt", 0, "Hello world test.")
>>> doc_noise = CorpusDocument.create("f.txt", 1, "p. 56, 57.")
>>> f.include(doc_ok)
True
>>> f.include(doc_noise)
False
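The letter check described in the Notes can be tried on its own with the standard re module. The sketch below is illustrative only: has_letter, LETTER and OLD_NOISE are hypothetical names, not part of scikitplot.corpus, and the second pattern is simply the older expression quoted in the Notes.

```python
import re

# Pattern quoted in the Notes: a word character that is neither a digit
# nor an underscore, which leaves letters in any script.
LETTER = re.compile(r"[^\W\d_]", re.UNICODE)

# Older pattern quoted in the Notes: the whole string consists of
# non-word characters and digits.
OLD_NOISE = re.compile(r"^[\W\d]+$")


def has_letter(text: str) -> bool:
    """Hypothetical helper: True if text contains at least one letter."""
    return LETTER.search(text) is not None


samples = ["café", "日本語", "p. 56", "12-34", "___", "٣٤٥"]
for s in samples:
    # Columns: input, letter found, older pattern flags it as noise.
    print(repr(s), has_letter(s), OLD_NOISE.match(s) is not None)
```

An underscore-only string such as "___" is one input where the two patterns disagree: it contains no letter, yet the older pattern does not match it because the underscore counts as a word character.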
- include(doc)
Return True if doc passes all noise checks.
- Parameters:
- doc : CorpusDocument
Document to evaluate.
- Returns:
- bool
True to include; False to discard.
Notes
Character count is measured on the stripped text to avoid counting surrounding whitespace as content.
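As a rough illustration of how the three checks combine, here is a minimal standalone sketch that works on plain strings. It is not the library's implementation: passes_noise_checks is a hypothetical function, and it assumes the character count is len(text.strip()), which is one reading of the description above.

```python
import re

# Letter pattern quoted in the class Notes.
_LETTER = re.compile(r"[^\W\d_]", re.UNICODE)


def passes_noise_checks(text: str, min_words: int = 3, min_chars: int = 10) -> bool:
    """Hypothetical stand-in for the checks described for include()."""
    stripped = text.strip()
    # Check 1: at least one Unicode letter must be present.
    if _LETTER.search(stripped) is None:
        return False
    # Check 2: enough whitespace-delimited tokens.
    if len(stripped.split()) < min_words:
        return False
    # Check 3: enough characters once surrounding whitespace is removed
    # (assumption: length of the stripped text).
    if len(stripped) < min_chars:
        return False
    return True


print(passes_noise_checks("Hello world test."))
print(passes_noise_checks("12345 67890 !!!"))
print(passes_noise_checks("   a b c d e   "))
```

In the last call the padded string is 15 characters long but only 9 after stripping, so under this reading it falls below the default min_chars even though the raw length would not.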