FilterBase
- class scikitplot.corpus.FilterBase
Abstract base class for corpus document filters.
A filter receives a fully-constructed CorpusDocument and returns True if it should be included in the output corpus, False if it should be discarded.
Filters are applied after chunking and before embedding, so they operate on already-segmented text. This is the correct place to discard noise tokens, very short fragments, duplicate content, etc.
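The ordering described above (chunk, then filter, then embed) can be sketched as follows. This is a minimal illustration, not the library's pipeline: `Doc`, `chunk`, `embed`, and `MostlyLettersFilter` are hypothetical stand-ins written for this example. Only the `include(doc)` contract and the `doc.text` attribute are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; only `text` is assumed."""
    text: str


class MostlyLettersFilter:
    """Hypothetical noise filter: keep chunks that are mostly alphabetic."""

    def __init__(self, min_letter_ratio: float = 0.5):
        self.min_letter_ratio = min_letter_ratio

    def include(self, doc: Doc) -> bool:
        if not doc.text:
            return False
        letters = sum(ch.isalpha() for ch in doc.text)
        return letters / len(doc.text) >= self.min_letter_ratio


def chunk(raw: str) -> list[Doc]:
    """Hypothetical chunker: one document per non-empty line."""
    return [Doc(line.strip()) for line in raw.splitlines() if line.strip()]


def embed(docs: list[Doc]) -> list[int]:
    """Placeholder for an embedding step; it only returns text lengths."""
    return [len(d.text) for d in docs]


raw = "Filters run after chunking.\n-----\n1 2 3 4 5\nThey run before embedding."
chunks = chunk(raw)                            # 1. chunk
flt = MostlyLettersFilter()
kept = [d for d in chunks if flt.include(d)]   # 2. filter
vectors = embed(kept)                          # 3. embed the survivors only
```

Because the filter sits between the two stages, the separator line and the digit-only line never reach the (typically more expensive) embedding step.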
See also
scikitplot.corpus._filters.DefaultFilter
Standard word-count + letter filter.
scikitplot.corpus._filters.CompositeFilter
Chain multiple filters with AND logic.
scikitplot.corpus._filters.SectionFilter
Filter by SectionType membership.
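To illustrate what chaining filters with AND logic means, here is a minimal sketch. `AllOf`, `MinChars`, `HasLetters`, and `Doc` are hypothetical classes written for this example. They are not the implementation of CompositeFilter, whose constructor signature is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; only `text` is assumed."""
    text: str


class MinChars:
    """Hypothetical filter: keep documents with at least `n` characters."""

    def __init__(self, n: int):
        self.n = n

    def include(self, doc: Doc) -> bool:
        return len(doc.text) >= self.n


class HasLetters:
    """Hypothetical filter: keep documents containing any letter."""

    def include(self, doc: Doc) -> bool:
        return any(ch.isalpha() for ch in doc.text)


class AllOf:
    """Hypothetical AND combinator: a document must pass every filter."""

    def __init__(self, *filters):
        self.filters = tuple(filters)

    def include(self, doc: Doc) -> bool:
        # all() short-circuits on the first filter that rejects the document.
        return all(f.include(doc) for f in self.filters)


both = AllOf(MinChars(5), HasLetters())
```

With this sketch, `both.include(Doc("hello world"))` passes both checks, while `Doc("12345678")` fails the letter check and `Doc("hi")` fails the length check.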
Notes
Filters must be side-effect free: calling include() must not modify the document or any shared state.
Examples
Implementing a length filter:
>>> from scikitplot.corpus import FilterBase
>>> class LengthFilter(FilterBase):
...     def __init__(self, min_chars: int = 10):
...         self.min_chars = min_chars
...
...     def include(self, doc):
...         # Keep only documents whose text is long enough.
...         return len(doc.text) >= self.min_chars
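A possible way to exercise LengthFilter without installing the library is sketched below. The local `FilterBase` and `Doc` classes are stand-ins that only mirror what this page documents (an abstract `include(doc)` method and a `text` attribute); the real classes may carry more behavior. `Doc` is declared frozen so that an accidental mutation inside `include()` would fail loudly, in keeping with the side-effect-free requirement.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class FilterBase(ABC):
    """Local stand-in mirroring the documented interface (an assumption)."""

    @abstractmethod
    def include(self, doc) -> bool:
        """Return True if `doc` should be kept."""


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; only `text` is assumed."""
    text: str


class LengthFilter(FilterBase):
    def __init__(self, min_chars: int = 10):
        self.min_chars = min_chars

    def include(self, doc) -> bool:
        return len(doc.text) >= self.min_chars


docs = [Doc("ok"), Doc("a sufficiently long chunk")]
flt = LengthFilter(min_chars=10)
kept = [d for d in docs if flt.include(d)]  # only the second document survives

# The abstract base itself cannot be instantiated.
try:
    FilterBase()
    base_instantiable = True
except TypeError:
    base_instantiable = False
```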
- abstractmethod include(doc)
Return True if doc should be included in the corpus.
- Parameters:
- doc : CorpusDocument
Document to evaluate. Must be a valid, validated instance.
- Returns:
- bool
True to include; False to discard.
Notes
Must never raise for a valid CorpusDocument. For unexpected inputs, return False defensively rather than raising, unless the input indicates a programming error (e.g. None passed instead of a document).
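One way to follow this guidance is sketched below with hypothetical names (`SafeLengthFilter`, `Doc`). It discards documents whose text is unusable by returning False, but raises when None is passed, since that points to a bug in the caller. Treating a non-string `text` as "unexpected input" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; `text` is loosely typed on purpose."""
    text: Any


class SafeLengthFilter:
    """Hypothetical length filter with defensive handling of odd inputs."""

    def __init__(self, min_chars: int = 10):
        self.min_chars = min_chars

    def include(self, doc) -> bool:
        if doc is None:
            # Programming error: surface it instead of hiding it.
            raise TypeError("include() expected a document, got None")
        text = getattr(doc, "text", None)
        if not isinstance(text, str):
            # Unexpected content: discard defensively rather than raise.
            return False
        return len(text) >= self.min_chars


flt = SafeLengthFilter(min_chars=5)
```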