FilterBase

class scikitplot.corpus.FilterBase

Abstract base class for corpus document filters.

A filter receives a fully constructed CorpusDocument and returns True if the document should be kept in the output corpus, or False if it should be discarded.

Filters are applied after chunking and before embedding, so they operate on already-segmented text. This is the correct place to discard noise tokens, very short fragments, duplicate content, etc.

See also

scikitplot.corpus._filters.DefaultFilter: Standard word-count + letter filter.
scikitplot.corpus._filters.CompositeFilter: Chain multiple filters with AND logic.
scikitplot.corpus._filters.SectionFilter: Filter by SectionType membership.

Notes

Filters must be side-effect free: calling include() must not modify the document or any shared state.
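One way to check the side-effect rule in a unit test is to run the filter on a copy of the document and compare the copy with the original afterwards. The sketch below is only an illustration: Doc is a hypothetical stand-in for CorpusDocument, and PureFilter, MutatingFilter, and leaves_document_unchanged are names invented for this example, not part of scikitplot.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for CorpusDocument (assumption)."""
    text: str
    metadata: dict = field(default_factory=dict)


class PureFilter:
    """Reads the document and nothing else."""

    def include(self, doc) -> bool:
        return bool(doc.text.strip())


class MutatingFilter:
    """Violates the rule: writes to the document while filtering."""

    def include(self, doc) -> bool:
        doc.metadata["seen"] = True  # side effect on the document
        return True


def leaves_document_unchanged(flt, doc) -> bool:
    """Return True if flt.include() leaves a copy of doc equal to doc."""
    probe = copy.deepcopy(doc)  # the caller's document is never touched
    flt.include(probe)
    return probe == doc
```

A check like this only detects changes to the document itself. Changes to shared state held elsewhere (module globals, caches on the filter) would need their own tests.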

Examples

Implementing a length filter:

>>> class LengthFilter(FilterBase):
...     def __init__(self, min_chars: int = 10):
...         self.min_chars = min_chars
...
...     def include(self, doc):
...         return len(doc.text) >= self.min_chars
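The LengthFilter above could be applied to a list of chunked documents roughly as follows. This is a minimal sketch that runs without scikitplot installed: the FilterBase and Doc classes here are stand-ins written for the example (assumptions, not the library's classes), and only the include(doc) method and the doc.text attribute are taken from the documentation above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class FilterBase(ABC):
    """Stand-in for scikitplot.corpus.FilterBase (assumption)."""

    @abstractmethod
    def include(self, doc) -> bool:
        """Return True to keep doc, False to discard it."""


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; only `text` is assumed."""
    text: str


class LengthFilter(FilterBase):
    def __init__(self, min_chars: int = 10):
        self.min_chars = min_chars

    def include(self, doc) -> bool:
        return len(doc.text) >= self.min_chars


# Hypothetical chunks, as they might look after the chunking step.
chunks = [Doc("ok"), Doc("A sentence long enough to keep."), Doc("")]

length_filter = LengthFilter(min_chars=10)
kept = [doc for doc in chunks if length_filter.include(doc)]
# kept now holds only the second document.
```

Because include() is declared abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake at construction time rather than during filtering.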

abstractmethod include(doc)

Return True if doc should be included in the corpus.

Parameters:

doc (CorpusDocument): Document to evaluate. Must be a validated instance.

Returns:

bool: True to include; False to discard.

Notes

Must never raise for a valid CorpusDocument. For an unexpected input_path, return False defensively rather than raising, unless the problem indicates a programming error (e.g. None passed instead of a document).
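The contract above can be sketched as follows: malformed content yields False, while an obvious programming error such as passing None raises. Doc and MinWordsFilter are hypothetical names used only for this illustration, and the choice of TypeError is an assumption, since the documentation does not name an exception type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument (assumption)."""
    text: object  # typed loosely so the example can carry malformed content


class MinWordsFilter:
    """Keep documents with at least `min_words` whitespace-separated words."""

    def __init__(self, min_words: int = 3):
        self.min_words = min_words

    def include(self, doc) -> bool:
        if doc is None:
            # Programming error: the caller passed no document at all.
            raise TypeError("include() expects a document, got None")
        text = getattr(doc, "text", None)
        if not isinstance(text, str):
            # Unexpected content: discard defensively instead of raising.
            return False
        return len(text.split()) >= self.min_words
```

Returning False for unexpected content means one bad chunk is dropped without aborting the run, while the None check still fails loudly on a caller bug.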