PipelineHooks

class scikitplot.corpus.PipelineHooks(pre_read_hook=None, post_read_hook=None, post_filter_hook=None, post_embed_hook=None, pre_export_hook=None)

Lifecycle callbacks for HookableCorpusPipeline.

Every hook is optional (None = no-op). Hooks are called in the order listed here, at the pipeline stages indicated.

Parameters:
pre_read_hook : callable or None, optional

Called before the reader iterates over a source. Receives the source label string. Signature:

def pre_read_hook(source: str) -> None: ...
post_read_hook : callable or None, optional

Called after all documents have been read (before embedding). Receives the source label and the collected document list. May return a modified document list or None (no modification). Signature:

def post_read_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
post_filter_hook : callable or None, optional

Called after the built-in filter stage (inside the reader’s get_documents()). This hook runs once per document as a final inclusion gate: return True to keep the document, False to discard it. Signature:

def post_filter_hook(doc: CorpusDocument) -> bool: ...
post_embed_hook : callable or None, optional

Called after embedding is complete. Receives (source, documents) and may return a modified list or None. Signature:

def post_embed_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
pre_export_hook : callable or None, optional

Called before exporting documents to disk. Receives (source, documents) and may return a modified document list or None (no modification). Signature:

def pre_export_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
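
The three list-level hooks (post_read_hook, post_embed_hook, pre_export_hook) share one contract: return a new list to replace the documents, or None to leave them unchanged. post_filter_hook is different: it is a per-document predicate. The sketch below illustrates both contracts. It does not reproduce the library's internals: `Doc` is a stand-in for CorpusDocument, and `apply_list_hook` is a hypothetical helper written only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    """Stand-in for CorpusDocument (hypothetical, for illustration only)."""
    text: str


ListHook = Callable[[str, list], Optional[list]]


def apply_list_hook(hook: Optional[ListHook], source: str, docs: list) -> list:
    """Hypothetical helper: apply a list-level hook, treating None as 'no change'."""
    if hook is None:  # unset hook: no function call at all
        return docs
    result = hook(source, docs)
    # A hook that returns None leaves the documents untouched.
    return docs if result is None else result


def drop_empty(source: str, docs: list) -> list:
    """List-level hook that returns a modified list."""
    return [d for d in docs if d.text.strip()]


def just_log(source: str, docs: list) -> None:
    """List-level hook that only observes and returns None."""
    print(f"{source}: {len(docs)} documents")


def long_enough(doc: Doc) -> bool:
    """post_filter_hook-style predicate: True keeps the document."""
    return len(doc.text.strip()) >= 5


docs = [Doc("alpha"), Doc("   "), Doc("beta")]
filtered = apply_list_hook(drop_empty, "notes.txt", docs)
unchanged = apply_list_hook(just_log, "notes.txt", docs)
skipped = apply_list_hook(None, "notes.txt", docs)
kept = [d for d in docs if long_enough(d)]
```

The helper tests `result is None` rather than truthiness, so a hook that returns an empty list still discards every document. Whether the pipeline itself makes the same distinction is not stated here, so it is worth verifying before relying on it.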

Notes

User note: Hooks are called with minimal overhead, since only non-None hooks incur a function call. Hook exceptions are caught and logged as warnings; they never abort the pipeline run.
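
The behavior described in this note can be pictured with a small dispatcher. This is a minimal sketch, not the library's implementation: `call_hook_safely` and the logger name are hypothetical. It also assumes that a failed hook is treated like one that returned None, which the note does not spell out.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("hooks_sketch")  # hypothetical logger name


def call_hook_safely(hook: Optional[Callable[..., Any]], *args: Any) -> Any:
    """Hypothetical dispatcher: skip unset hooks, log failures, never raise."""
    if hook is None:
        return None  # no-op: the hook is never called
    try:
        return hook(*args)
    except Exception as exc:
        # Log the failure as a warning and carry on with the run.
        logger.warning("hook %r failed: %s", getattr(hook, "__name__", hook), exc)
        return None


def broken_hook(source: str, docs: list) -> list:
    raise ValueError("boom")


docs = ["a", "b"]
result = call_hook_safely(broken_hook, "notes.txt", docs)
# Assumption: a failed hook leaves the documents as they were.
docs_after = docs if result is None else result
```

One practical consequence of this design is that a buggy hook fails quietly: the run completes, but the hook's intended modification is missing. Checking the warning log after a run is the way to notice this.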

Examples

Log progress and filter by source type in post_read_hook:

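from scikitplot.corpus import PipelineHooks
# Assumption: HookableCorpusPipeline is importable from the same package.
from scikitplot.corpus import HookableCorpusPipeline

def log_read(source, docs):
    # Returns None, so the `or` below falls through to keep_research.
    print(f"{source}: {len(docs)} documents read")

def keep_research(source, docs):
    from scikitplot.corpus._schema import SourceType

    return [d for d in docs if d.source_type == SourceType.RESEARCH]

hooks = PipelineHooks(
    pre_read_hook=lambda src: print(f"Starting: {src}"),
    # log_read returns None (falsy), so the hook returns the filtered list.
    post_read_hook=lambda src, docs: log_read(src, docs)
    or keep_research(src, docs),
)
pipeline = HookableCorpusPipeline(hooks=hooks)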
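
Each stage accepts a single callable, so running several callbacks at one stage means composing them into one function. The sketch below shows one way to do that with a hypothetical `chain_hooks` helper, which is not part of the library. It threads the document list through each hook in order and treats a None return as "no modification". Plain strings stand in for CorpusDocument objects.

```python
from typing import Callable, Optional

ListHook = Callable[[str, list], Optional[list]]


def chain_hooks(*hooks: ListHook) -> ListHook:
    """Hypothetical helper: run list-level hooks in order, threading the list through."""

    def chained(source: str, docs: list) -> list:
        for hook in hooks:
            result = hook(source, docs)
            if result is not None:  # None means "no modification"
                docs = result
        return docs

    return chained


def log_read(source, docs):
    print(f"{source}: {len(docs)} documents read")


def keep_nonempty(source, docs):
    return [d for d in docs if d]


combined = chain_hooks(log_read, keep_nonempty)
out = combined("notes.txt", ["a", "", "b"])
```

A function built this way could be passed as post_read_hook, post_embed_hook, or pre_export_hook, since all three share the (source, documents) signature.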

post_embed_hook: Callable[[str, list[Any]], list[Any] | None] | None = None
post_filter_hook: Callable[[Any], bool] | None = None
post_read_hook: Callable[[str, list[Any]], list[Any] | None] | None = None
pre_export_hook: Callable[[str, list[Any]], list[Any] | None] | None = None
pre_read_hook: Callable[[str], None] | None = None