PipelineHooks
- class scikitplot.corpus.PipelineHooks(pre_read_hook=None, post_read_hook=None, post_filter_hook=None, post_embed_hook=None, pre_export_hook=None)
Lifecycle callbacks for HookableCorpusPipeline.
Every hook is optional (None = no-op). Hooks are called in the order listed here, at the pipeline stages indicated.
- Parameters:
- pre_read_hook : callable or None, optional
Called before the reader iterates a source. Receives the source label string. Signature:
def pre_read_hook(source: str) -> None: ...
- post_read_hook : callable or None, optional
Called after all documents have been read (before embedding). Receives the source label and the collected document list. May return a modified document list or None (no modification). Signature:
def post_read_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
- post_filter_hook : callable or None, optional
Called after the built-in filter stage (inside the reader’s get_documents()). This hook runs per document as a final inclusion gate: return True to keep, False to discard. Signature:
def post_filter_hook(doc: CorpusDocument) -> bool: ...
- post_embed_hook : callable or None, optional
Called after embedding is complete. Receives (source, documents) and may return a modified list or None. Signature:
def post_embed_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
- pre_export_hook : callable or None, optional
Called before exporting documents to disk. May return a modified document list or None. Signature:
def pre_export_hook(
    source: str,
    documents: list[CorpusDocument],
) -> list[CorpusDocument] | None: ...
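The three hook shapes listed above (a side-effect hook, a per-document predicate, and a list-transforming hook) can be illustrated with a minimal sketch. Doc is a hypothetical stand-in for CorpusDocument, and its text and embedding fields are assumptions made for this example rather than the library's schema. The hooks are called directly here; in real use they would be passed to PipelineHooks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for CorpusDocument (fields are assumed)."""

    text: str
    embedding: list[float] | None = None


seen_sources: list[str] = []


def pre_read_hook(source: str) -> None:
    # Side effect only: record which source is about to be read.
    seen_sources.append(source)


def post_filter_hook(doc: Doc) -> bool:
    # Per-document gate: keep documents with at least three words.
    return len(doc.text.split()) >= 3


def post_embed_hook(source: str, documents: list[Doc]) -> list[Doc] | None:
    # Drop documents that came back without an embedding.
    kept = [d for d in documents if d.embedding is not None]
    # Returning None signals "no modification" when nothing was dropped.
    return None if len(kept) == len(documents) else kept


docs = [
    Doc("too short"),
    Doc("long enough to keep", embedding=[0.1, 0.2]),
    Doc("another usable document"),
]
pre_read_hook("notes.txt")
filtered = [d for d in docs if post_filter_hook(d)]
result = post_embed_hook("notes.txt", filtered)
```

The predicate hook returns a plain bool per document, while the list hooks return either a replacement list or None, matching the signatures documented above.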
Notes
User note: Hooks are called with minimal overhead: only non-None hooks incur a function call. Hook exceptions are caught and logged as warnings; they never abort the pipeline run.
Examples
Log progress and filter by source-type in post_read:
def log_read(source, docs):
    print(f"{source}: {len(docs)} documents read")

def keep_research(source, docs):
    from scikitplot.corpus._schema import SourceType
    return [d for d in docs if d.source_type == SourceType.RESEARCH]

hooks = PipelineHooks(
    pre_read_hook=lambda src: print(f"Starting: {src}"),
    # log_read returns None, so `or` falls through to keep_research.
    post_read_hook=lambda src, docs: log_read(src, docs) or keep_research(src, docs),
)
pipeline = HookableCorpusPipeline(hooks=hooks)
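The calling convention described in the Notes (skip hooks that are None, log hook exceptions as warnings instead of aborting, and treat a None return as "no modification") can be sketched as follows. run_list_hook is a hypothetical helper written for this illustration; it is not part of scikitplot, and the library's internal implementation may differ.

```python
from __future__ import annotations

import logging
from typing import Callable, Optional

logger = logging.getLogger("hook_sketch")

# Shape shared by post_read_hook, post_embed_hook and pre_export_hook.
ListHook = Callable[[str, list], Optional[list]]


def run_list_hook(hook: ListHook | None, source: str, documents: list) -> list:
    """Apply a list-transforming hook defensively (hypothetical helper)."""
    if hook is None:
        # Unset hooks cost nothing: no function call is made.
        return documents
    try:
        result = hook(source, documents)
    except Exception as exc:
        # A failing hook is logged as a warning and the run continues.
        logger.warning("hook failed for %s: %s", source, exc)
        return documents
    # None means "no modification"; any other value replaces the list.
    return documents if result is None else result


def broken(source, docs):
    raise ValueError("boom")


docs = ["a", "b", "c"]
unchanged = run_list_hook(None, "src", docs)
survived = run_list_hook(broken, "src", docs)
trimmed = run_list_hook(lambda s, d: d[:1], "src", docs)
passthrough = run_list_hook(lambda s, d: None, "src", docs)
```

The sketch checks `result is None` rather than truthiness, so a hook that deliberately returns an empty list still replaces the documents instead of being mistaken for "no modification".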