HookableCorpusPipeline

class scikitplot.corpus.HookableCorpusPipeline(hooks=None, chunker=None, filter_=None, embedding_engine=None, output_path=None, export_format=None, default_language=None, progress_callback=None, reader_kwargs=None)

CorpusPipeline extended with per-stage lifecycle hooks.

Accepts the same constructor parameters as CorpusPipeline, plus a PipelineHooks instance. It is a drop-in replacement: the public interface (run, run_batch, run_url) is identical.

Parameters:

hooks (PipelineHooks or None, optional): Lifecycle callbacks. None disables all hooks (identical behaviour to bare CorpusPipeline).
chunker (ChunkerBase or None, optional): Chunker to inject into every reader.
filter_ (FilterBase or None, optional): Filter applied after chunking.
embedding_engine (EmbeddingEngine or None, optional): Embedding backend.
output_path (pathlib.Path or None, optional): Output directory for exports.
export_format (ExportFormat or None, optional): Default export format.
default_language (str or None, optional): ISO 639-1 language code.
progress_callback (callable or None, optional): Progress notification callback.
reader_kwargs (dict or None, optional): Extra kwargs forwarded to each reader.

Parameters:

hooks (PipelineHooks | None)
chunker (Any | None)
filter_ (Any | None)
embedding_engine (Any | None)
output_path (Any | None)
export_format (Any | None)
default_language (str | None)
progress_callback (Callable[[str, int, int], None] | None)
reader_kwargs (dict[str, Any] | None)
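
A callback matching the annotated type Callable[[str, int, int], None] might look like the sketch below. The meaning given to the three arguments here (a label, a completed count, and a total) is an assumption; this page states only their types. The name record_progress is hypothetical.

```python
from typing import List, Tuple

# Collected events, kept so the callback's behaviour can be inspected.
seen: List[Tuple[str, int, int]] = []


def record_progress(label: str, done: int, total: int) -> None:
    """Record one progress event and print a short status line.

    The argument meanings are assumed, not taken from the library.
    """
    seen.append((label, done, total))
    print(f"{label}: {done}/{total}")


# Simulated calls; in real use the pipeline would invoke the callback.
for i in range(1, 4):
    record_progress("corpus.pdf", i, 3)
```

Such a function could then be passed as progress_callback=record_progress when constructing the pipeline.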

Notes

User note: Every hook call is wrapped in a try/except guard, so a hook that raises does not abort the pipeline.
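
The guard in the user note can be pictured with a small standalone sketch. The helper name call_hook_safely and its logging behaviour are assumptions made for illustration; the library's internal guard may differ.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def call_hook_safely(hook: Optional[Callable[..., Any]], *args: Any) -> Any:
    """Call ``hook`` if one is set; log and swallow any exception it raises.

    Hypothetical helper, not part of scikitplot.
    """
    if hook is None:
        return None
    try:
        return hook(*args)
    except Exception:
        # A failing hook is reported but does not abort the caller.
        logger.exception("Hook %r raised; continuing", hook)
        return None


def failing_hook(source: str) -> None:
    raise RuntimeError("boom")


# The call returns normally even though the hook raised.
result = call_hook_safely(failing_hook, "corpus.pdf")
```

One consequence of this design is that a buggy hook fails quietly, so checking the log output is the way to notice it.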

Developer note: Hook injection points:

pre_read_hook → called at the start of _run_source.
post_read_hook → called after _collect_documents; may return a modified list (returning None leaves the list unchanged).
post_filter_hook → installed as an additional post-filter on the FilterBase passed to the reader via a _CompositeHookFilter wrapper.
post_embed_hook → called after _embed_documents.
pre_export_hook → called inside _export before writing.
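
Two of the injection points above can be sketched in isolation: the None-return convention of post_read_hook, and the idea of wrapping a base filter together with a hook. The names apply_post_read_hook and CompositeHookFilter, and the assumption that filters and hooks are plain callables returning a bool, are hypothetical. The real _CompositeHookFilter and FilterBase interfaces are not shown on this page.

```python
from typing import Any, Callable, List, Optional

PostReadHook = Callable[[str, List[Any]], Optional[List[Any]]]


def apply_post_read_hook(
    hook: Optional[PostReadHook], source: str, docs: List[Any]
) -> List[Any]:
    """Return the hook's list, or the original docs when it returns None."""
    if hook is None:
        return docs
    replaced = hook(source, docs)
    # Compare against None explicitly so an empty list is still honoured.
    return docs if replaced is None else replaced


class CompositeHookFilter:
    """Hypothetical wrapper: apply the base filter, then the hook.

    Assumes both are callables of the form ``doc -> bool``.
    """

    def __init__(
        self,
        base: Optional[Callable[[Any], bool]],
        hook: Optional[Callable[[Any], bool]],
    ) -> None:
        self.base = base
        self.hook = hook

    def __call__(self, doc: Any) -> bool:
        if self.base is not None and not self.base(doc):
            return False
        return True if self.hook is None else bool(self.hook(doc))


docs = ["short", "a considerably longer document"]
unchanged = apply_post_read_hook(lambda src, d: None, "corpus.pdf", docs)
trimmed = apply_post_read_hook(
    lambda src, d: [x for x in d if len(x) > 10], "corpus.pdf", docs
)
keep = CompositeHookFilter(base=lambda d: bool(d), hook=lambda d: "longer" in d)
kept = [d for d in docs if keep(d)]
```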

Examples

import logging
from pathlib import Path

from scikitplot.corpus._custom_hooks import (
    HookableCorpusPipeline,
    PipelineHooks,
)

logger = logging.getLogger(__name__)

hooks = PipelineHooks(
    # Log each source before it is read.
    pre_read_hook=lambda src: logger.info("Reading: %s", src),
    # Keep only documents whose text is longer than 50 characters.
    post_read_hook=lambda src, docs: [d for d in docs if len(d.text) > 50],
)
pipeline = HookableCorpusPipeline(hooks=hooks, output_path=Path("out/"))
result = pipeline.run(Path("corpus.pdf"))

run(input_path, *, output_path=None, export_format=None, filename_override=None)

Process a single source with lifecycle hooks applied.

Parameters:

input_path (str or pathlib.Path): Path or URL.
output_path (pathlib.Path or None, optional)
export_format (ExportFormat or None, optional)
filename_override (str or None, optional)

Returns:

PipelineResult

Parameters:

input_path (str | pathlib.Path)
output_path (Any | None)
export_format (Any | None)
filename_override (str | None)

Return type:

Any

run_batch(input_files, *, stop_on_error=False, export_format=None)

Process multiple sources with hooks applied to each.

Parameters:

input_files (list[pathlib.Path or str])
stop_on_error (bool, optional)
export_format (ExportFormat or None, optional)

Returns:

list[PipelineResult]

Parameters:

input_files (list[Any])
stop_on_error (bool)
export_format (Any | None)

Return type:

list[Any]

run_url(url, *, output_path=None, export_format=None, stop_on_error=False)

Process one URL or a list of URLs with hooks applied.

Parameters:

url (str or list[str])
output_path (pathlib.Path or None, optional)
export_format (ExportFormat or None, optional)
stop_on_error (bool, optional)

Returns:

PipelineResult or list[PipelineResult]

Parameters:

url (Any)
output_path (Any | None)
export_format (Any | None)
stop_on_error (bool)

Return type:

Any
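
The documented return type suggests that the result shape follows the input shape: presumably one result for a single URL and a list for a list of URLs. A minimal sketch of that dispatch pattern follows. The function run_one_or_many and the process_one argument are hypothetical stand-ins for the per-URL work; this is not the library's code.

```python
from typing import Callable, List, TypeVar, Union

T = TypeVar("T")


def run_one_or_many(
    url: Union[str, List[str]],
    process_one: Callable[[str], T],
) -> Union[T, List[T]]:
    """Return one result for a single URL, or a list for a list of URLs."""
    if isinstance(url, str):
        return process_one(url)
    return [process_one(u) for u in url]


# ``len`` stands in for real per-URL processing.
single = run_one_or_many("https://example.com/a", len)
many = run_one_or_many(
    ["https://example.com/a", "https://example.com/bb"], len
)
```

Callers that always want a list can wrap a single result themselves, for example with results = many if isinstance(many, list) else [many].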