HookableCorpusPipeline

class scikitplot.corpus.HookableCorpusPipeline(hooks=None, chunker=None, filter_=None, embedding_engine=None, output_dir=None, export_format=None, default_language=None, progress_callback=None, reader_kwargs=None)

CorpusPipeline extended with per-stage lifecycle hooks.

Accepts all the same constructor parameters as CorpusPipeline plus a PipelineHooks instance. It is a drop-in replacement: the public interface (run, run_batch, run_url) is identical.

Parameters:
hooks : PipelineHooks or None, optional

Lifecycle callbacks. None disables all hooks, which makes the behaviour identical to a bare CorpusPipeline.

chunker : ChunkerBase or None, optional

Chunker to inject into every reader.

filter_ : FilterBase or None, optional

Filter applied after chunking.

embedding_engine : EmbeddingEngine or None, optional

Embedding backend.

output_dir : pathlib.Path or None, optional

Output directory for exports.

export_format : ExportFormat or None, optional

Default export format.

default_language : str or None, optional

ISO 639-1 language code.

progress_callback : Callable[[str, int, int], None] or None, optional

Progress notification callback.

reader_kwargs : dict[str, Any] or None, optional

Extra kwargs forwarded to each reader.
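The progress_callback parameter is typed as Callable[[str, int, int], None]. A minimal sketch of a compatible callable is shown here. The meaning of the three arguments (a label, a current count, and a total) is an assumption, since this page documents only their types, and make_progress_logger is a hypothetical helper, not part of the library.

```python
from typing import Callable, List

ProgressCallback = Callable[[str, int, int], None]


def make_progress_logger(sink: List[str]) -> ProgressCallback:
    """Return a callback that appends one formatted line per notification."""

    def on_progress(label: str, current: int, total: int) -> None:
        # ASSUMPTION: the arguments are (label, current, total); only the
        # types (str, int, int) are documented on this page.
        percent = 100.0 * current / total if total else 0.0
        sink.append(f"{label}: {current}/{total} ({percent:.0f}%)")

    return on_progress


messages: List[str] = []
callback = make_progress_logger(messages)
callback("corpus.pdf", 1, 4)
callback("corpus.pdf", 4, 4)
```

A callable of this shape could then be passed as progress_callback=callback when constructing the pipeline.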

Notes

User note: Every hook call is wrapped in a try/except guard, so a hook that raises does not abort the pipeline.
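The guard described in the user note can be pictured with a small stand-alone sketch. The helper name call_hook_safely is hypothetical, and logging the exception is an assumption about what the guard does with a failure; the library's internal implementation is not shown on this page.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def call_hook_safely(hook: Optional[Callable[..., Any]], *args: Any) -> Any:
    """Call ``hook`` and swallow any exception it raises (hypothetical helper)."""
    if hook is None:
        return None
    try:
        return hook(*args)
    except Exception:
        # Log and continue so that a faulty hook cannot abort processing.
        logger.exception("Hook %r raised; continuing", hook)
        return None


def failing_hook(source: str) -> None:
    raise RuntimeError(f"cannot handle {source}")


outcome = call_hook_safely(failing_hook, "corpus.pdf")  # exception is swallowed
doubled = call_hook_safely(lambda n: n * 2, 21)  # normal return value passes through
```

In this sketch a failing hook behaves like a hook that returned None, which matches the "None return = unchanged" convention listed for post_read_hook.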

Developer note: Hook injection points:

  • pre_read_hook → called at the start of _run_source.

  • post_read_hook → called after _collect_documents; may return a modified list (None return = unchanged).

  • post_filter_hook → installed as an additional post-filter on the FilterBase passed to the reader via a _CompositeHookFilter wrapper.

  • post_embed_hook → called after _embed_documents.

  • pre_export_hook → called inside _export before writing.
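Two of the injection points above can be sketched in isolation: the post_read_hook contract (a None return leaves the list unchanged) and the idea of wrapping a filter so that a hook runs after it. The names apply_post_read_hook and CompositeHookFilter are hypothetical stand-ins, and the filter interface used here (a callable returning a bool per document) is an assumption. The real _CompositeHookFilter and FilterBase interfaces are not shown on this page.

```python
from typing import Callable, List, Optional

Doc = str  # stand-in for the pipeline's document type


def apply_post_read_hook(
    hook: Optional[Callable[[str, List[Doc]], Optional[List[Doc]]]],
    source: str,
    docs: List[Doc],
) -> List[Doc]:
    """Return the hook's list, or the original list if the hook returns None."""
    if hook is None:
        return docs
    result = hook(source, docs)
    return docs if result is None else result


class CompositeHookFilter:
    """Run a base predicate first, then an extra hook predicate (assumed API)."""

    def __init__(
        self,
        base: Optional[Callable[[Doc], bool]],
        post_filter_hook: Callable[[Doc], bool],
    ) -> None:
        self.base = base
        self.post_filter_hook = post_filter_hook

    def __call__(self, doc: Doc) -> bool:
        # A document must pass the base filter (if any) before the hook sees it.
        if self.base is not None and not self.base(doc):
            return False
        return bool(self.post_filter_hook(doc))


docs = ["short", "a considerably longer document", "DRAFT: another long document"]

# A hook that only observes returns None, so the list is left unchanged.
unchanged = apply_post_read_hook(lambda src, ds: None, "corpus.pdf", docs)

# A hook that returns a list replaces the documents.
trimmed = apply_post_read_hook(
    lambda src, ds: [d for d in ds if len(d) > 10], "corpus.pdf", docs
)

keep = CompositeHookFilter(
    base=lambda d: len(d) > 10,
    post_filter_hook=lambda d: not d.startswith("DRAFT"),
)
kept = [d for d in docs if keep(d)]
```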

Examples

import logging
from pathlib import Path

from scikitplot.corpus._custom_hooks import (
    HookableCorpusPipeline,
    PipelineHooks,
)

logger = logging.getLogger(__name__)

hooks = PipelineHooks(
    # Log each source before it is read.
    pre_read_hook=lambda src: logger.info("Reading: %s", src),
    # Keep only documents whose text is longer than 50 characters.
    post_read_hook=lambda src, docs: [d for d in docs if len(d.text) > 50],
)
pipeline = HookableCorpusPipeline(hooks=hooks, output_dir=Path("out/"))
result = pipeline.run(Path("corpus.pdf"))

run(input_file, *, output_path=None, export_format=None, filename_override=None)

Process a single source with lifecycle hooks applied.

Parameters:
input_file : pathlib.Path or str

Path or URL of the source to process.

output_path : pathlib.Path or None, optional

Same meaning as in CorpusPipeline.run.

export_format : ExportFormat or None, optional

Same meaning as in CorpusPipeline.run.

filename_override : str or None, optional

Same meaning as in CorpusPipeline.run.

Returns:
PipelineResult

run_batch(input_files, *, stop_on_error=False, export_format=None)

Process multiple sources with hooks applied to each.

Parameters:
input_files : list[pathlib.Path or str]

Sources to process.

stop_on_error : bool, optional

Same meaning as in CorpusPipeline.run_batch. Defaults to False.

export_format : ExportFormat or None, optional

Same meaning as in CorpusPipeline.run_batch.

Returns:
list[PipelineResult]
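This page does not spell out what stop_on_error does. The sketch below shows one common reading of such a flag, under stated assumptions: with stop_on_error=False a failing source is skipped and the batch continues, and with stop_on_error=True the first failure is re-raised. The function run_batch_sketch and the process callable are hypothetical, and the real pipeline may record failures differently.

```python
from typing import Callable, List


def run_batch_sketch(
    sources: List[str],
    process: Callable[[str], str],
    *,
    stop_on_error: bool = False,
) -> List[str]:
    """Process each source; on failure either re-raise or skip (assumed)."""
    results: List[str] = []
    for source in sources:
        try:
            results.append(process(source))
        except Exception:
            if stop_on_error:
                raise  # abort the whole batch on the first failure
            # ASSUMPTION: failed sources are skipped; the real pipeline may
            # record them in its results instead.
            continue
    return results


def process(source: str) -> str:
    if source.endswith(".bad"):
        raise ValueError(f"unreadable: {source}")
    return source.upper()


tolerant = run_batch_sketch(["a.txt", "b.bad", "c.txt"], process)

try:
    run_batch_sketch(["a.txt", "b.bad", "c.txt"], process, stop_on_error=True)
    aborted = False
except ValueError:
    aborted = True
```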

run_url(url, *, output_path=None, export_format=None, stop_on_error=False)

Process one URL or a list of URLs with hooks applied.

Parameters:
url : str or list[str]

A single URL or a list of URLs to process.

output_path : pathlib.Path or None, optional

Same meaning as in CorpusPipeline.run_url.

export_format : ExportFormat or None, optional

Same meaning as in CorpusPipeline.run_url.

stop_on_error : bool, optional

Same meaning as in CorpusPipeline.run_url. Defaults to False.

Returns:
PipelineResult or list[PipelineResult]
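The two accepted input shapes (str or list[str]) and the two return shapes suggest a dispatch on the argument type. A minimal sketch of that pattern follows, assuming a single URL yields a single result and a list yields a list. The names run_url_sketch and fake_process are hypothetical, and plain dicts stand in for PipelineResult.

```python
from typing import Callable, List, Union


def run_url_sketch(
    url: Union[str, List[str]],
    process: Callable[[str], dict],
) -> Union[dict, List[dict]]:
    """Return one result for a str input and a list of results for a list input."""
    if isinstance(url, str):
        return process(url)
    return [process(u) for u in url]


def fake_process(u: str) -> dict:
    # Stand-in for real URL processing; no network access happens here.
    return {"source": u}


single = run_url_sketch("https://example.com/a", fake_process)
many = run_url_sketch(
    ["https://example.com/a", "https://example.com/b"], fake_process
)
```

Callers that accept either input shape may want to check isinstance(result, list) before iterating over the return value.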