HookableCorpusPipeline
- class scikitplot.corpus.HookableCorpusPipeline(hooks=None, chunker=None, filter_=None, embedding_engine=None, output_dir=None, export_format=None, default_language=None, progress_callback=None, reader_kwargs=None)
CorpusPipeline extended with per-stage lifecycle hooks. Accepts all the same constructor parameters as CorpusPipeline plus a PipelineHooks instance. It is a drop-in replacement: the public interface (run, run_batch, run_url) is identical.
- Parameters:
- hooks : PipelineHooks or None, optional
Lifecycle callbacks. None disables all hooks (identical behaviour to bare CorpusPipeline).
- chunker : ChunkerBase or None, optional
Chunker to inject into every reader.
- filter_ : FilterBase or None, optional
Filter applied after chunking.
- embedding_engine : EmbeddingEngine or None, optional
Embedding backend.
- output_dir : pathlib.Path or None, optional
Output directory for exports.
- export_format : ExportFormat or None, optional
Default export format.
- default_language : str or None, optional
ISO 639-1 language code.
- progress_callback : callable or None, optional
Progress notification callback.
- reader_kwargs : dict or None, optional
Extra kwargs forwarded to each reader.
Notes
User note: All hooks are called with a try/except guard, so a hook that raises does not abort the pipeline.
Developer note: Hook injection points:
- pre_read_hook → called at the start of _run_source.
- post_read_hook → called after _collect_documents; may return a modified list (a None return leaves the list unchanged).
- post_filter_hook → installed as an additional post-filter on the FilterBase passed to the reader via a _CompositeHookFilter wrapper.
- post_embed_hook → called after _embed_documents.
- pre_export_hook → called inside _export before writing.
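The guard and the None-return rule described above can be pictured with a small standalone sketch. This is an illustration only, not the library's internal code: call_hook_safely and apply_post_read are hypothetical helper names, and plain strings stand in for document objects.

```python
import logging
from typing import Any, Callable, List, Optional

logger = logging.getLogger(__name__)


def call_hook_safely(hook: Optional[Callable[..., Any]], *args: Any) -> Any:
    """Invoke ``hook`` if one is set; log and swallow any exception it raises."""
    if hook is None:
        return None
    try:
        return hook(*args)
    except Exception:
        # A failing hook is logged but never aborts the caller.
        logger.exception("Hook %r raised; continuing", hook)
        return None


def apply_post_read(hook, source: str, docs: List[str]) -> List[str]:
    """Return the hook's list, or the original ``docs`` when it returns None."""
    result = call_hook_safely(hook, source, docs)
    return docs if result is None else result


def failing_hook(source, documents):
    """Hypothetical hook that always fails, used to exercise the guard."""
    raise RuntimeError("boom")


docs = ["short", "a considerably longer document body"]

# A hook that filters: its returned list replaces the original.
kept = apply_post_read(lambda s, d: [x for x in d if len(x) > 10], "a.txt", docs)
# A hook that returns None: the original list passes through untouched.
unchanged = apply_post_read(lambda s, d: None, "a.txt", docs)
# A hook that raises: the error is logged and the original list survives.
survived = apply_post_read(failing_hook, "a.txt", docs)
```

Treating an exception the same way as a None return keeps the rule simple: a hook can only change the document list by successfully returning a new one.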
Examples
import logging
from pathlib import Path

from scikitplot.corpus._custom_hooks import (
    HookableCorpusPipeline,
    PipelineHooks,
)

logger = logging.getLogger(__name__)

hooks = PipelineHooks(
    # Log each source before it is read.
    pre_read_hook=lambda src: logger.info("Reading: %s", src),
    # Keep only documents whose text is longer than 50 characters.
    post_read_hook=lambda src, docs: [d for d in docs if len(d.text) > 50],
)
pipeline = HookableCorpusPipeline(hooks=hooks, output_dir=Path("out/"))
result = pipeline.run(Path("corpus.pdf"))
- run(input_file, *, output_path=None, export_format=None, filename_override=None)
Process a single source with lifecycle hooks applied.
- Parameters:
- input_file : pathlib.Path or str
Path or URL.
- output_path : pathlib.Path or None, optional
output_path.
- export_format : ExportFormat or None, optional
export_format.
- filename_override : str or None, optional
filename_override.
- Returns:
- PipelineResult
- run_batch(input_files, *, stop_on_error=False, export_format=None)
Process multiple sources with hooks applied to each.
- run_url(url, *, output_path=None, export_format=None, stop_on_error=False)
Process one URL or a list of URLs with hooks applied.
- Parameters:
- url : str or list[str]
url.
- output_path : pathlib.Path or None, optional
output_path.
- export_format : ExportFormat or None, optional
export_format.
- stop_on_error : bool, optional
stop_on_error.
- Returns:
- PipelineResult or list[PipelineResult]
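The signature suggests that a single URL yields a single result and a list of URLs yields a list of results. The sketch below shows one way such a dispatch could work. It makes several assumptions: run_each and fake_process are hypothetical names, strings stand in for PipelineResult, and the meaning given to stop_on_error (re-raise on the first failure when True, skip the failed item otherwise) is a guess that the documentation above does not confirm.

```python
from typing import Callable, List, Union


def run_each(
    url: Union[str, List[str]],
    process: Callable[[str], str],
    stop_on_error: bool = False,
) -> Union[str, List[str]]:
    """Apply ``process`` to one URL, or to each URL in a list."""
    if isinstance(url, str):
        # Single URL in, single result out.
        return process(url)
    results: List[str] = []
    for item in url:
        try:
            results.append(process(item))
        except Exception:
            if stop_on_error:
                # Assumed meaning: abort the whole batch on the first failure.
                raise
            # Assumed meaning: skip the failed URL and continue with the rest.
    return results


def fake_process(u: str) -> str:
    """Hypothetical stand-in for the real per-URL processing step."""
    if "bad" in u:
        raise ValueError(u)
    return u.upper()


single = run_each("http://a", fake_process)
many = run_each(["http://a", "http://bad", "http://b"], fake_process)
```

Whether the real pipeline skips failed URLs or records them as failed results is not stated here, so check the return value of run_url before relying on its length matching the input list.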