CorpusPipeline#

class scikitplot.corpus.CorpusPipeline(chunker=None, filter_=None, embedding_engine=None, output_dir=None, export_format=ExportFormat.CSV, normalizer=None, enricher=None, default_language=None, progress_callback=None, reader_kwargs=None)[source]#

Orchestrates the full corpus ingestion pipeline.

Instantiate once, then call run (single file), run_batch (multiple files), or run_url (URL source) any number of times. The pipeline is stateless between calls; all configuration is set at construction time.

Parameters:
chunker : ChunkerBase or None, optional

Chunker to inject into every reader. None yields one CorpusDocument per raw chunk. Default: None.

filter_ : FilterBase or None, optional

Filter applied after chunking. None uses DefaultFilter. Default: None.

embedding_engine : EmbeddingEngine or None, optional

When provided, documents are embedded in batches after chunking/filtering. Embeddings are stored in embedding. Default: None (no embedding).

output_dir : pathlib.Path or None, optional

Directory where exported files are written. When None, export is skipped unless output_path is supplied explicitly in a run call. Default: None.

export_format : ExportFormat or None, optional

Default export format. Individual run calls can override. Default: CSV.

progress_callback : callable or None, optional

Called after each batch of documents is processed. Signature: (source: str, n_done: int, n_total_estimate: int) -> None. n_total_estimate is -1 when the total is unknown. Default: None.
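As an illustration, a hypothetical callback matching this signature (not part of the library) might report per-batch progress like so:

```python
def report_progress(source: str, n_done: int, n_total_estimate: int) -> None:
    """Print a one-line progress update after each processed batch."""
    if n_total_estimate == -1:
        # Total is unknown, e.g. for streaming sources.
        print(f"{source}: {n_done} documents so far")
    else:
        print(f"{source}: {n_done}/{n_total_estimate} documents")

# pipeline = CorpusPipeline(progress_callback=report_progress)
```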

normalizer : TextNormalizer or None, optional

When provided, normalized_text is populated on every document after chunking/filtering and before embedding. Insert between the filter and embedding stages to clean OCR noise, collapse whitespace, fix ligatures, and remove other artefacts. Default: None (skip).

enricher : NLPEnricher or None, optional

When provided, NLP enrichment fields (tokens, lemmas, stems, keywords, and optional metadata such as pos_tags, ner_entities, sentence_count, char_count, type_token_ratio, token_scores) are populated on every document after normalisation and before embedding. Supports 200+ world languages via the language parameter of EnricherConfig. Default: None (skip).

default_language : str or list[str] or None, optional

Language hint applied to all documents when the reader cannot detect language. Accepts ISO 639-1 two-letter codes ("en", "ar"), NLTK language names ("english", "arabic"), lists of codes (["en", "ar"]), or None (auto-detect per document via detect_script). Forwarded to the reader; the enricher uses its own language config when set. Default: None.

reader_kwargs : dict or None, optional

Extra keyword arguments forwarded to every reader constructed by this pipeline — both create (used by run and run_batch) and from_url (used by run_url). Default: None.

Audio / video URL transcription — forward Whisper kwargs directly so run_url on an .mp3 URL transcribes it:

pipeline = CorpusPipeline(
    reader_kwargs={
        "transcribe": True,
        "whisper_model": "small",  # "tiny" / "base" / "medium" / "large"
    },
)
result = pipeline.run_url("https://archive.org/details/.../episode.mp3")

ZIP archive with per-extension overrides — when the source is a .zip file, reader_kwargs is forwarded to ZipReader. Pass a nested "reader_kwargs" key to control individual member types:

pipeline = CorpusPipeline(
    reader_kwargs={
        "reader_kwargs": {
            ".mp3": {"transcribe": True, "whisper_model": "small"},
            ".jpg": {"backend": "easyocr"},
        },
    },
)
result = pipeline.run(Path("WHO-EURO-2025.zip"))

Single-type files — for a pipeline that only processes audio files (no ZIP), pass the kwargs flat:

pipeline = CorpusPipeline(
    reader_kwargs={"transcribe": True, "whisper_model": "base"},
)
result = pipeline.run(Path("podcast.mp3"))
Attributes:
chunker : ChunkerBase or None
filter_ : FilterBase or None
embedding_engine : EmbeddingEngine or None
output_dir : pathlib.Path or None
export_format : ExportFormat or None
default_language : str or list[str] or None

See also

scikitplot.corpus._export.export_documents

Low-level export function.

scikitplot.corpus._embeddings.EmbeddingEngine

Embedding backend.

Notes

Thread safety: CorpusPipeline is not thread-safe. Run one instance per thread, or use run_batch (which processes files sequentially, not in parallel).

Embedding and caching: When embedding_engine is provided, embeddings are cached to disk using the source file path and mtime as the cache key. URL sources disable caching (no stable mtime).
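A minimal sketch of a path-plus-mtime cache key (illustrative only; the library's actual key format is not specified here):

```python
import hashlib
from pathlib import Path

def cache_key(path: Path) -> str:
    """Derive an embedding-cache key from a file path and its mtime.

    Any edit to the file bumps the mtime, so stale embeddings are never
    reused. URL sources have no stable mtime, which is why caching is
    disabled for them.
    """
    stat = path.stat()
    raw = f"{path.resolve()}:{stat.st_mtime_ns}"
    return hashlib.sha256(raw.encode()).hexdigest()
```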

Examples

Basic single-file run:

>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import CorpusPipeline
>>> from scikitplot.corpus._chunkers import SentenceChunker
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     output_dir=Path("output/"),
... )
>>> result = pipeline.run(Path("corpus.txt"))
>>> print(result)

Batch processing with embeddings:

>>> from scikitplot.corpus._embeddings import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     embedding_engine=engine,
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))

URL ingestion:

>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")

Audio URL transcription via reader_kwargs:

>>> pipeline = CorpusPipeline(
...     reader_kwargs={"transcribe": True, "whisper_model": "small"},
...     output_dir=Path("output/"),
... )
>>> result = pipeline.run_url(
...     "https://archive.org/details/tale_two_cities_librivox/"
...     "tale_of_two_cities_01_dickens.mp3"
... )

ZIP archive with per-extension kwargs:

>>> pipeline = CorpusPipeline(
...     reader_kwargs={
...         "reader_kwargs": {
...             ".mp3": {"transcribe": True, "whisper_model": "small"},
...             ".jpg": {"backend": "easyocr"},
...         },
...     },
...     output_dir=Path("output/"),
... )
>>> result = pipeline.run(Path("WHO-EURO-2025.zip"))
run(input_file, *, output_path=None, export_format=None, filename_override=None)[source]#

Process a single source and return a PipelineResult.

Accepts a local file path or an http(s):// URL string. URL detection is performed before any pathlib.Path conversion, so passing a URL string routes correctly to the web/YouTube/audio reader rather than crashing with a “file not found” error.

Parameters:
input_file : pathlib.Path or str

Path to a local file or an http(s):// URL string. A str that starts with http:// or https:// (case-insensitive) is treated as a URL and routed through from_url; all other values are treated as local file paths and dispatched by extension via the reader registry.

output_path : pathlib.Path or None, optional

Explicit output file path. When None, the path is derived from output_dir and the input stem. If both are None, export is skipped.

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for this call.

filename_override : str or None, optional

Override the source_file label in generated documents. Ignored for URL sources.

Returns:
PipelineResult

Result summary including the document list.

Raises:
TypeError

If input_file is not a str or pathlib.Path.

ValueError

If a local file path does not exist, or no reader is registered for the file extension.

ValueError

If input_file is a URL string and the URL is invalid or cannot be resolved.

Return type:

PipelineResult

See also

run_batch

Process multiple sources (files and/or URLs).

run_url

Process one or more URLs directly (legacy entry point).

Examples

Local file:

>>> result = pipeline.run(Path("chapter01.txt"))
>>> len(result.documents)
312

URL string — no separate run_url call needed:

>>> result = pipeline.run("https://en.wikipedia.org/wiki/Python")
>>> result.source
'https://en.wikipedia.org/wiki/Python'
run_batch(input_files, *, stop_on_error=False, export_format=None)[source]#

Process multiple sources sequentially.

Each item may be a local file path or an http(s):// URL string. Mixed lists (some paths, some URLs) are fully supported. Each item is dispatched through _run_source, which tests for URL strings before any pathlib.Path conversion so that URL strings are never silently mangled.

Parameters:
input_files : list of pathlib.Path or str

Sources to process in order. Each element may be:

  • a pathlib.Path or str pointing to a local file, or

  • a str starting with http:// or https:// (a URL).

Mixed lists are allowed: [Path("paper.pdf"), "https://en.wikipedia.org/wiki/Python"].

stop_on_error : bool, optional

When False (default), errors on individual sources are logged as warnings and processing continues. When True, the first error is re-raised immediately.

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for all sources in this batch.

Returns:
list of PipelineResult

One result per successfully processed source, in input order. Failed sources (when stop_on_error=False) are omitted from the list and logged at WARNING level.

Raises:
TypeError

If any element of input_files is not a str or pathlib.Path.

ValueError

Re-raised from _run_source when stop_on_error=True and a source fails.

Return type:

list[PipelineResult]

See also

run

Process a single source (file or URL).

run_url

Process one or more URLs directly (legacy entry point).

Examples

Local files only (original behaviour, unchanged):

>>> paths = list(Path("corpus/").glob("*.txt"))
>>> results = pipeline.run_batch(paths)
>>> total_docs = sum(r.n_documents for r in results)

Mixed files and URLs:

>>> results = pipeline.run_batch(
...     [
...         Path("local_report.pdf"),
...         "https://en.wikipedia.org/wiki/Python",
...         "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
...     ]
... )
>>> [r.source for r in results]
['local_report.pdf', 'https://...', 'https://...']
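Because failed sources are omitted from the returned list, a caller can recover which inputs failed by comparing against the result.source labels. A usage sketch (assumes file results are labelled by file name and URL results by the URL string, as the example output above suggests):

```python
from pathlib import Path

def failed_sources(requested, results):
    """Return the requested sources that produced no PipelineResult."""
    done = {r.source for r in results}
    return [s for s in requested if _label(s) not in done]

def _label(source):
    # Assumed labelling: URLs keep the full string, files use the name.
    return source if isinstance(source, str) else Path(source).name
```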
run_url(url, *, output_path=None, export_format=None, stop_on_error=False)[source]#

Process one URL or a list of URLs.

Accepts a single URL string or a list of URL strings. When a list is passed each URL is processed independently and a parallel list of PipelineResult objects is returned. The single-URL form returns a single PipelineResult (backwards compatible).

Parameters:
url : str or list of str

One URL string or a list of URL strings. Every string must start with http:// or https://.

output_path : pathlib.Path or None, optional

Explicit output file path. Ignored when url is a list (each result derives its own path from the URL).

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for this call.

stop_on_error : bool, optional

When True and url is a list, re-raise the first exception encountered instead of continuing. Has no effect for single-URL calls (exceptions always propagate).

Returns:
PipelineResult

When url is a str.

list of PipelineResult

When url is a list. Results are in the same order as url. Failed URLs (when stop_on_error=False) are omitted from the list and logged at ERROR level.

Raises:
TypeError

If url is not a str or list.

ValueError

If any URL string does not start with http:// or https://.

ImportError

If scikitplot.corpus._readers has not been imported yet.

Return type:

PipelineResult | list[PipelineResult]

Examples

Single video:

>>> result = pipeline.run_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
>>> isinstance(result, PipelineResult)
True

List of URLs (returns list):

>>> results = pipeline.run_url(
...     [
...         "https://www.youtube.com/@WHO/shorts",
...         "https://www.youtube.com/@WHO/videos",
...     ]
... )
>>> isinstance(results, list)
True