CorpusPipeline

class scikitplot.corpus.CorpusPipeline(chunker=None, filter_=None, embedding_engine=None, output_dir=None, export_format=ExportFormat.CSV, default_language=None, progress_callback=None, reader_kwargs=None)

Orchestrates the full corpus ingestion pipeline.

Instantiate once, then call run (single file), run_batch (multiple files), or run_url (URL source) any number of times. The pipeline is stateless between calls; all configuration is set at construction time.

Parameters:
chunker : ChunkerBase or None, optional

Chunker to inject into every reader. None yields one CorpusDocument per raw chunk. Default: None.

filter_ : FilterBase or None, optional

Filter applied after chunking. None uses DefaultFilter. Default: None.

embedding_engine : EmbeddingEngine or None, optional

When provided, documents are embedded in batches after chunking and filtering. Each embedding is stored in the document's embedding attribute. Default: None (no embedding).

output_dir : pathlib.Path or None, optional

Directory where exported files are written. When None, export is skipped unless output_path is supplied explicitly in a run call. Default: None.

export_format : ExportFormat or None, optional

Default export format. Individual run calls can override it. Default: ExportFormat.CSV.

default_language : str or None, optional

ISO 639-1 language code applied to all documents when the reader cannot detect the language. Default: None.

progress_callback : callable or None, optional

Called after each batch of documents is processed. Signature: (source: str, n_done: int, n_total_estimate: int) -> None. n_total_estimate is -1 when the total is unknown. Default: None.

reader_kwargs : dict or None, optional

Extra keyword arguments forwarded to create. Default: None.

Attributes:
chunker : ChunkerBase or None
filter_ : FilterBase or None
embedding_engine : EmbeddingEngine or None
output_dir : pathlib.Path or None
export_format : ExportFormat or None
default_language : str or None
Parameters:
  • chunker (ChunkerBase | None)

  • filter_ (FilterBase | None)

  • embedding_engine (Any | None)

  • output_dir (pathlib.Path | None)

  • export_format (ExportFormat | None)

  • default_language (str | None)

  • progress_callback (Callable[[str, int, int], None] | None)

  • reader_kwargs (dict[str, Any] | None)

See also

scikitplot.corpus._export.export_documents

Low-level export function.

scikitplot.corpus._embeddings.EmbeddingEngine

Embedding backend.

Notes

Thread safety: CorpusPipeline is not thread-safe. Run one instance per thread, or use run_batch (which processes files sequentially, not in parallel).

Embedding and caching: When embedding_engine is provided, embeddings are cached to disk using the source file path and mtime as the cache key. URL sources disable caching (no stable mtime).
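
The cache-key idea can be illustrated with a minimal sketch. This is not the library's implementation: the helper name cache_key, the choice of SHA-256, and the use of the nanosecond mtime field are all assumptions made for illustration. It only shows why a key built from path and mtime changes whenever the file is modified, which is also why a URL source, having no mtime, cannot produce such a key.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_key(path: Path) -> str:
    """Hypothetical cache key: SHA-256 of the resolved path plus its mtime."""
    stat = path.stat()
    raw = f"{path.resolve()}|{stat.st_mtime_ns}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "corpus.txt"
    source.write_text("hello", encoding="utf-8")
    key_before = cache_key(source)

    # Simulate an edit by moving the modification time forward ten seconds.
    stat = source.stat()
    os.utime(source, ns=(stat.st_atime_ns, stat.st_mtime_ns + 10_000_000_000))
    key_after = cache_key(source)

# A changed mtime yields a different key, so stale embeddings are not reused.
print(key_before != key_after)
```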

Examples

Basic single-file run:

>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import CorpusPipeline
>>> from scikitplot.corpus._chunkers import SentenceChunker
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     output_dir=Path("output/"),
... )
>>> result = pipeline.run(Path("corpus.txt"))
>>> print(result)

Batch processing with embeddings:

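>>> from scikitplot.corpus._embeddings import EmbeddingEngine
>>> from scikitplot.corpus import ExportFormat  # import path assumed; adjust if ExportFormat lives elsewhere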
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     embedding_engine=engine,
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))

URL ingestion:

>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
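
Progress reporting (a sketch): the callable below matches the documented progress_callback signature (source, n_done, n_total_estimate) and handles the -1 sentinel for an unknown total. The factory name make_progress_logger and the message format are hypothetical; the example calls the callback by hand rather than through a pipeline so that it runs without scikitplot installed.

```python
from typing import Callable, List

ProgressCallback = Callable[[str, int, int], None]


def make_progress_logger(log: List[str]) -> ProgressCallback:
    """Build a callback that appends one human-readable line per batch."""

    def on_progress(source: str, n_done: int, n_total_estimate: int) -> None:
        if n_total_estimate < 0:
            # -1 signals that the total number of documents is unknown.
            log.append(f"{source}: {n_done} documents (total unknown)")
        else:
            pct = 100.0 * n_done / max(n_total_estimate, 1)
            log.append(f"{source}: {n_done}/{n_total_estimate} ({pct:.0f}%)")

    return on_progress


messages: List[str] = []
callback = make_progress_logger(messages)

# Manual invocations standing in for calls the pipeline would make.
callback("corpus.txt", 50, 200)
callback("https://example.org/page", 12, -1)
print("\n".join(messages))
```

Such a callable would be supplied at construction time as CorpusPipeline(progress_callback=callback).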

run(input_file, *, output_path=None, export_format=None, filename_override=None)

Process a single file and return a PipelineResult.

Parameters:
input_file : pathlib.Path or str

Path to the input file.

output_path : pathlib.Path or None, optional

Explicit output file path. When None, the path is derived from output_dir and the input stem. If both output_path and output_dir are None, export is skipped.

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for this call.

filename_override : str or None, optional

Override the source_file label in generated documents.

Returns:
PipelineResult

Result summary including the document list.

Raises:
ValueError

If the input file does not exist.

ValueError

If no reader is registered for the file extension.

Parameters:
  • input_file (Path | str)

  • output_path (Path | None)

  • export_format (ExportFormat | None)

  • filename_override (str | None)

Return type:

PipelineResult

Examples

>>> result = pipeline.run(Path("chapter01.txt"))
>>> len(result.documents)
312
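
The output-path rules for run can be summarized in a small sketch. The function resolve_output_path and its suffix argument are hypothetical and are not part of the library; in particular, how the real pipeline maps an ExportFormat to a file extension is not stated here, so the ".csv" default is only an assumption.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_output_path(
    input_file: PathLike,
    output_dir: Optional[PathLike] = None,
    output_path: Optional[PathLike] = None,
    suffix: str = ".csv",  # assumed extension for the chosen export format
) -> Optional[Path]:
    """Return where the export would be written, or None to skip export."""
    if output_path is not None:
        # An explicit path always wins.
        return Path(output_path)
    if output_dir is None:
        # Neither output_path nor output_dir is set: export is skipped.
        return None
    # Derive the file name from the directory and the input file's stem.
    return Path(output_dir) / (Path(input_file).stem + suffix)


derived = resolve_output_path("data/chapter01.txt", output_dir="output")
explicit = resolve_output_path("data/chapter01.txt", "output", "custom/out.csv")
skipped = resolve_output_path("data/chapter01.txt")
print(derived, explicit, skipped)
```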

run_batch(input_files, *, stop_on_error=False, export_format=None)

Process multiple files sequentially.

Parameters:
input_files : list of pathlib.Path or str

Paths to process in order.

stop_on_error : bool, optional

When False (default), errors on individual files are logged as warnings and processing continues. When True, the first error is re-raised immediately.

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for all files in this batch.

Returns:
list of PipelineResult

One result per successfully processed file. Failed files (when stop_on_error=False) are omitted from the list.

Parameters:
  • input_files (list[Path | str])

  • stop_on_error (bool)

  • export_format (ExportFormat | None)

Return type:

list[PipelineResult]

Examples

>>> paths = list(Path("corpus/").glob("*.txt"))
>>> results = pipeline.run_batch(paths)
>>> total_docs = sum(r.n_documents for r in results)
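
The stop_on_error behavior can be sketched with a generic loop. The names run_batch_sketch and fake_process are hypothetical stand-ins, not library functions; the sketch only mirrors the documented contract (log and continue, or re-raise the first error) under the assumption that failures surface as ordinary exceptions.

```python
import logging
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("corpus_batch_sketch")


def run_batch_sketch(
    items: Iterable[str],
    process_one: Callable[[str], T],
    stop_on_error: bool = False,
) -> List[T]:
    """Process items in order, skipping or re-raising failures."""
    results: List[T] = []
    for item in items:
        try:
            results.append(process_one(item))
        except Exception as exc:
            if stop_on_error:
                raise  # surface the first failure immediately
            logger.warning("Skipping %s: %s", item, exc)
    return results


def fake_process(name: str) -> int:
    """Stand-in for a per-file run; fails for an unsupported extension."""
    if name.endswith(".bad"):
        raise ValueError(f"no reader registered for {name}")
    return len(name)


results = run_batch_sketch(["a.txt", "b.bad", "cc.txt"], fake_process)
print(results)  # [5, 6]
```

Because failed files are omitted, the returned list can be shorter than the input list, so results should not be paired with inputs by position.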

run_url(url, *, output_path=None, export_format=None)

Process a URL source (web page or YouTube video).

Parameters:
url : str

Full URL string. Dispatched to WebReader or YouTubeReader via from_url.

output_path : pathlib.Path or None, optional

Explicit output file path. When None and output_dir is set, a filename is derived from the URL host/path.

export_format : ExportFormat or None, optional

Override the pipeline-level export_format for this call.

Returns:
PipelineResult
Raises:
ValueError

If url does not start with http:// or https://.

ImportError

If scikitplot.corpus._readers has not been imported yet.

Parameters:
  • url (str)

  • output_path (Path | None)

  • export_format (ExportFormat | None)

Return type:

PipelineResult

Examples

>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
>>> len(result.documents)
58
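
One way a file name could be derived from a URL's host and path is sketched below, together with the documented scheme check. The helper filename_from_url, its slug rules, and the ".csv" suffix are assumptions for illustration; the actual naming scheme used by run_url may differ.

```python
import re
from urllib.parse import urlparse


def filename_from_url(url: str, suffix: str = ".csv") -> str:
    """Hypothetical helper: turn a URL's host and path into a safe file name."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must start with http:// or https://, got {url!r}")
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (slug or "url_source") + suffix


name = filename_from_url("https://en.wikipedia.org/wiki/Python")
print(name)  # en_wikipedia_org_wiki_Python.csv
```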