CorpusPipeline
- class scikitplot.corpus.CorpusPipeline(chunker=None, filter_=None, embedding_engine=None, output_dir=None, export_format=ExportFormat.CSV, default_language=None, progress_callback=None, reader_kwargs=None)
Orchestrates the full corpus ingestion pipeline.
Instantiate once, then call run (single file), run_batch (multiple files), or run_url (URL source) any number of times. The pipeline is stateless between calls; all configuration is set at construction time.
- Parameters:
- chunker : ChunkerBase or None, optional
Chunker to inject into every reader. None yields one CorpusDocument per raw chunk. Default: None.
- filter_ : FilterBase or None, optional
Filter applied after chunking. None uses DefaultFilter. Default: None.
- embedding_engine : EmbeddingEngine or None, optional
When provided, documents are embedded in batches after chunking/filtering. Embeddings are stored in embedding. Default: None (no embedding).
- output_dir : pathlib.Path or None, optional
Directory where exported files are written. When None, export is skipped unless output_path is supplied explicitly in a run call. Default: None.
- export_format : ExportFormat or None, optional
Default export format. Individual run calls can override it. Default: CSV.
- default_language : str or None, optional
ISO 639-1 language code applied to all documents when the reader cannot detect the language. Default: None.
- progress_callback : callable or None, optional
Called after each batch of documents is processed. Signature: (source: str, n_done: int, n_total_estimate: int) → None. n_total_estimate is -1 when the total is unknown. Default: None.
- reader_kwargs : dict or None, optional
Extra keyword arguments forwarded to create. Default: None.
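As an illustration of the progress_callback signature described above, the following is a minimal sketch of a callable that could be passed as progress_callback. The helper name make_progress_logger and the message format are hypothetical and not part of scikitplot; only the three-argument signature and the -1 convention come from the parameter description.

```python
from typing import Callable, List


def make_progress_logger(sink: List[str]) -> Callable[[str, int, int], None]:
    """Build a callback with the (source, n_done, n_total_estimate) signature.

    Hypothetical helper: messages are appended to ``sink`` instead of printed
    so the behaviour is easy to inspect.
    """

    def on_progress(source: str, n_done: int, n_total_estimate: int) -> None:
        if n_total_estimate < 0:
            # -1 signals that the total number of documents is unknown.
            sink.append(f"{source}: {n_done} documents processed")
        else:
            pct = 100.0 * n_done / max(n_total_estimate, 1)
            sink.append(f"{source}: {n_done}/{n_total_estimate} ({pct:.0f}%)")

    return on_progress


messages: List[str] = []
callback = make_progress_logger(messages)

# Simulate two invocations: one with a known total, one without.
callback("corpus.txt", 50, 200)
callback("https://example.org/page", 12, -1)
```

A callable like this would be supplied at construction time, for example CorpusPipeline(progress_callback=callback).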
- Attributes:
- chunker : ChunkerBase or None
- filter_ : FilterBase or None
- embedding_engine : EmbeddingEngine or None
- output_dir : pathlib.Path or None
- export_format : ExportFormat or None
- default_language : str or None
- Parameters:
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
embedding_engine (Any | None)
output_dir (pathlib.Path | None)
export_format (ExportFormat | None)
default_language (str | None)
See also
scikitplot.corpus._export.export_documents : Low-level export function.
scikitplot.corpus._embeddings.EmbeddingEngine : Embedding backend.
Notes
Thread safety: CorpusPipeline is not thread-safe. Run one instance per thread, or use run_batch (which processes files sequentially, not in parallel).
Embedding and caching: When embedding_engine is provided, embeddings are cached to disk using the source file path and mtime as the cache key. URL sources disable caching (no stable mtime).
Examples
Basic single-file run:
>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import CorpusPipeline
>>> from scikitplot.corpus._chunkers import SentenceChunker
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     output_dir=Path("output/"),
... )
>>> result = pipeline.run(Path("corpus.txt"))
>>> print(result)
Batch processing with embeddings:
>>> from scikitplot.corpus._embeddings import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     embedding_engine=engine,
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))
URL ingestion:
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
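The caching rule in the Notes (source file path plus mtime as the cache key, no caching for URLs) could be sketched as follows. This is an assumption-laden illustration, not the library's implementation: the function name embedding_cache_key, the use of SHA-256, and the key layout are all hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


def embedding_cache_key(source: str) -> Optional[str]:
    """Hypothetical helper: derive a cache key from file path and mtime.

    Returns None for URL sources, which have no stable mtime.
    """
    if source.startswith(("http://", "https://")):
        return None
    path = Path(source).resolve()
    mtime_ns = path.stat().st_mtime_ns
    return hashlib.sha256(f"{path}|{mtime_ns}".encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "corpus.txt"
    sample.write_text("hello", encoding="utf-8")

    key_first = embedding_cache_key(str(sample))
    key_again = embedding_cache_key(str(sample))  # same path, same mtime

    # Move the modification time forward to simulate an edited file.
    st = sample.stat()
    os.utime(sample, ns=(st.st_atime_ns, st.st_mtime_ns + 10_000_000_000))
    key_touched = embedding_cache_key(str(sample))

url_key = embedding_cache_key("https://en.wikipedia.org/wiki/Python")
```

Under these assumptions an unchanged file maps to the same key, a modified file yields a new key (so stale embeddings are not reused), and URL sources yield no key at all.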
- run(input_file, *, output_path=None, export_format=None, filename_override=None)
Process a single file and return a PipelineResult.
- Parameters:
- input_file : pathlib.Path or str
Path to the input file.
- output_path : pathlib.Path or None, optional
Explicit output file path. When None, the path is derived from output_dir and the input stem. If both are None, export is skipped.
- export_format : ExportFormat or None, optional
Override the pipeline-level export_format for this call.
- filename_override : str or None, optional
Override the source_file label in generated documents.
- Returns:
- PipelineResult
Result summary including the document list.
- Raises:
- ValueError
If the input file does not exist.
- ValueError
If no reader is registered for the file extension.
Examples
>>> result = pipeline.run(Path("chapter01.txt"))
>>> len(result.documents)
312
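The output path rules for run (explicit output_path wins, otherwise output_dir plus the input stem, otherwise no export) can be sketched as below. The function resolve_output_path and the ".csv" suffix are hypothetical; the actual file naming used by the library is not stated in this page.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_output_path(
    input_file: PathLike,
    output_dir: Optional[PathLike] = None,
    output_path: Optional[PathLike] = None,
    suffix: str = ".csv",  # assumed extension for the default CSV format
) -> Optional[Path]:
    """Hypothetical helper mirroring the documented precedence rules."""
    if output_path is not None:
        # An explicit path always takes priority.
        return Path(output_path)
    if output_dir is None:
        # Neither output_path nor output_dir: export is skipped.
        return None
    # Derive the file name from the input stem.
    return Path(output_dir) / (Path(input_file).stem + suffix)


derived = resolve_output_path("data/chapter01.txt", output_dir="output")
explicit = resolve_output_path(
    "data/chapter01.txt", output_dir="output", output_path="custom/result.csv"
)
skipped = resolve_output_path("data/chapter01.txt")
```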
- run_batch(input_files, *, stop_on_error=False, export_format=None)
Process multiple files sequentially.
- Parameters:
- input_files : list of pathlib.Path or str
Paths to process in order.
- stop_on_error : bool, optional
When False (default), errors on individual files are logged as warnings and processing continues. When True, the first error is re-raised immediately.
- export_format : ExportFormat or None, optional
Override the pipeline-level export_format for all files in this batch.
- Returns:
- list of PipelineResult
One result per successfully processed file. Failed files (when
stop_on_error=False) are omitted from the list.
Examples
>>> paths = list(Path("corpus/").glob("*.txt"))
>>> results = pipeline.run_batch(paths)
>>> total_docs = sum(r.n_documents for r in results)
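The stop_on_error behaviour described for run_batch follows a common sequential pattern, sketched here with a stand-in processing function. The names run_batch_sketch and fake_process are hypothetical and only illustrate the documented semantics (log and skip versus re-raise); they are not the library's code.

```python
import logging
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("batch_sketch")


def run_batch_sketch(
    items: Iterable[str],
    process: Callable[[str], T],
    stop_on_error: bool = False,
) -> List[T]:
    """Process items in order; failed items are omitted unless stop_on_error."""
    results: List[T] = []
    for item in items:
        try:
            results.append(process(item))
        except Exception as exc:
            if stop_on_error:
                raise  # surface the first failure immediately
            logger.warning("Skipping %s: %s", item, exc)
    return results


def fake_process(name: str) -> int:
    """Stand-in for a single-file run: fails on an unsupported extension."""
    if name.endswith(".bad"):
        raise ValueError(f"no reader registered for {name}")
    return len(name)


# Lenient mode: the failing item is skipped and the rest are returned.
lenient = run_batch_sketch(["a.txt", "b.bad", "cc.txt"], fake_process)
```

Because failed files are dropped from the returned list, the result may be shorter than the input; callers that need to know which files failed would have to compare the two or use stop_on_error=True.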
- run_url(url, *, output_path=None, export_format=None)
Process a URL source (web page or YouTube video).
- Parameters:
- url : str
Full URL string. Dispatched to WebReader or YouTubeReader via from_url.
- output_path : pathlib.Path or None, optional
Explicit output file path. When None and output_dir is set, a filename is derived from the URL host/path.
- export_format : ExportFormat or None, optional
Override the pipeline-level export_format for this call.
- Returns:
- PipelineResult
- Raises:
- ValueError
If url does not start with http:// or https://.
- ImportError
If scikitplot.corpus._readers has not been imported yet.
Examples
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
>>> len(result.documents)
58
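The scheme check and the "filename derived from the URL host/path" rule for run_url might look roughly like the sketch below. Both helpers (validate_url, filename_from_url), the slug format, and the ".csv" suffix are hypothetical; the page does not specify how the library actually builds the filename.

```python
import re
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Raise ValueError unless the URL uses the http or https scheme."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must start with http:// or https://, got {url!r}")
    return url


def filename_from_url(url: str, suffix: str = ".csv") -> str:
    """Hypothetical helper: build a filesystem-safe name from host and path."""
    parts = urlparse(validate_url(url))
    raw = f"{parts.netloc}{parts.path}"
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (slug or "url") + suffix


name = filename_from_url("https://en.wikipedia.org/wiki/Python")
```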