scikitplot.corpus#
A production-grade document corpus ingestion, chunking, filtering, embedding, and export pipeline for NLP and ML workflows.
This package is a ground-up rewrite of the remarx.sentence.corpus
module, preserving all proven design patterns while resolving every known
correctness, robustness, and maintainability issue identified during the
migration audit.
Quick start#
Single file, no embedding:
>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, ParagraphChunker
>>> pipeline = CorpusPipeline(chunker=ParagraphChunker())
>>> result = pipeline.run(Path("article.txt"))
>>> print(f"{result.n_documents} chunks from {result.source}")
Batch processing with sentence chunking:
>>> from scikitplot.corpus import CorpusPipeline, SentenceChunker, ExportFormat
>>> pipeline = CorpusPipeline(
... chunker=SentenceChunker("en_core_web_sm"),
... output_dir=Path("output/"),
... export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))
URL ingestion:
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
YouTube transcript:
>>> result = pipeline.run_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
Image OCR:
>>> from scikitplot.corpus import DocumentReader
>>> reader = DocumentReader.create(Path("scan.png"))
>>> docs = list(reader.get_documents())
Video transcription (subtitle-first):
>>> reader = DocumentReader.create(Path("lecture.mp4"))
>>> docs = list(reader.get_documents())
With embeddings:
>>> from scikitplot.corpus import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
... chunker=ParagraphChunker(),
... embedding_engine=engine,
... )
>>> result = pipeline.run(Path("article.txt"))
>>> result.documents[0].has_embedding
True
Convenience function (direct replacement for remarx create_corpus):
>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
... input_file=Path("chapter01.txt"),
... output_path=Path("output/chapter01.csv"),
... )
High-level builder with search, LangChain, and MCP integration:
>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(
... BuilderConfig(
... chunker="paragraph",
... normalize=True,
... enrich=True,
... embed=True,
... build_index=True,
... )
... )
>>> result = builder.build("./data/")
>>> results = builder.search("quantum computing")
>>> lc_docs = builder.to_langchain()
>>> mcp_response = builder.to_mcp_tool_result("quantum computing")
Package structure#
scikitplot.corpus._schema – Core data types: CorpusDocument, SectionType, ChunkingStrategy, ExportFormat, SourceType, MatchMode.
scikitplot.corpus._base – Abstract bases: DocumentReader, ChunkerBase, FilterBase, DefaultFilter.
scikitplot.corpus._chunkers – SentenceChunker, ParagraphChunker, FixedWindowChunker.
scikitplot.corpus._readers – TextReader, MarkdownReader, ReSTReader, XMLReader, TEIReader, AudioReader, ALTOReader, PDFReader, ImageReader, VideoReader, WebReader, YouTubeReader.
scikitplot.corpus._embeddings – EmbeddingEngine: multi-backend embedding with disk cache.
scikitplot.corpus._pipeline – CorpusPipeline, PipelineResult, create_corpus.
scikitplot.corpus._export – export_documents, load_documents.
User guide: see the Corpus Generation section for further details.
Base#
ChunkerBase – Abstract base class for all text chunkers.
DefaultFilter – Standard noise filter ported and improved from the remarx implementation.
DocumentReader – Abstract base class for all format-specific document readers.
FilterBase – Abstract base class for corpus document filters.
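As a rough illustration of the contract these bases define, here is a minimal sketch. It is not the package's actual code: `Doc`, `ChunkerSketch`, `FilterSketch`, and `NoiseFilter` are invented stand-ins for CorpusDocument, ChunkerBase, FilterBase, and DefaultFilter.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Doc:
    # Stand-in for CorpusDocument: just the text payload.
    text: str


class ChunkerSketch(ABC):
    """Minimal analogue of ChunkerBase: turn one document into many chunks."""

    @abstractmethod
    def chunk(self, doc: Doc) -> Iterator[Doc]: ...


class FilterSketch(ABC):
    """Minimal analogue of FilterBase: decide whether a chunk is kept."""

    @abstractmethod
    def keep(self, doc: Doc) -> bool: ...


class NoiseFilter(FilterSketch):
    """Toy stand-in for DefaultFilter: drop near-empty chunks."""

    def __init__(self, min_chars: int = 3) -> None:
        self.min_chars = min_chars

    def keep(self, doc: Doc) -> bool:
        return len(doc.text.strip()) >= self.min_chars
```

Concrete chunkers and filters subclass the bases and are passed to the pipeline, which only depends on the abstract interface.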
Readers#
ALTOReader – ALTO XML reader for scanned document archives.
AudioReader – Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.
ImageReader – OCR-based text extraction from raster image files.
MarkdownReader – Markdown document reader.
PDFReader – PDF document reader with pdfminer.six → pypdf cascade.
ReSTReader – reStructuredText document reader.
TEIReader – TEI/XML document reader with dramatic structure extraction.
TextReader – Plain-text document reader.
VideoReader – Text extraction from video files via subtitle parsing and/or automatic speech recognition.
WebReader – Fetch a web page and extract structured text via BeautifulSoup.
XMLReader – Generic XML document reader with configurable XPath.
YouTubeReader – Extract the transcript of a YouTube video.
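The quick-start examples dispatch on file type via `DocumentReader.create`. A minimal sketch of such suffix-based dispatch follows; the table and `pick_reader` are illustrative assumptions, and the real factory may use different rules and cover more formats.

```python
from pathlib import Path

# Hypothetical suffix -> reader-name table; illustrative only.
READERS = {
    ".txt": "TextReader",
    ".md": "MarkdownReader",
    ".pdf": "PDFReader",
    ".png": "ImageReader",
    ".mp4": "VideoReader",
}


def pick_reader(path: Path) -> str:
    # Normalise the suffix so "scan.PNG" and "scan.png" resolve alike.
    try:
        return READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no reader registered for {path.suffix!r}")
```

A registry keyed on suffix keeps format support extensible: adding a reader means adding one table entry, not editing dispatch logic.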
Chunkers#
FixedWindowChunker – Produce fixed-size sliding-window chunks over a document.
ParagraphChunker – Split a document into paragraph-level chunks.
SentenceChunker – Split a document into sentence-level chunks.
Process a document at word level, producing normalised token chunks.
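To make the two simplest strategies concrete, here is a sketch of blank-line paragraph splitting and fixed-size sliding windows. Both functions are invented illustrations of the idea, not the package's chunker implementations.

```python
import re


def paragraph_chunks(text):
    # Blank-line splitting, roughly what a paragraph chunker does.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def fixed_window_chunks(words, size=5, overlap=2):
    # Sliding windows of `size` words; consecutive windows share
    # `overlap` words, so each window starts `size - overlap` words
    # after the previous one.
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        chunk = words[start:start + size]
        if chunk:
            yield " ".join(chunk)
```

Overlap trades storage for recall: duplicated words across windows mean a query phrase straddling a window boundary can still match some chunk.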
Normalizers#
Remove exact duplicate lines while preserving first-occurrence order.
Remove HTML and XML tags from the document text.
Detect the document language and set it on the document.
Convert the document text to lowercase.
Apply a sequence of normalisers in order.
Abstract base class for all text normalisers.
Configuration for the normalisation step.
Pipeline component that populates normalised text on each document.
Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).
Collapse runs of whitespace and optionally strip leading/trailing space.
Normalise text according to config.
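A minimal sketch of three of the steps above, composed in order. The function names and the `normalize` driver are invented for illustration; the package's normalisers are classes configured via the pipeline, not free functions.

```python
import re
import unicodedata


def nfc(text):
    # Unicode normalisation step (NFD/NFKC/NFKD are also options).
    return unicodedata.normalize("NFC", text)


def dedupe_lines(text):
    # Remove exact duplicate lines, keeping first occurrences in order.
    seen, kept = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def collapse_whitespace(text):
    # Collapse runs of whitespace and strip leading/trailing space.
    return re.sub(r"\s+", " ", text).strip()


def normalize(text, steps=(nfc, dedupe_lines, collapse_whitespace)):
    # Apply the configured steps in order, like a normaliser sequence.
    for step in steps:
        text = step(text)
    return text
```

Order matters: deduplication runs on raw lines before whitespace collapsing merges them, which is why composing normalisers as an explicit sequence is useful.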
Enricher#
Configuration for the enrichment step.
Pipeline component that populates NLP enrichment fields on each CorpusDocument.
Embeddings#
EmbeddingEngine – Multi-backend sentence embedding engine with SHA-256 file caching.
Similarity#
Configuration for similarity search.
A single search result.
Multi-mode similarity index over CorpusDocument collections.
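A bare-bones sketch of one search mode (cosine similarity over dense vectors). The `cosine` and `search` helpers are illustrative only; the package's index supports multiple match modes and returns richer result objects.

```python
import math


def cosine(a, b):
    # Cosine similarity of two equal-length vectors; 0.0 for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, corpus, top_k=3):
    # corpus: list of (doc_id, vector) pairs; best matches first.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```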
Pipeline#
CorpusPipeline – Orchestrates the full corpus ingestion pipeline.
PipelineResult – Immutable summary of a single pipeline run.
create_corpus – Create and export a corpus from a single source file.
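The orchestration can be pictured as a read → chunk → filter flow that ends in a result summary. The sketch below is a deliberately stripped-down stand-in: `RunResult` and `run_pipeline` are invented names, and embedding and export stages are omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunResult:
    # Stand-in for PipelineResult: immutable summary of one run.
    source: str
    n_documents: int
    documents: tuple = field(default_factory=tuple)


def run_pipeline(source_name, text, chunk, keep):
    # read -> chunk -> filter, mirroring the stages the pipeline
    # orchestrates (embedding and export left out of this sketch).
    chunks = tuple(c for c in chunk(text) if keep(c))
    return RunResult(source=source_name,
                     n_documents=len(chunks),
                     documents=chunks)
```

Keeping the result frozen means downstream consumers (exporters, reports) can share it without defensive copies, which may be why the real summary is documented as immutable.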