scikitplot.corpus#
A production-grade document corpus ingestion, chunking, filtering, embedding, and export pipeline for NLP and ML workflows.
This package is a ground-up rewrite of the remarx.sentence.corpus
module, preserving all proven design patterns while resolving every known
correctness, robustness, and maintainability issue identified during the
migration audit.
Standardized NLP/ML Workflow:
Sourcing → Reading → Chunking → Filtering → Normalizing → Embedding → Exporting.
Examples
Single file, no embedding:
>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, ParagraphChunker
>>> pipeline = CorpusPipeline(chunker=ParagraphChunker())
>>> result = pipeline.run(Path("article.txt"))
>>> print(f"{result.n_documents} chunks from {result.source}")
Batch processing with sentence chunking:
>>> from scikitplot.corpus import CorpusPipeline, SentenceChunker, ExportFormat
>>> pipeline = CorpusPipeline(
... # NLTK alternative: SentenceChunker(SentenceChunkerConfig(backend=SentenceBackend.NLTK)),
... chunker=SentenceChunker("en_core_web_sm"),  # default backend: spaCy
... output_dir=Path("output/"),
... export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))
URL ingestion:
>>> # https://archive.org/download/WHO-documents
>>> # https://www.who.int/europe/news/item/...
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")
YouTube transcript:
>>> result = pipeline.run("https://www.youtube.com/watch?v=rwPISgZcYIk")
Image OCR:
>>> from scikitplot.corpus import DocumentReader
>>> reader = DocumentReader.create(Path("scan.png"))
>>> docs = list(reader.get_documents())
Video transcription (subtitle-first):
>>> # Richard Feynman - The Character of Physical Law (1964) - Complete - Better Audio
>>> # https://www.youtube.com/watch?v=kEx-gRfuhhk
>>> reader = DocumentReader.create(Path("lecture.mp4"))
>>> docs = list(reader.get_documents())
With embeddings:
>>> from scikitplot.corpus import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
... chunker=ParagraphChunker(),
... embedding_engine=engine,
... )
>>> result = pipeline.run(Path("article.txt"))
>>> result.documents[0].has_embedding
True
Convenience function (direct replacement for remarx create_corpus):
>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
... input_file=Path("chapter01.txt"),
... output_path=Path("output/chapter01.csv"),
... )
End-to-end building, search, and framework integration with CorpusBuilder:
>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(
... BuilderConfig(
... chunker="paragraph",
... normalize=True,
... enrich=True,
... embed=True,
... build_index=True,
... )
... )
>>> result = builder.build("./data/")
>>> results = builder.search("quantum computing")
>>> lc_docs = builder.to_langchain()
>>> mcp_response = builder.to_mcp_tool_result("quantum computing")
User guide: see the Corpus Generation section for further details.
Adapter layer#
Convert …
Convert documents to a LangGraph-compatible state dict.
Convert documents to MCP …
Format documents as an MCP …
Convert documents to a HuggingFace …
Convert documents to …
Yield documents as newline-delimited JSON strings.
Convert documents to a dict of NumPy arrays suitable for batch ML.
Convert documents to a …
Convert documents to a …
LangChain-compatible retriever backed by …
MCP server adapter for corpus search.
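As an illustration of what the newline-delimited JSON adapter yields, the same shape can be produced with the standard library (a sketch only; the adapter's real field set follows the corpus schema, and the "text"/"source" keys here are assumptions):
>>> import json
>>> records = [{"text": "First chunk.", "source": "article.txt"},
...            {"text": "Second chunk.", "source": "article.txt"}]
>>> print("\n".join(json.dumps(r) for r in records))
{"text": "First chunk.", "source": "article.txt"}
{"text": "Second chunk.", "source": "article.txt"}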
Archive-within-archive#
Extract an archive to a destination directory.
Check if a file path has a supported archive extension.
Base Classes#
Abstract base class for all text chunkers.
Standard noise filter ported and improved from remarx's …
Abstract base class for all format-specific document readers.
A no-op reader that validates source existence and accessibility.
Abstract base class for corpus document filters.
Wrap any document stream with resilience, deduplication, and checkpointing.
Chains multiple …
Return …
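A custom document filter only needs the predicate the abstract base defines. A minimal sketch, assuming a hypothetical keep(text) hook, since the base class's real name and method are elided in this table:
>>> class MinLengthFilter:
...     # Hypothetical stand-in illustrating the filter contract.
...     def __init__(self, min_chars=20):
...         self.min_chars = min_chars
...     def keep(self, text):
...         # Drop chunks too short to carry signal.
...         return len(text.strip()) >= self.min_chars
>>> MinLengthFilter().keep("Too short.")
False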
Chunkers#
Adapter that wraps a new-style chunker as a …
Bridge for …
Bridge for …
Bridge for …
Bridge for …
Wrap chunker in a bridge if it is a new-style chunker.
Register a custom bridge for a user-defined chunker class.
Remove a previously registered bridge for chunker_class.
Structural protocol for word tokenizers.
Structural protocol for sentence segmenters.
Structural protocol for word stemmers.
Structural protocol for word lemmatizers.
Wrap any …
Wrap any …
Wrap any …
Wrap any …
Thread-safe(ish) module-level registry for named custom components.
Register a named …
Retrieve a registered tokenizer by name.
Register a named …
Retrieve a registered sentence splitter by name.
Register a named …
Retrieve a registered stemmer by name.
Register a named …
Retrieve a registered lemmatizer by name.
Dominant Unicode script detected in a text sample.
Detect the dominant Unicode script in text.
Return …
Return …
Split text into individual CJK character tokens.
Produce fixed-size sliding-window chunks over a document.
Configuration for …
Unit of measurement for window size and step.
Normalise any language specifier into a list of canonical NLTK names.
Return a frozenset of stopwords for one or more languages.
Resolve an ISO 639-1/639-3 code to a canonical NLTK language name.
Resolve a canonical NLTK language name to its primary ISO 639-1 code.
Split a document into paragraph-level …
Configuration for …
Supported sentence-splitting backends.
Split a document into sentence-level …
Configuration for …
Lemmatization backend.
Stemming algorithm.
Stopword list source.
Word tokenisation backend.
Process a document at word level, producing normalised token chunks.
Configuration for …
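The sliding-window chunker's behaviour is easy to picture in plain Python (an illustration of the technique only, not the chunker's API; a window of 3 words with step 2 is an arbitrary choice, and keeping the partial tail window is an assumption):
>>> words = "one two three four five six".split()
>>> size, step = 3, 2
>>> [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
['one two three', 'three four five', 'five six']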
Corpus Builder#
Result of a corpus build operation.
Configuration for …
Unified corpus builder: end-to-end pipeline orchestrator.
Custom Hooks#
Component factory callables for …
Wrap any callable as a …
Custom backend callables for …
Wrap any callable as a …
Wrap any callable as a …
Lifecycle callbacks for …
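These wrappers accept plain callables, so a custom hook can be an ordinary function (a sketch of the expected shape only; the wrapper names are elided in this table, and the str -> list[str] signature is an assumption based on the tokenizer protocol above):
>>> def whitespace_tokenize(text):
...     # Simplest possible word tokenizer: split on runs of whitespace.
...     return text.split()
>>> whitespace_tokenize("hooks wrap plain callables")
['hooks', 'wrap', 'plain', 'callables']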
Embeddings#
Multi-backend sentence embedding engine with SHA-256 file caching.
Export a corpus with embeddings to LLM training formats.
Unified embedding engine for any …
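The engine's SHA-256 file cache keys embeddings by content hash, a scheme that can be illustrated with the standard library (an illustration of the technique, not the engine's internal cache layout):
>>> import hashlib
>>> text = "Corpus chunk to embed."
>>> cache_key = hashlib.sha256(text.encode("utf-8")).hexdigest()
>>> len(cache_key)  # hex digest of a 256-bit hash
64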
Enricher#
Configuration for …
Pipeline component that populates NLP enrichment fields on …
Export#
Export a list of documents to …
Load …
Metadata#
Descriptor for a named corpus collection.
Aggregate statistics over a …
Compute aggregate statistics over a document collection.
Extract provenance metadata from a source filename using heuristics.
Normalizers#
Remove exact duplicate lines while preserving first-occurrence order.
Remove HTML and XML tags from the document text.
Detect document language and set …
Convert the document text to lowercase.
Apply a sequence of normalisers in order.
Abstract base class for all text normalisers.
Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).
Collapse runs of whitespace and optionally strip leading/trailing space.
Configuration for …
Pipeline component that populates …
Normalise text according to config.
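The individual normalisers compose simple text transforms; their effects can be reproduced with the standard library (an illustration of what each step does, not the package's API):
>>> import re, unicodedata
>>> text = "  Caf\u00e9   <b>MENU</b>  "
>>> text = re.sub(r"<[^>]+>", "", text)        # strip HTML/XML tags
>>> text = unicodedata.normalize("NFC", text)  # Unicode normalisation
>>> text = text.lower()                        # lowercase
>>> re.sub(r"\s+", " ", text).strip()          # collapse whitespace
'café menu'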
Pipeline#
Orchestrates the full corpus ingestion pipeline.
Immutable summary of a single pipeline run.
Create and export a corpus from a single source file.
Readers#
ALTO XML reader for scanned document archives.
Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.
Fully user-customizable reader for any file extension and resource type.
Coerce an extractor return value to a list of raw chunk dicts.
OCR-based text extraction from raster image files.
PDF document reader with pdfminer.six → pypdf cascade.
Markdown document reader.
reStructuredText document reader.
Plain-text document reader.
Text extraction from video files via subtitle parsing and/or automatic speech recognition.
Fetch a web page and extract structured text via BeautifulSoup.
Extract the transcript of a YouTube video using …
TEI/XML document reader with dramatic structure extraction.
Generic XML document reader with configurable XPath.
Generic ZIP archive reader: dispatches each member to its natural reader.
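All readers share the factory shown in the Examples above: DocumentReader.create inspects the source and returns the matching reader (a sketch; the per-format reader class names are elided in this table):
>>> from pathlib import Path
>>> from scikitplot.corpus import DocumentReader
>>> for src in (Path("paper.pdf"), Path("scan.png"), Path("lecture.mp4")):
...     docs = list(DocumentReader.create(src).get_documents())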
Registry#
Central look-up table for corpus pipeline components.
Central look-up table for corpus pipeline components.
Similarity#
Configuration for similarity search.
A single search result.
Multi-mode similarity index over …
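Queries go through the builder's search method shown in the Examples above; each hit is a search-result record (a sketch: treating the return value as a ranked list is an assumption):
>>> hits = builder.search("quantum computing")
>>> best = hits[0]  # highest-ranked search result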
Schema#
Semantic label for the role of a text chunk within its source document.
Describes how a …
Supported serialisation targets for a completed corpus.
Semantic label for the kind of source from which a document was read.
Search mode for intertextual matching queries against a corpus index.
Primary content modality of a …
Per-document error handling strategy for …
Canonical representation of a single text chunk in a processed corpus.
Convert a list of …
Convert a list of …
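A document's enrichment state is exposed directly on the document objects, as in the embeddings example above (only has_embedding appears in this module's Examples; other field names would be assumptions):
>>> embedded = [d for d in result.documents if d.has_embedding]
>>> len(embedded) <= result.n_documents
True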
Source#
Declarative descriptor for one or more document sources.
A single resolved source entry yielded by …
Discriminant for the kind of source an entry represents.
Storage#
Thread-safe in-memory dict store.
Append-friendly JSONL (newline-delimited JSON) flat-file store.
Result container returned by …
SQLite-backed corpus store with FTS5 full-text search.
Abstract base class for all corpus storage backends.
Query parameters for …
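The SQLite store's full-text search rests on SQLite's FTS5 extension, which can be exercised directly with the standard library (an illustration of the underlying mechanism, not the store's actual schema):
>>> import sqlite3
>>> con = sqlite3.connect(":memory:")
>>> _ = con.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
>>> _ = con.executemany("INSERT INTO chunks VALUES (?)",
...                     [("quantum computing basics",), ("classical mechanics",)])
>>> con.execute("SELECT text FROM chunks WHERE chunks MATCH 'quantum'").fetchall()
[('quantum computing basics',)]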
URL#
Classification of a URL for routing to the correct handler.
Classify a URL into one of the known …
Download a URL to a local file.
Infer a file extension from HTTP response headers and URL path.
Probe a URL with a HEAD request to classify by Content-Type.
Resolve a provider-specific URL to a direct-download URL.
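The HEAD-probe classification can be pictured with the standard library (an illustration of the technique; the package's own probe may use different headers and timeouts):
>>> import urllib.request
>>> req = urllib.request.Request("https://example.com", method="HEAD")
>>> with urllib.request.urlopen(req) as resp:  # doctest: +SKIP
...     content_type = resp.headers.get("Content-Type", "")
>>> # e.g. 'text/html' routes to the web-page reader, 'application/pdf' to the PDF reader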