scikitplot.corpus#

A production-grade document corpus ingestion, chunking, filtering, embedding, and export pipeline for NLP and ML workflows.

This package is a ground-up rewrite of the remarx.sentence.corpus module, preserving all proven design patterns while resolving every known correctness, robustness, and maintainability issue identified during the migration audit.

Standardized NLP/ML Workflow: Sourcing → Reading → Chunking → Filtering → Normalizing → Embedding → Exporting.

Examples

Single file, no embedding:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, ParagraphChunker
>>> pipeline = CorpusPipeline(chunker=ParagraphChunker())
>>> result = pipeline.run(Path("article.txt"))
>>> print(f"{result.n_documents} chunks from {result.source}")

Batch processing with sentence chunking:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, SentenceChunker, ExportFormat
>>> pipeline = CorpusPipeline(
...     # chunker=SentenceChunker(SentenceChunkerConfig(backend=SentenceBackend.NLTK)),
...     chunker=SentenceChunker("en_core_web_sm"),  # default backend: spaCy
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))

URL ingestion:

>>> # https://archive.org/download/WHO-documents
>>> # https://www.who.int/europe/news/item/...
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")

YouTube transcript:

>>> result = pipeline.run("https://www.youtube.com/watch?v=rwPISgZcYIk")

Image OCR:

>>> from pathlib import Path
>>> from scikitplot.corpus import DocumentReader
>>> reader = DocumentReader.create(Path("scan.png"))
>>> docs = list(reader.get_documents())

Video transcription (subtitle-first):

>>> # Richard Feynman - The Character of Physical Law (1964) - Complete - Better Audio
>>> # https://www.youtube.com/watch?v=kEx-gRfuhhk
>>> reader = DocumentReader.create(Path("lecture.mp4"))
>>> docs = list(reader.get_documents())

With embeddings:

>>> from scikitplot.corpus import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=ParagraphChunker(),
...     embedding_engine=engine,
... )
>>> result = pipeline.run(Path("article.txt"))
>>> result.documents[0].has_embedding
True

Convenience function (direct replacement for remarx create_corpus):

>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )

Full builder workflow with normalization, enrichment, embedding, and indexing:

>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(
...     BuilderConfig(
...         chunker="paragraph",
...         normalize=True,
...         enrich=True,
...         embed=True,
...         build_index=True,
...     )
... )
>>> result = builder.build("./data/")
>>> results = builder.search("quantum computing")
>>> lc_docs = builder.to_langchain()
>>> mcp_response = builder.to_mcp_tool_result("quantum computing")

User guide: see the Corpus Generation section for further details.

Adapter layer#

to_langchain_documents

Convert CorpusDocument instances to LangChain Document.

to_langgraph_state

Convert documents to a LangGraph-compatible state dict.

to_mcp_resources

Convert documents to MCP resources/read response format.

to_mcp_tool_result

Format documents as an MCP tools/call response.

to_huggingface_dataset

Convert documents to a HuggingFace Dataset.

to_rag_tuples

Convert documents to (text, metadata, embedding) tuples.

to_jsonl

Yield documents as newline-delimited JSON strings.

to_numpy_arrays

Convert documents to a dict of NumPy arrays suitable for batch ML.

to_tensorflow_dataset

Convert documents to a tf.data.Dataset.

to_torch_dataloader

Convert documents to a torch.utils.data.DataLoader.

LangChainCorpusRetriever

LangChain-compatible retriever backed by SimilarityIndex.

MCPCorpusServer

MCP server adapter for corpus search.
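The adapters above bridge document collections into external frameworks. As a standard-library sketch of the newline-delimited JSON shape that an adapter like to_jsonl yields (the document fields here are illustrative, not the exact CorpusDocument schema):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Doc:
    """Minimal stand-in for a corpus document (illustrative fields only)."""
    text: str
    source: str
    index: int


def to_jsonl(docs):
    """Yield each document as one newline-delimited JSON string."""
    for doc in docs:
        yield json.dumps(asdict(doc), ensure_ascii=False)


docs = [Doc("First chunk.", "article.txt", 0), Doc("Second chunk.", "article.txt", 1)]
lines = list(to_jsonl(docs))
```

One JSON object per line keeps the format append-friendly and streamable, which is why it recurs in the Storage section below as well.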

Archive-within-archive#

extract_archive

Extract an archive to a destination directory.

is_archive

Check if a file path has a supported archive extension.
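A rough standard-library sketch of what an extension check plus extraction helper looks like (the suffix set and signatures are illustrative, not the package's exact API):

```python
import shutil
import tempfile
import zipfile  # noqa: F401  (shutil dispatches to zipfile/tarfile internally)
from pathlib import Path

# Illustrative subset of supported archive suffixes.
ARCHIVE_SUFFIXES = {".zip", ".tar", ".gz", ".bz2", ".xz"}


def is_archive(path: Path) -> bool:
    """Check the file extension against a supported-archive set."""
    return path.suffix.lower() in ARCHIVE_SUFFIXES


def extract_archive(src: Path, dest: Path) -> Path:
    """Extract an archive into dest and return the destination directory."""
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(src), str(dest))
    return dest


# Round trip: build a zip in one temp dir, extract it into another.
src_dir = Path(tempfile.mkdtemp())
(src_dir / "a.txt").write_text("hello")
work = Path(tempfile.mkdtemp())
archive = Path(shutil.make_archive(str(work / "bundle"), "zip", root_dir=src_dir))
out = extract_archive(archive, work / "out")
```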

Base Classes#

ChunkerBase

Abstract base class for all text chunkers.

DefaultFilter

Standard noise filter ported and improved from remarx's include_sentence.

DocumentReader

Abstract base class for all format-specific document readers.

DummyReader

A no-op reader that validates source existence and accessibility.

FilterBase

Abstract base class for corpus document filters.

PipelineGuard

Wrap any document stream with resilience, deduplication, and checkpointing.

_MultiSourceReader

Chains multiple DocumentReader instances into one stream.

_is_url

Return True if s is a string that looks like an HTTP(S) URL.

Chunkers#

ChunkerBridge

Adapter that wraps a new-style chunker as a ChunkerBase-compatible object.

FixedWindowChunkerBridge

Bridge for the FixedWindowChunker → ChunkerBase contract.

ParagraphChunkerBridge

Bridge for the ParagraphChunker → ChunkerBase contract.

SentenceChunkerBridge

Bridge for the SentenceChunker → ChunkerBase contract.

WordChunkerBridge

Bridge for the WordChunker → ChunkerBase contract.

bridge_chunker

Wrap chunker in a bridge if it is a new-style chunker.

register_bridge

Register a custom bridge for a user-defined chunker class.

unregister_bridge

Remove a previously registered bridge for chunker_class.

TokenizerProtocol

Structural protocol for word tokenizers.

SentenceSplitterProtocol

Structural protocol for sentence segmenters.

StemmerProtocol

Structural protocol for word stemmers.

LemmatizerProtocol

Structural protocol for word lemmatizers.

FunctionTokenizer

Wrap any Callable[[str], list[str]] as a TokenizerProtocol.

FunctionSentenceSplitter

Wrap any Callable[[str], list[str]] as a SentenceSplitterProtocol.

FunctionStemmer

Wrap any Callable[[str], str] as a StemmerProtocol.

FunctionLemmatizer

Wrap any Callable[[str, Optional[str]], str] as a LemmatizerProtocol.

CustomTokenizerRegistry

Thread-safe(ish) module-level registry for named custom components.

register_tokenizer

Register a named TokenizerProtocol implementation.

get_tokenizer

Retrieve a registered tokenizer by name.

register_sentence_splitter

Register a named SentenceSplitterProtocol implementation.

get_sentence_splitter

Retrieve a registered sentence splitter by name.

register_stemmer

Register a named StemmerProtocol implementation.

get_stemmer

Retrieve a registered stemmer by name.

register_lemmatizer

Register a named LemmatizerProtocol implementation.

get_lemmatizer

Retrieve a registered lemmatizer by name.
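The protocol/wrapper/registry trio above follows a common pattern: a structural Protocol describes the required method, a Function* wrapper adapts plain callables, and a named registry makes components addressable from configuration. A minimal sketch of that pattern for tokenizers (the registry internals are illustrative, not the package's implementation):

```python
from typing import Callable, Protocol


class TokenizerProtocol(Protocol):
    """Structural protocol: anything with tokenize(text) -> list[str] conforms."""
    def tokenize(self, text: str) -> list[str]: ...


class FunctionTokenizer:
    """Adapt any Callable[[str], list[str]] to the tokenizer protocol."""
    def __init__(self, fn: Callable[[str], list[str]]) -> None:
        self._fn = fn

    def tokenize(self, text: str) -> list[str]:
        return self._fn(text)


_TOKENIZERS: dict[str, TokenizerProtocol] = {}


def register_tokenizer(name: str, tok: TokenizerProtocol) -> None:
    """Register a named tokenizer implementation."""
    _TOKENIZERS[name] = tok


def get_tokenizer(name: str) -> TokenizerProtocol:
    """Retrieve a registered tokenizer by name."""
    return _TOKENIZERS[name]


register_tokenizer("whitespace", FunctionTokenizer(str.split))
tokens = get_tokenizer("whitespace").tokenize("corpus pipeline demo")
```

Because the protocol is structural, user classes need no inheritance: any object with a matching `tokenize` method can be registered.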

ScriptType

Dominant Unicode script detected in a text sample.

detect_script

Detect the dominant Unicode script in text.

is_cjk_char

Return True if ch is a CJK / Japanese / Korean character.

is_rtl_char

Return True if ch belongs to a right-to-left script.

split_cjk_chars

Split text into individual CJK character tokens.
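CJK scripts have no word-delimiting spaces, so character-level splitting is a reasonable fallback. A hedged sketch of the idea using only unicodedata (a name-prefix heuristic, not the package's detection logic):

```python
import unicodedata


def is_cjk_char(ch: str) -> bool:
    """Rough heuristic: the character's Unicode name marks it as CJK,
    Hiragana, Katakana, or Hangul."""
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unnamed characters (e.g. some controls)
        return False
    return name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL"))


def split_cjk_chars(text: str) -> list[str]:
    """Emit each CJK character as its own token; keep other runs whole."""
    tokens: list[str] = []
    buf: list[str] = []
    for ch in text:
        if is_cjk_char(ch):
            if buf:
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens
```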

MULTI_SCRIPT_SENTENCE_RE_PATTERN

Module-level str constant: regular expression pattern for multi-script sentence segmentation.

FixedWindowChunker

Produce fixed-size sliding-window chunks over a document.

FixedWindowChunkerConfig

Configuration for FixedWindowChunker.

WindowUnit

Unit of measurement for window size and step.
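The fixed-window strategy slides a window of `size` units forward by `step` units, so consecutive chunks overlap by `size - step`. A minimal token-level sketch (the real chunker also supports other window units per WindowUnit; this illustrates only the windowing arithmetic):

```python
def fixed_windows(tokens: list[str], size: int, step: int) -> list[list[str]]:
    """Sliding windows of `size` tokens advancing by `step`; the final
    window may be shorter so no trailing tokens are dropped."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return windows


chunks = fixed_windows("a b c d e".split(), size=3, step=2)
# chunks == [["a", "b", "c"], ["c", "d", "e"]]
```

Choosing `step < size` gives overlapping chunks, which helps retrieval recall at the cost of index size.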

ISO_TO_NLTK

Mapping from ISO 639 language codes to canonical NLTK language names.

ISO_TO_NAME

Mapping from ISO 639 language codes to human-readable language names.

NLTK_TO_ISO

Mapping from canonical NLTK language names to primary ISO 639-1 codes.

NLTK_STOPWORD_LANGUAGES

Frozenset of language names with built-in NLTK stopword lists.

BUILTIN_LANG_STOPWORDS

coerce_language

Normalise any language specifier into a list of canonical NLTK names.

resolve_stopwords

Return a frozenset of stopwords for one or more languages.

iso_to_nltk

Resolve an ISO 639-1/639-3 code to a canonical NLTK language name.

nltk_to_iso

Resolve a canonical NLTK language name to its primary ISO 639-1 code.

ParagraphChunker

Split a document into paragraph-level Chunk objects.

ParagraphChunkerConfig

Configuration for ParagraphChunker.

SentenceBackend

Supported sentence-splitting backends.

SentenceChunker

Split a document into sentence-level Chunk objects.

SentenceChunkerConfig

Configuration for SentenceChunker.

LemmatizationBackend

Lemmatization backend.

StemmingBackend

Stemming algorithm.

StopwordSource

Stopword list source.

TokenizerBackend

Word tokenisation backend.

WordChunker

Process a document at word level, producing normalised token chunks.

WordChunkerConfig

Configuration for WordChunker.

Corpus Builder#

BuildResult

Result of a corpus build operation.

BuilderConfig

Configuration for CorpusBuilder.

CorpusBuilder

Unified corpus builder — end-to-end pipeline orchestrator.

Custom Hooks#

BuilderFactories

Component factory callables for FactoryCorpusBuilder.

CustomChunker

Wrap any callable as a ChunkerBase.

CustomEnricherConfig

Custom backend callables for CustomNLPEnricher.

CustomFilter

Wrap any callable as a FilterBase.

CustomNLPEnricher

NLPEnricher extended with fully-replaceable NLP backends.

CustomNormalizer

Wrap any callable as a NormalizerBase.

CustomSimilarityIndex

SimilarityIndex extended with a fully-replaceable custom scorer callable.

FactoryCorpusBuilder

CorpusBuilder extended with pluggable component factories.

HookableCorpusPipeline

CorpusPipeline extended with per-stage lifecycle hooks.

PipelineHooks

Lifecycle callbacks for HookableCorpusPipeline.

Embeddings#

DEFAULT_CACHE_DIR

Module-level Path constant: default on-disk cache directory for embeddings.

DEFAULT_MODEL

Module-level str constant: default embedding model identifier.

EmbeddingEngine

Multi-backend sentence embedding engine with SHA-256 file caching.
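SHA-256 file caching means each embedding is stored under a digest of its inputs, so repeated runs skip the expensive model call on a cache hit. A hedged sketch of the keying scheme (the file layout and the model name are illustrative, not the engine's actual cache format):

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(text: str, model: str) -> str:
    """Key embeddings by a SHA-256 digest of model name + text."""
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()


class FileCache:
    """Tiny JSON-per-entry file cache (illustrative layout)."""
    def __init__(self, root: Path) -> None:
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get(self, key: str):
        p = self.root / f"{key}.json"
        return json.loads(p.read_text()) if p.exists() else None

    def put(self, key: str, vector: list[float]) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(vector))


cache = FileCache(Path(tempfile.mkdtemp()))
key = cache_key("hello world", "all-MiniLM-L6-v2")  # model name is illustrative
if cache.get(key) is None:            # embed only on a cache miss
    cache.put(key, [0.1, 0.2, 0.3])   # stand-in for a real embedding call
vec = cache.get(key)
```

Including the model name in the digest ensures switching models invalidates the cache automatically.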

DEFAULT_AUDIO_MODEL

Module-level str constant: default audio embedding model identifier.

DEFAULT_IMAGE_MODEL

Module-level str constant: default image embedding model identifier.

DEFAULT_TEXT_MODEL

Module-level str constant: default text embedding model identifier.

LLMTrainingExporter

Export a corpus with embeddings to LLM training formats.

MultimodalEmbeddingEngine

Unified embedding engine for any CorpusDocument modality — text, image, audio, video, or multimodal.

Enricher#

BUILTIN_STOPWORDS

Frozenset of built-in stopwords used by the enricher.

EnricherConfig

Configuration for NLPEnricher.

NLPEnricher

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Export#

export_documents

Export a list of documents to output_path in the given format.

load_documents

Load CorpusDocument instances from a previously exported file.

Metadata#

CollectionManifest

Descriptor for a named corpus collection.

CorpusStats

Aggregate statistics over a CorpusDocument collection.

compute_stats

Compute aggregate statistics over a document collection.

provenance_from_filename

Extract provenance metadata from a source filename using heuristics.

Normalizers#

DedupLinesNormalizer

Remove exact duplicate lines while preserving first-occurrence order.

HTMLStripNormalizer

Remove HTML and XML tags from the document text.

LanguageDetectionNormalizer

Detect document language and set CorpusDocument.language.

LowercaseNormalizer

Convert the document text to lowercase.

NormalizationPipeline

Apply a sequence of normalisers in order.

NormalizerBase

Abstract base class for all text normalisers.

UnicodeNormalizer

Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).

WhitespaceNormalizer

Collapse runs of whitespace and optionally strip leading/trailing space.

NormalizerConfig

Configuration for TextNormalizer.

TextNormalizer

Pipeline component that populates normalized_text on CorpusDocument instances.

normalize_text

Normalise text according to config.
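The normalizers above compose into an ordered pipeline, each stage feeding its output to the next. A standard-library sketch of that composition (plain functions stand in for the NormalizerBase classes):

```python
import re
import unicodedata


def unicode_nfc(text: str) -> str:
    """Apply Unicode NFC normalisation."""
    return unicodedata.normalize("NFC", text)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text: str) -> str:
    return text.lower()


def normalization_pipeline(text, steps):
    """Apply each normaliser in order."""
    for step in steps:
        text = step(text)
    return text


out = normalization_pipeline(
    "  Hello\n\nWORLD  ",
    [unicode_nfc, collapse_whitespace, lowercase],
)
# out == "hello world"
```

Order matters: Unicode normalisation should generally run before case folding and whitespace collapsing so composed characters compare consistently.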

Pipeline#

CorpusPipeline

Orchestrates the full corpus ingestion pipeline.

PipelineResult

Immutable summary of a single pipeline run.

create_corpus

Create and export a corpus from a single source file.

Readers#

ALTOReader

ALTO XML reader for scanned document archives.

AudioReader

Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.

CustomReader

Fully user-customizable reader for any file extension and resource type.

normalize_extractor_output

Coerce an extractor return value to a list of raw chunk dicts.

ImageReader

OCR-based text extraction from raster image files.

PDFReader

PDF document reader with pdfminer.six → pypdf cascade.

MarkdownReader

Markdown document reader.

ReSTReader

reStructuredText document reader.

TextReader

Plain-text document reader.

VideoReader

Text extraction from video files via subtitle parsing and/or automatic speech recognition.

WebReader

Fetch a web page and extract structured text via BeautifulSoup.

YouTubeReader

Extract the transcript of a YouTube video using youtube-transcript-api.

TEIReader

TEI/XML document reader with dramatic structure extraction.

XMLReader

Generic XML document reader with configurable XPath.

ZipReader

Generic ZIP archive reader — dispatches each member to its natural reader.

Registry#

ComponentRegistry

Central look-up table for corpus pipeline components.

registry

Central look-up table for corpus pipeline components.

Similarity#

SearchConfig

Configuration for similarity search.

SearchResult

A single search result.

SimilarityIndex

Multi-mode similarity index over CorpusDocument collections.
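The core of embedding-based search is ranking stored vectors by cosine similarity to a query vector. A dependency-free sketch of that ranking step (SimilarityIndex supports multiple modes; this shows only the dense-vector case):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, index, k=2):
    """Rank (doc_id, vector) pairs by cosine similarity to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


index = [("d1", [1.0, 0.0]), ("d2", [0.0, 1.0]), ("d3", [0.7, 0.7])]
hits = top_k([1.0, 0.2], index, k=2)
```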

Schema#

SectionType

Semantic label for the role of a text chunk within its source document.

ChunkingStrategy

Describes how a CorpusDocument was segmented from raw text.

ExportFormat

Supported serialisation targets for a completed corpus.

SourceType

Semantic label for the kind of source from which a document was read.

MatchMode

Search mode for intertextual matching queries against a corpus index.

Modality

Primary content modality of a CorpusDocument.

ErrorPolicy

Per-document error handling strategy for PipelineGuard.

CorpusDocument

Canonical representation of a single text chunk in a processed corpus.

_PROMOTED_RAW_KEYS

Frozenset of raw metadata keys promoted to first-class CorpusDocument fields.

documents_to_pandas

Convert a list of CorpusDocument instances to a pandas.DataFrame.

documents_to_polars

Convert a list of CorpusDocument instances to a polars.DataFrame.

Source#

CorpusSource

Declarative descriptor for one or more document sources.

SourceEntry

A single resolved source entry yielded by CorpusSource.iter_entries.

SourceKind

Discriminant for the kind of source an entry represents.

Storage#

InMemoryStorage

Thread-safe in-memory dict store.

JSONLStorage

Append-friendly JSONL (newline-delimited JSON) flat-file store.

QueryResult

Result container returned by StorageBase.query.

SQLiteStorage

SQLite-backed corpus store with FTS5 full-text search.

StorageBase

Abstract base class for all corpus storage backends.

StorageQuery

Query parameters for StorageBase.query.
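An append-friendly JSONL store writes one JSON object per line, so new documents can be added without rewriting the file. A minimal sketch of that storage pattern (field names and method signatures are illustrative, not the JSONLStorage API):

```python
import json
import tempfile
from pathlib import Path


class JSONLStore:
    """Append-friendly flat-file store: one JSON object per line."""
    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, record: dict) -> None:
        """Append a single record; no rewrite of existing lines."""
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    def load(self) -> list[dict]:
        """Read all records back, skipping blank lines."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


store = JSONLStore(Path(tempfile.mkdtemp()) / "corpus.jsonl")
store.append({"id": "doc-1", "text": "first chunk"})
store.append({"id": "doc-2", "text": "second chunk"})
records = store.load()
```

The trade-off versus the SQLite backend is clear: JSONL is trivial to append and stream, while SQLite adds indexed queries and FTS5 full-text search.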

URL#

URLKind

Classification of a URL for routing to the correct handler.

classify_url

Classify a URL into one of the known URLKind categories.

download_url

Download a URL to a local file.

infer_extension

Infer a file extension from HTTP response headers and URL path.

probe_url_kind

Probe a URL with a HEAD request to classify by Content-Type.

resolve_url

Resolve a provider-specific URL to a direct-download URL.
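URL classification routes each input to the right reader: provider URLs (e.g. YouTube) go to transcript extraction, direct file links are downloaded, and everything else is fetched as a web page. A hedged sketch of such a routing heuristic (the category names and rules are illustrative, not the URLKind enum):

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Rough routing heuristic over host and path (illustrative categories)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if "youtube.com" in host or host == "youtu.be":
        return "youtube"
    if parsed.path.lower().endswith((".pdf", ".txt", ".zip")):
        return "file"
    return "webpage"


kind = classify_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
# kind == "youtube"
```

When the path gives no extension hint, probe_url_kind's HEAD-request strategy (classifying by Content-Type) is the natural fallback.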