scikitplot.corpus#

A production-grade document corpus ingestion, chunking, filtering, embedding, and export pipeline for NLP and ML workflows.

This package is a ground-up rewrite of the remarx.sentence.corpus module, preserving all proven design patterns while resolving every known correctness, robustness, and maintainability issue identified during the migration audit.

Standardized NLP/ML Workflow: Sourcing → Reading → Chunking → Filtering → Normalizing → Embedding → Exporting.

Examples

Single file, no embedding:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, ParagraphChunker
>>> pipeline = CorpusPipeline(chunker=ParagraphChunker())
>>> result = pipeline.run(Path("article.txt"))
>>> print(f"{result.n_documents} chunks from {result.source}")

Batch processing with sentence chunking:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, SentenceChunker, ExportFormat
>>> pipeline = CorpusPipeline(
...     # chunker=SentenceChunker(SentenceChunkerConfig(backend=SentenceBackend.NLTK)),
...     chunker=SentenceChunker("en_core_web_sm"),  # default backend: spaCy
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))

URL ingestion:

>>> # https://archive.org/download/WHO-documents
>>> # https://www.who.int/europe/news/item/...
>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")

YouTube transcript:

>>> result = pipeline.run("https://www.youtube.com/watch?v=rwPISgZcYIk")

Image OCR:

>>> from pathlib import Path
>>> from scikitplot.corpus import DocumentReader
>>> reader = DocumentReader.create(Path("scan.png"))
>>> docs = list(reader.get_documents())

Video transcription (subtitle-first):

>>> # Richard Feynman - The Character of Physical Law (1964) - Complete - Better Audio
>>> # https://www.youtube.com/watch?v=kEx-gRfuhhk
>>> reader = DocumentReader.create(Path("lecture.mp4"))
>>> docs = list(reader.get_documents())

With embeddings:

>>> from scikitplot.corpus import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=ParagraphChunker(),
...     embedding_engine=engine,
... )
>>> result = pipeline.run(Path("article.txt"))
>>> result.documents[0].has_embedding
True

Convenience function (direct replacement for remarx create_corpus):

>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )

Full builder workflow with normalization, enrichment, embedding, and indexing:

>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(
...     BuilderConfig(
...         chunker="paragraph",
...         normalize=True,
...         enrich=True,
...         embed=True,
...         build_index=True,
...     )
... )
>>> result = builder.build("./data/")
>>> results = builder.search("quantum computing")
>>> lc_docs = builder.to_langchain()
>>> mcp_response = builder.to_mcp_tool_result("quantum computing")

User guide: see the Corpus Generation section for further details.

Adapter layer#

to_langchain_documents

Convert CorpusDocument instances to LangChain Document.

to_langgraph_state

Convert documents to a LangGraph-compatible state dict.

to_mcp_resources

Convert documents to MCP resources/read response format.

to_mcp_tool_result

Format documents as an MCP tools/call response.

to_huggingface_dataset

Convert documents to a HuggingFace Dataset.

to_rag_tuples

Convert documents to (text, metadata, embedding) tuples.

to_jsonl

Yield documents as newline-delimited JSON strings.

to_numpy_arrays

Convert documents to a dict of NumPy arrays suitable for batch ML.

to_tensorflow_dataset

Convert documents to a tf.data.Dataset.

to_torch_dataloader

Convert documents to a torch.utils.data.DataLoader.

LangChainCorpusRetriever

LangChain-compatible retriever backed by SimilarityIndex.

MCPCorpusServer

MCP server adapter for corpus search.
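The adapters above bridge document collections into external frameworks. As a standard-library sketch of the newline-delimited JSON shape that an adapter like to_jsonl yields (the document fields here are illustrative, not the exact CorpusDocument schema):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Doc:
    """Minimal stand-in for a corpus document (illustrative fields only)."""
    text: str
    source: str
    index: int


def to_jsonl(docs):
    """Yield each document as one newline-delimited JSON string."""
    for doc in docs:
        yield json.dumps(asdict(doc), ensure_ascii=False)


docs = [Doc("First chunk.", "article.txt", 0), Doc("Second chunk.", "article.txt", 1)]
lines = list(to_jsonl(docs))
```

One JSON object per line keeps the format append-friendly and streamable, which is why it recurs in the Storage section below as well.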

Archive-within-archive#

extract_archive

Extract an archive to a destination directory.

is_archive

Check if a file path has a supported archive extension.
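A rough standard-library sketch of what an extension check plus extraction helper looks like (the suffix set and signatures are illustrative, not the package's exact API):

```python
import shutil
import tempfile
import zipfile  # noqa: F401  (shutil dispatches to zipfile/tarfile internally)
from pathlib import Path

# Illustrative subset of supported archive suffixes.
ARCHIVE_SUFFIXES = {".zip", ".tar", ".gz", ".bz2", ".xz"}


def is_archive(path: Path) -> bool:
    """Check the file extension against a supported-archive set."""
    return path.suffix.lower() in ARCHIVE_SUFFIXES


def extract_archive(src: Path, dest: Path) -> Path:
    """Extract an archive into dest and return the destination directory."""
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(src), str(dest))
    return dest


# Round trip: build a zip in one temp dir, extract it into another.
src_dir = Path(tempfile.mkdtemp())
(src_dir / "a.txt").write_text("hello")
work = Path(tempfile.mkdtemp())
archive = Path(shutil.make_archive(str(work / "bundle"), "zip", root_dir=src_dir))
out = extract_archive(archive, work / "out")
```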

Base Classes#

ChunkerBase

Abstract base class for all text chunkers.

DefaultFilter

Standard noise filter ported and improved from remarx's include_sentence.

DocumentReader

Abstract base class for all format-specific document readers.

DummyReader

A no-op reader that validates source existence and accessibility.

FilterBase

Abstract base class for corpus document filters.

PipelineGuard

Wrap any document stream with resilience, deduplication, and checkpointing.

_MultiSourceReader

Chains multiple DocumentReader instances into one stream.

_is_url

Return True if s is a string that looks like an HTTP(S) URL.

Chunkers#

ChunkerBridge

Adapter that wraps a new-style chunker as a ChunkerBase-compatible object.

FixedWindowChunkerBridge

Bridge for the FixedWindowChunker → ChunkerBase contract.

ParagraphChunkerBridge

Bridge for the ParagraphChunker → ChunkerBase contract.

SentenceChunkerBridge

Bridge for the SentenceChunker → ChunkerBase contract.

WordChunkerBridge

Bridge for the WordChunker → ChunkerBase contract.

bridge_chunker

Wrap chunker in a bridge if it is a new-style chunker.

register_bridge

Register a custom bridge for a user-defined chunker class.

unregister_bridge

Remove a previously registered bridge for chunker_class.

TokenizerProtocol

Structural protocol for word tokenizers.

SentenceSplitterProtocol

Structural protocol for sentence segmenters.

StemmerProtocol

Structural protocol for word stemmers.

LemmatizerProtocol

Structural protocol for word lemmatizers.

FunctionTokenizer

Wrap any Callable[[str], list[str]] as a TokenizerProtocol.

FunctionSentenceSplitter

Wrap any Callable[[str], list[str]] as a SentenceSplitterProtocol.

FunctionStemmer

Wrap any Callable[[str], str] as a StemmerProtocol.

FunctionLemmatizer

Wrap any Callable[[str, Optional[str]], str] as a LemmatizerProtocol.

CustomTokenizerRegistry

Thread-safe(ish) module-level registry for named custom components.

register_tokenizer

Register a named TokenizerProtocol implementation.

get_tokenizer

Retrieve a registered tokenizer by name.

register_sentence_splitter

Register a named SentenceSplitterProtocol implementation.

get_sentence_splitter

Retrieve a registered sentence splitter by name.

register_stemmer

Register a named StemmerProtocol implementation.

get_stemmer

Retrieve a registered stemmer by name.

register_lemmatizer

Register a named LemmatizerProtocol implementation.

get_lemmatizer

Retrieve a registered lemmatizer by name.
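The protocol/wrapper/registry trio above follows a common pattern: a structural Protocol describes the required method, a Function* wrapper adapts plain callables, and a named registry makes components addressable from configuration. A minimal sketch of that pattern for tokenizers (the registry internals are illustrative, not the package's implementation):

```python
from typing import Callable, Protocol


class TokenizerProtocol(Protocol):
    """Structural protocol: anything with tokenize(text) -> list[str] conforms."""
    def tokenize(self, text: str) -> list[str]: ...


class FunctionTokenizer:
    """Adapt any Callable[[str], list[str]] to the tokenizer protocol."""
    def __init__(self, fn: Callable[[str], list[str]]) -> None:
        self._fn = fn

    def tokenize(self, text: str) -> list[str]:
        return self._fn(text)


_TOKENIZERS: dict[str, TokenizerProtocol] = {}


def register_tokenizer(name: str, tok: TokenizerProtocol) -> None:
    """Register a named tokenizer implementation."""
    _TOKENIZERS[name] = tok


def get_tokenizer(name: str) -> TokenizerProtocol:
    """Retrieve a registered tokenizer by name."""
    return _TOKENIZERS[name]


register_tokenizer("whitespace", FunctionTokenizer(str.split))
tokens = get_tokenizer("whitespace").tokenize("corpus pipeline demo")
```

Because the protocol is structural, user classes need no inheritance: any object with a matching `tokenize` method can be registered.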

ScriptType

Dominant Unicode script detected in a text sample.

detect_script

Detect the dominant Unicode script in text.

is_cjk_char

Return True if ch is a CJK / Japanese / Korean character.

is_rtl_char

Return True if ch belongs to a right-to-left script.

split_cjk_chars

Split text into individual CJK character tokens.
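CJK scripts have no word-delimiting spaces, so character-level splitting is a reasonable fallback. A hedged sketch of the idea using only unicodedata (a name-prefix heuristic, not the package's detection logic):

```python
import unicodedata


def is_cjk_char(ch: str) -> bool:
    """Rough heuristic: the character's Unicode name marks it as CJK,
    Hiragana, Katakana, or Hangul."""
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unnamed characters (e.g. some controls)
        return False
    return name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL"))


def split_cjk_chars(text: str) -> list[str]:
    """Emit each CJK character as its own token; keep other runs whole."""
    tokens: list[str] = []
    buf: list[str] = []
    for ch in text:
        if is_cjk_char(ch):
            if buf:
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens
```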

MULTI_SCRIPT_SENTENCE_RE_PATTERN

Module-level str constant: regular expression pattern for multi-script sentence segmentation.

FixedWindowChunker

Produce fixed-size sliding-window chunks over a document.

FixedWindowChunkerConfig

Configuration for FixedWindowChunker.

WindowUnit

Unit of measurement for window size and step.
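The fixed-window strategy slides a window of `size` units forward by `step` units, so consecutive chunks overlap by `size - step`. A minimal token-level sketch (the real chunker also supports other window units per WindowUnit; this illustrates only the windowing arithmetic):

```python
def fixed_windows(tokens: list[str], size: int, step: int) -> list[list[str]]:
    """Sliding windows of `size` tokens advancing by `step`; the final
    window may be shorter so no trailing tokens are dropped."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return windows


chunks = fixed_windows("a b c d e".split(), size=3, step=2)
# chunks == [["a", "b", "c"], ["c", "d", "e"]]
```

Choosing `step < size` gives overlapping chunks, which helps retrieval recall at the cost of index size.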

ISO_TO_NLTK

Mapping from ISO 639 language codes to canonical NLTK language names.

ISO_TO_NAME

Mapping from ISO 639 language codes to human-readable language names.

NLTK_TO_ISO

Mapping from canonical NLTK language names to primary ISO 639-1 codes.

NLTK_STOPWORD_LANGUAGES

Frozenset of language names with built-in NLTK stopword lists.

BUILTIN_LANG_STOPWORDS

coerce_language

Normalise any language specifier into a list of canonical NLTK names.

resolve_stopwords

Return a frozenset of stopwords for one or more languages.

iso_to_nltk

Resolve an ISO 639-1/639-3 code to a canonical NLTK language name.

nltk_to_iso

Resolve a canonical NLTK language name to its primary ISO 639-1 code.

ParagraphChunker

Split a document into paragraph-level Chunk objects.

ParagraphChunkerConfig

Configuration for ParagraphChunker.

SentenceBackend

Supported sentence-splitting backends.

SentenceChunker

Split a document into sentence-level Chunk objects.

SentenceChunkerConfig

Configuration for SentenceChunker.

LemmatizationBackend

Lemmatization backend.

StemmingBackend

Stemming algorithm.

StopwordSource

Stopword list source.

TokenizerBackend

Word tokenisation backend.

WordChunker

Process a document at word level, producing normalised token chunks.

WordChunkerConfig

Configuration for WordChunker.

Corpus Builder#

BuildResult

Result of a corpus build operation.

BuilderConfig

Configuration for CorpusBuilder.

CorpusBuilder

Unified corpus builder — end-to-end pipeline orchestrator.

Custom Hooks#

BuilderFactories

Component factory callables for FactoryCorpusBuilder.

CustomChunker

Wrap any callable as a ChunkerBase.

CustomEnricherConfig

Custom backend callables for CustomNLPEnricher.

CustomFilter

Wrap any callable as a FilterBase.

CustomNLPEnricher

NLPEnricher extended with fully-replaceable NLP backends.

CustomNormalizer

Wrap any callable as a NormalizerBase.

CustomSimilarityIndex

SimilarityIndex extended with a fully-replaceable custom scorer callable.

FactoryCorpusBuilder

CorpusBuilder extended with pluggable component factories.

HookableCorpusPipeline

CorpusPipeline extended with per-stage lifecycle hooks.

PipelineHooks

Lifecycle callbacks for HookableCorpusPipeline.

Embeddings#

DEFAULT_CACHE_DIR

Module-level Path constant: default on-disk cache directory for embeddings.

DEFAULT_MODEL

Module-level str constant: default embedding model identifier.

EmbeddingEngine

Multi-backend sentence embedding engine with SHA-256 file caching.
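SHA-256 file caching means each embedding is stored under a digest of its inputs, so repeated runs skip the expensive model call on a cache hit. A hedged sketch of the keying scheme (the file layout and the model name are illustrative, not the engine's actual cache format):

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(text: str, model: str) -> str:
    """Key embeddings by a SHA-256 digest of model name + text."""
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()


class FileCache:
    """Tiny JSON-per-entry file cache (illustrative layout)."""
    def __init__(self, root: Path) -> None:
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get(self, key: str):
        p = self.root / f"{key}.json"
        return json.loads(p.read_text()) if p.exists() else None

    def put(self, key: str, vector: list[float]) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(vector))


cache = FileCache(Path(tempfile.mkdtemp()))
key = cache_key("hello world", "all-MiniLM-L6-v2")  # model name is illustrative
if cache.get(key) is None:            # embed only on a cache miss
    cache.put(key, [0.1, 0.2, 0.3])   # stand-in for a real embedding call
vec = cache.get(key)
```

Including the model name in the digest ensures switching models invalidates the cache automatically.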

DEFAULT_AUDIO_MODEL

Module-level str constant: default audio embedding model identifier.

DEFAULT_IMAGE_MODEL

Module-level str constant: default image embedding model identifier.

DEFAULT_TEXT_MODEL

Module-level str constant: default text embedding model identifier.

LLMTrainingExporter

Export a corpus with embeddings to LLM training formats.

MultimodalEmbeddingEngine

Unified embedding engine for any CorpusDocument modality — text, image, audio, video, or multimodal.

Enricher#

BUILTIN_STOPWORDS

Frozenset of built-in stopwords used by the enricher.

EnricherConfig

Configuration for NLPEnricher.

NLPEnricher

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Export#

export_documents

Export a list of documents to output_path in the given format.

load_documents

Load CorpusDocument instances from a previously exported file.

Metadata#

CollectionManifest

Descriptor for a named corpus collection.

CorpusStats

Aggregate statistics over a CorpusDocument collection.

compute_stats

Compute aggregate statistics over a document collection.

provenance_from_filename

Extract provenance metadata from a source filename using heuristics.

Normalizers#

DedupLinesNormalizer

Remove exact duplicate lines while preserving first-occurrence order.

HTMLStripNormalizer

Remove HTML and XML tags from the document text.

LanguageDetectionNormalizer

Detect document language and set CorpusDocument.language.

LowercaseNormalizer

Convert the document text to lowercase.

NormalizationPipeline

Apply a sequence of normalisers in order.

NormalizerBase

Abstract base class for all text normalisers.

UnicodeNormalizer

Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).

WhitespaceNormalizer

Collapse runs of whitespace and optionally strip leading/trailing space.

NormalizerConfig

Configuration for TextNormalizer.

TextNormalizer

Pipeline component that populates normalized_text on CorpusDocument instances.

normalize_text

Normalise text according to config.
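The normalizers above compose into an ordered pipeline, each stage feeding its output to the next. A standard-library sketch of that composition (plain functions stand in for the NormalizerBase classes):

```python
import re
import unicodedata


def unicode_nfc(text: str) -> str:
    """Apply Unicode NFC normalisation."""
    return unicodedata.normalize("NFC", text)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text: str) -> str:
    return text.lower()


def normalization_pipeline(text, steps):
    """Apply each normaliser in order."""
    for step in steps:
        text = step(text)
    return text


out = normalization_pipeline(
    "  Hello\n\nWORLD  ",
    [unicode_nfc, collapse_whitespace, lowercase],
)
# out == "hello world"
```

Order matters: Unicode normalisation should generally run before case folding and whitespace collapsing so composed characters compare consistently.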

Pipeline#

CorpusPipeline

Orchestrates the full corpus ingestion pipeline.

PipelineResult

Immutable summary of a single pipeline run.

create_corpus

Create and export a corpus from a single source file.

Readers#

ALTOReader

ALTO XML reader for scanned document archives.

AudioReader

Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.

CustomReader

Fully user-customizable reader for any file extension and resource type.

normalize_extractor_output

Coerce an extractor return value to a list of raw chunk dicts.

ImageReader

OCR-based text extraction from raster image files.

PDFReader

PDF document reader with pdfminer.six → pypdf cascade.

MarkdownReader

Markdown document reader.

ReSTReader

reStructuredText document reader.

TextReader

Plain-text document reader.

VideoReader

Text extraction from video files via subtitle parsing and/or automatic speech recognition.

WebReader

Fetch a web page and extract structured text via BeautifulSoup.

YouTubeReader

Extract the transcript of a YouTube video using youtube-transcript-api.

TEIReader

TEI/XML document reader with dramatic structure extraction.

XMLReader

Generic XML document reader with configurable XPath.

ZipReader

Generic ZIP archive reader — dispatches each member to its natural reader.

Registry#

ComponentRegistry

Central look-up table for corpus pipeline components.

registry

Central look-up table for corpus pipeline components.

Similarity#

SearchConfig

Configuration for similarity search.

SearchResult

A single search result.

SimilarityIndex

Multi-mode similarity index over CorpusDocument collections.
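The core of embedding-based search is ranking stored vectors by cosine similarity to a query vector. A dependency-free sketch of that ranking step (SimilarityIndex supports multiple modes; this shows only the dense-vector case):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, index, k=2):
    """Rank (doc_id, vector) pairs by cosine similarity to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


index = [("d1", [1.0, 0.0]), ("d2", [0.0, 1.0]), ("d3", [0.7, 0.7])]
hits = top_k([1.0, 0.2], index, k=2)
```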

Schema#

SectionType

Semantic label for the role of a text chunk within its source document.

ChunkingStrategy

Describes how a CorpusDocument was segmented from raw text.

ExportFormat

Supported serialisation targets for a completed corpus.

SourceType

Semantic label for the kind of source from which a document was read.

MatchMode

Search mode for intertextual matching queries against a corpus index.

Modality

Primary content modality of a CorpusDocument.

ErrorPolicy

Per-document error handling strategy for PipelineGuard.

CorpusDocument

Canonical representation of a single text chunk in a processed corpus.

_PROMOTED_RAW_KEYS

Frozenset of raw metadata keys promoted to first-class CorpusDocument fields.

documents_to_pandas

Convert a list of CorpusDocument instances to a pandas.DataFrame.

documents_to_polars

Convert a list of CorpusDocument instances to a polars.DataFrame.

Source#

CorpusSource

Declarative descriptor for one or more document sources.

SourceEntry

A single resolved source entry yielded by CorpusSource.iter_entries.

SourceKind

Discriminant for the kind of source an entry represents.

Storage#

InMemoryStorage

Thread-safe in-memory dict store.

JSONLStorage

Append-friendly JSONL (newline-delimited JSON) flat-file store.

QueryResult

Result container returned by StorageBase.query.

SQLiteStorage

SQLite-backed corpus store with FTS5 full-text search.

StorageBase

Abstract base class for all corpus storage backends.

StorageQuery

Query parameters for StorageBase.query.
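An append-friendly JSONL store writes one JSON object per line, so new documents can be added without rewriting the file. A minimal sketch of that storage pattern (field names and method signatures are illustrative, not the JSONLStorage API):

```python
import json
import tempfile
from pathlib import Path


class JSONLStore:
    """Append-friendly flat-file store: one JSON object per line."""
    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, record: dict) -> None:
        """Append a single record; no rewrite of existing lines."""
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    def load(self) -> list[dict]:
        """Read all records back, skipping blank lines."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


store = JSONLStore(Path(tempfile.mkdtemp()) / "corpus.jsonl")
store.append({"id": "doc-1", "text": "first chunk"})
store.append({"id": "doc-2", "text": "second chunk"})
records = store.load()
```

The trade-off versus the SQLite backend is clear: JSONL is trivial to append and stream, while SQLite adds indexed queries and FTS5 full-text search.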

URL#

URLKind

Classification of a URL for routing to the correct handler.

classify_url

Classify a URL into one of the known URLKind categories.

download_url

Download a URL to a local file.

infer_extension

Infer a file extension from HTTP response headers and URL path.

probe_url_kind

Probe a URL with a HEAD request to classify by Content-Type.

resolve_url

Resolve a provider-specific URL to a direct-download URL.
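URL classification routes each input to the right reader: provider URLs (e.g. YouTube) go to transcript extraction, direct file links are downloaded, and everything else is fetched as a web page. A hedged sketch of such a routing heuristic (the category names and rules are illustrative, not the URLKind enum):

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Rough routing heuristic over host and path (illustrative categories)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if "youtube.com" in host or host == "youtu.be":
        return "youtube"
    if parsed.path.lower().endswith((".pdf", ".txt", ".zip")):
        return "file"
    return "webpage"


kind = classify_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
# kind == "youtube"
```

When the path gives no extension hint, probe_url_kind's HEAD-request strategy (classifying by Content-Type) is the natural fallback.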