scikitplot.corpus#

A production-grade document corpus ingestion, chunking, filtering, embedding, and export pipeline for NLP and ML workflows.

This package is a ground-up rewrite of the remarx.sentence.corpus module, preserving all proven design patterns while resolving every known correctness, robustness, and maintainability issue identified during the migration audit.

Quick start#

Single file, no embedding:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusPipeline, ParagraphChunker
>>> pipeline = CorpusPipeline(chunker=ParagraphChunker())
>>> result = pipeline.run(Path("article.txt"))
>>> print(f"{result.n_documents} chunks from {result.source}")

Batch processing with sentence chunking:

>>> from scikitplot.corpus import CorpusPipeline, SentenceChunker, ExportFormat
>>> pipeline = CorpusPipeline(
...     chunker=SentenceChunker("en_core_web_sm"),
...     output_dir=Path("output/"),
...     export_format=ExportFormat.PARQUET,
... )
>>> results = pipeline.run_batch(list(Path("corpus/").glob("*.txt")))

URL ingestion:

>>> result = pipeline.run_url("https://en.wikipedia.org/wiki/Python")

YouTube transcript:

>>> result = pipeline.run_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

Image OCR:

>>> from scikitplot.corpus import DocumentReader
>>> reader = DocumentReader.create(Path("scan.png"))
>>> docs = list(reader.get_documents())

Video transcription (subtitle-first):

>>> reader = DocumentReader.create(Path("lecture.mp4"))
>>> docs = list(reader.get_documents())

With embeddings:

>>> from scikitplot.corpus import EmbeddingEngine
>>> engine = EmbeddingEngine(backend="sentence_transformers")
>>> pipeline = CorpusPipeline(
...     chunker=ParagraphChunker(),
...     embedding_engine=engine,
... )
>>> result = pipeline.run(Path("article.txt"))
>>> result.documents[0].has_embedding
True

Convenience function (direct replacement for remarx create_corpus):

>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )

High-level builder with normalisation, enrichment, embedding, and search:

>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(
...     BuilderConfig(
...         chunker="paragraph",
...         normalize=True,
...         enrich=True,
...         embed=True,
...         build_index=True,
...     )
... )
>>> result = builder.build("./data/")
>>> results = builder.search("quantum computing")
>>> lc_docs = builder.to_langchain()
>>> mcp_response = builder.to_mcp_tool_result("quantum computing")

Package structure#

scikitplot.corpus._schema

Core data types: CorpusDocument, SectionType, ChunkingStrategy, ExportFormat, SourceType, MatchMode.

scikitplot.corpus._base

Abstract bases: DocumentReader, ChunkerBase, FilterBase, DefaultFilter.

scikitplot.corpus._chunkers

SentenceChunker, ParagraphChunker, FixedWindowChunker.

scikitplot.corpus._readers

TextReader, MarkdownReader, ReSTReader, XMLReader, TEIReader, AudioReader, ALTOReader, PDFReader, ImageReader, VideoReader, WebReader, YouTubeReader.

scikitplot.corpus._embeddings

EmbeddingEngine – multi-backend embedding with disk cache.

scikitplot.corpus._pipeline

CorpusPipeline, PipelineResult, create_corpus.

scikitplot.corpus._export

export_documents, load_documents.

User guide. See the Corpus Generation section for further details.

Base#

ChunkerBase

Abstract base class for all text chunkers.

DefaultFilter

Standard noise filter ported and improved from remarx's include_sentence.

DocumentReader

Abstract base class for all format-specific document readers.

FilterBase

Abstract base class for corpus document filters.

Readers#

ALTOReader

ALTO XML reader for scanned document archives.

AudioReader

Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.

ImageReader

OCR-based text extraction from raster image files.

MarkdownReader

Markdown document reader.

PDFReader

PDF document reader with pdfminer.six → pypdf cascade.

ReSTReader

reStructuredText document reader.

TEIReader

TEI/XML document reader with dramatic structure extraction.

TextReader

Plain-text document reader.

VideoReader

Text extraction from video files via subtitle parsing and/or automatic speech recognition.

WebReader

Fetch a web page and extract structured text via BeautifulSoup.

XMLReader

Generic XML document reader with configurable XPath.

YouTubeReader

Extract the transcript of a YouTube video using youtube-transcript-api.
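
The PDFReader entry above mentions a pdfminer.six → pypdf cascade. The general fallback idea can be sketched as below. This is only an illustration of the pattern, not the reader's actual code: `extract_with_fallback`, `primary_backend`, and `secondary_backend` are hypothetical names, and the two backends are stand-ins that do not call either PDF library.

```python
from typing import Callable, Optional, Sequence


def extract_with_fallback(
    path: str, extractors: Sequence[Callable[[str], str]]
) -> Optional[str]:
    """Try each extractor in order and return the first non-empty text."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # this backend failed; move on to the next one
        if text.strip():
            return text
    return None  # every backend failed or produced only whitespace


def primary_backend(path: str) -> str:
    # Hypothetical stand-in for the preferred parser; always fails here.
    raise RuntimeError("cannot parse")


def secondary_backend(path: str) -> str:
    # Hypothetical stand-in for the fallback parser.
    return f"text recovered from {path}"


text = extract_with_fallback("report.pdf", [primary_backend, secondary_backend])
```

Ordering the extractors from most to least preferred keeps the policy in one list, so adding a third backend would not change the loop.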

Chunkers#

FixedWindowChunker

Produce fixed-size sliding-window chunks over a document.

ParagraphChunker

Split a document into paragraph-level Chunk objects.

SentenceChunker

Split a document into sentence-level Chunk objects.

WordChunker

Process a document at word level, producing normalised token chunks.
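
To make the "fixed-size sliding-window" idea behind FixedWindowChunker concrete, here is a minimal standard-library sketch. It assumes whitespace tokens as the unit and uses hypothetical `size` and `step` parameters; the real chunker's units, parameter names, and edge-case handling may differ.

```python
from typing import Iterator, List, Sequence


def sliding_windows(
    tokens: Sequence[str], size: int, step: int
) -> Iterator[List[str]]:
    """Yield windows of up to `size` tokens, advancing `step` tokens each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not tokens:
        return
    start = 0
    while True:
        yield list(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step


words = "the quick brown fox jumps over the lazy dog".split()
chunks = [" ".join(w) for w in sliding_windows(words, size=4, step=2)]
# chunks[0] is "the quick brown fox"; the final chunk is the shorter tail
# "the lazy dog".
```

Choosing `step` smaller than `size` makes consecutive chunks overlap, which is the usual reason to prefer a sliding window over disjoint blocks.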

Normalizers#

DedupLinesNormalizer

Remove exact duplicate lines while preserving first-occurrence order.

HTMLStripNormalizer

Remove HTML and XML tags from the document text.

LanguageDetectionNormalizer

Detect document language and set CorpusDocument.language.

LowercaseNormalizer

Convert the document text to lowercase.

NormalizationPipeline

Apply a sequence of normalisers in order.

NormalizerBase

Abstract base class for all text normalisers.

NormalizerConfig

Configuration for TextNormalizer.

TextNormalizer

Pipeline component that populates normalized_text on CorpusDocument instances.

UnicodeNormalizer

Apply Unicode normalisation (NFC, NFD, NFKC, or NFKD).

WhitespaceNormalizer

Collapse runs of whitespace and optionally strip leading/trailing space.

normalize_text

Normalise text according to config.
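
The normalisers listed above are described as steps applied in order. A rough standard-library sketch of three such steps (Unicode normalisation, whitespace collapsing, duplicate-line removal) follows. The function names are hypothetical and do not reproduce the package's classes or their options.

```python
import re
import unicodedata
from typing import Callable, Iterable


def unicode_nfkc(text: str) -> str:
    # NFKC folds compatibility characters, e.g. the "fi" ligature U+FB01.
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> str:
    # Collapse runs of spaces/tabs inside each line and strip line edges.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def dedup_lines(text: str) -> str:
    # Drop exact duplicate lines, keeping the first occurrence of each.
    seen = set()
    kept = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def apply_in_order(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    for step in steps:
        text = step(text)
    return text


raw = "\ufb01rst   line\nsecond\tline\n\ufb01rst   line\n"
clean = apply_in_order(raw, [unicode_nfkc, collapse_whitespace, dedup_lines])
```

Order matters in this sketch: deduplicating before collapsing whitespace would keep lines that differ only in spacing.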

Enricher#

EnricherConfig

Configuration for NLPEnricher.

NLPEnricher

Pipeline component that populates NLP enrichment fields on CorpusDocument.

Embeddings#

DEFAULT_MODEL

String constant naming the default embedding model.

EmbedFn

EmbeddingEngine

Multi-backend sentence embedding engine with SHA-256 file caching.
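
EmbeddingEngine is described as using SHA-256 file caching. One way such a cache could work is sketched below, under loud assumptions: the key is the SHA-256 of the model name plus the text, vectors are stored as JSON files, and `cached_embed` and `toy_embed` are hypothetical helpers. The engine's real key layout and storage format are not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_embed(
    text: str,
    model: str,
    embed: Callable[[str], List[float]],
    cache_dir: Path,
) -> List[float]:
    """Return the embedding for `text`, reusing a cached file when present."""
    # Assumed key scheme: hash of model name and text together.
    key = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = embed(text)
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector


calls = []


def toy_embed(text: str) -> List[float]:
    # Stand-in for a real model: records each call so cache hits are visible.
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = cached_embed("hello world", "toy-model", toy_embed, cache)
    second = cached_embed("hello world", "toy-model", toy_embed, cache)
```

Including the model name in the hashed key means that switching models cannot return a vector computed by a different model.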

Similarity#

SearchConfig

Configuration for similarity search.

SearchResult

A single search result.

SimilarityIndex

Multi-mode similarity index over CorpusDocument collections.
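
As a rough picture of what a dense-vector search over embedded documents involves, the following sketch ranks documents by cosine similarity in pure Python. It covers only this one mode, and `cosine` and `top_k` are hypothetical helpers rather than SimilarityIndex methods.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors score 0.0


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], doc_vectors, k=2)
# hits lists document 0 first, then document 2.
```

A brute-force scan like this is linear in the number of documents; larger corpora typically need an approximate nearest-neighbour structure instead.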

Pipeline#

CorpusPipeline

Orchestrates the full corpus ingestion pipeline.

PipelineResult

Immutable summary of a single pipeline run.

create_corpus

Create and export a corpus from a single source file.