CorpusBuilder

class scikitplot.corpus.CorpusBuilder(config=None)

Unified corpus builder — end-to-end pipeline orchestrator.

Parameters:

config (BuilderConfig or None, optional): Pipeline configuration. If None, defaults are used.

Parameters:

config (BuilderConfig | None)

See also

scikitplot.corpus._pipeline.CorpusPipeline: Lower-level pipeline (used internally by the builder).
scikitplot.corpus._adapters: Conversion functions for downstream consumers.
scikitplot.corpus._similarity.SimilarityIndex: Search engine.

Notes

User note: Typical usage:

from scikitplot.corpus import CorpusBuilder, BuilderConfig

# Simple: process a directory of PDFs
builder = CorpusBuilder()
result = builder.build("./papers/")

# Full pipeline: chunk → normalise → enrich → embed → index
config = BuilderConfig(
    chunker="paragraph",
    normalize=True,
    enrich=True,
    embed=True,
    build_index=True,
    collection_id="shakespeare-corpus",
)
builder = CorpusBuilder(config)
result = builder.build(["hamlet.txt", "othello.txt"])

# Search
results = builder.search("To be or not to be")

# Export to LangChain
lc_docs = builder.to_langchain()

# Export to MCP
mcp_result = builder.to_mcp_tool_result("death soliloquy")

Developer note: The builder is the single orchestration point. It lazily creates component instances on first use and caches them. Each build() call produces an independent BuildResult.
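
The create-on-first-use behaviour described in the developer note can be pictured with a small stand-in class. This is a minimal sketch under stated assumptions, not the actual CorpusBuilder implementation: ToyBuilder, ToyEmbedder, and the embedder property are hypothetical names, and functools.cached_property is only one possible way to get lazy, cached components.

```python
from functools import cached_property


class ToyEmbedder:
    """Hypothetical stand-in for an expensive pipeline component."""

    instances_created = 0

    def __init__(self) -> None:
        ToyEmbedder.instances_created += 1

    def embed(self, text: str) -> list[float]:
        # Toy "embedding": text length and vowel count, for illustration only.
        return [float(len(text)), float(sum(c in "aeiou" for c in text.lower()))]


class ToyBuilder:
    """Hypothetical builder that creates its embedder only when first needed."""

    @cached_property
    def embedder(self) -> ToyEmbedder:
        # Runs once; later accesses return the cached instance.
        return ToyEmbedder()

    def build(self, texts: list[str]) -> list[list[float]]:
        # Each call returns a fresh, independent result list.
        return [self.embedder.embed(t) for t in texts]


builder = ToyBuilder()
created_before = ToyEmbedder.instances_created  # nothing created yet
first = builder.build(["hamlet"])
second = builder.build(["othello"])
created_after = ToyEmbedder.instances_created  # one instance, shared by both builds
```

Constructing the toy builder is cheap because nothing heavy is instantiated until build() first touches the component, and the two results are separate objects even though they share the cached embedder.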

Examples

>>> from scikitplot.corpus import BuilderConfig, CorpusBuilder
>>> builder = CorpusBuilder(BuilderConfig(embed=True))
>>> result = builder.build("./data/books/")
>>> print(result.summary())
>>> lc_docs = builder.to_langchain()

add(input_path, *, source_title=None, source_author=None, source_type=None, collection_id=None, rebuild_index=True)

Add sources to an existing corpus without re-processing the documents already in it.

Incrementally ingests new sources and appends their documents to the existing BuildResult. Optionally rebuilds the similarity index to include the new documents.

Parameters:

input_path (str, Path, or Sequence[str | Path]): File path(s), directory path(s), or URL(s) to add.
source_title (str or None, optional): Override title for new sources.
source_author (str or None, optional): Override author for new sources.
source_type (str or None, optional): Override source_type for new sources (e.g. "audio"). When None, the type is inferred from each file extension. Default: None.
collection_id (str or None, optional): Override collection id for new sources.
rebuild_index (bool, optional): When True and config.build_index is enabled, rebuild the similarity index with all documents (existing + new). Default: True.

Returns:

BuildResult: The updated result containing all documents.

Raises:

RuntimeError: If build() has not been called yet.
ValueError: If no valid sources are found.

Parameters:

input_path (str | Path | Sequence[str | Path])
source_title (str | None)
source_author (str | None)
source_type (str | None)
collection_id (str | None)
rebuild_index (bool)

Return type:

BuildResult

Notes

User note: Use this to extend a corpus after the initial build():

builder = CorpusBuilder(config)
result = builder.build("./initial_data/")
result = builder.add("./new_data/")
result = builder.add("https://example.com/article")

Developer note: Normalisation, enrichment, and embedding are applied to the new documents only. The index is rebuilt from scratch over all documents because incremental index updates are not supported by all backends.
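
The split described above (per-document work applied to new documents only, index rebuilt over everything) can be illustrated with a toy corpus. This is a sketch under assumptions rather than the library's code: ToyCorpus, its _process step, and the inverted-index layout are hypothetical, and the "has build() been called" check is approximated by testing for existing documents.

```python
from collections import defaultdict


class ToyCorpus:
    """Hypothetical corpus: per-document processing plus a rebuildable index."""

    def __init__(self) -> None:
        self.documents: list[str] = []
        self.index: dict[str, set[int]] = {}
        self.processed_count = 0  # how many documents went through processing

    def _process(self, text: str) -> str:
        # Stand-in for normalisation, enrichment, and embedding.
        self.processed_count += 1
        return " ".join(text.lower().split())

    def _rebuild_index(self) -> None:
        # Rebuilt from scratch over ALL documents, existing and new.
        index: dict[str, set[int]] = defaultdict(set)
        for doc_id, doc in enumerate(self.documents):
            for token in doc.split():
                index[token].add(doc_id)
        self.index = dict(index)

    def build(self, texts: list[str]) -> None:
        self.documents = [self._process(t) for t in texts]
        self._rebuild_index()

    def add(self, texts: list[str], rebuild_index: bool = True) -> None:
        if not self.documents:
            raise RuntimeError("call build() before add()")
        # Only the new texts are processed; existing documents are untouched.
        self.documents.extend(self._process(t) for t in texts)
        if rebuild_index:
            self._rebuild_index()


corpus = ToyCorpus()
corpus.build(["To be or not to be", "Brevity is the soul of wit"])
corpus.add(["The rest is silence"])
```

After the add() call, three documents have been processed in total (two at build time, one at add time), while the index covers all three.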

build(input_path, *, source_title=None, source_author=None, collection_id=None)

Build a corpus from one or more sources.

Parameters:

input_path (str, Path, or Sequence[str | Path])

File path(s), directory path(s), or URL(s). Accepts:

A single file path: "hamlet.txt"
A directory: "./papers/" (recursive)
A URL: "https://example.com/article"
A list of any mix: ["a.pdf", "b.mp4", "https://..."]

source_title (str or None, optional)

Override config.source_title for this build.

source_author (str or None, optional)

Override config.source_author for this build.

collection_id (str or None, optional)

Override config.collection_id for this build.

Returns:

BuildResult: The build result with documents, counts, and index.

Raises:

ValueError: If no valid input_path sources are found.

Parameters:

input_path (str | Path | Sequence[str | Path])
source_title (str | None)
source_author (str | None)
collection_id (str | None)

Return type:

BuildResult
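
Because input_path accepts a single path, a URL, or a mixed list, a caller-side helper that normalises these shapes can make the accepted forms concrete. The sketch below is illustrative only: classify_sources is a hypothetical function, it does not touch the filesystem or the network, and the builder's own resolution logic (including recursive directory expansion) may differ.

```python
from collections.abc import Sequence
from pathlib import Path
from urllib.parse import urlparse


def classify_sources(input_path) -> list[tuple[str, str]]:
    """Return (kind, source) pairs, where kind is "url" or "path".

    Hypothetical helper; performs no I/O.
    """
    # A lone str or Path is treated as a one-element list.
    if isinstance(input_path, (str, Path)):
        items = [input_path]
    elif isinstance(input_path, Sequence):
        items = list(input_path)
    else:
        raise TypeError(f"unsupported input_path type: {type(input_path).__name__}")

    classified = []
    for item in items:
        text = str(item)
        # Only http(s) schemes count as URLs; everything else is a path.
        scheme = urlparse(text).scheme
        kind = "url" if scheme in ("http", "https") else "path"
        classified.append((kind, text))
    if not classified:
        raise ValueError("no valid sources found")
    return classified


single = classify_sources("hamlet.txt")
mixed = classify_sources(["a.pdf", Path("b.mp4"), "https://example.com/article"])
```

An empty list raises ValueError in this sketch, mirroring the documented error for a build with no valid sources.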

close()

Clean up temporary files created during downloads/extraction.

Notes

Safe to call multiple times. After calling, the builder can still be used — a new temp directory will be created on next download.

Return type:

None
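
The "safe to call multiple times" and "recreated on next download" guarantees can be sketched with the standard library. TempWorkspace and its tmp_dir method are hypothetical names used for illustration; how the builder actually tracks its temporary directory is not specified here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempWorkspace:
    """Hypothetical holder for a lazily created temporary directory."""

    def __init__(self) -> None:
        self._tmp_dir: Optional[Path] = None

    def tmp_dir(self) -> Path:
        # Create the directory on first use, or again after close().
        if self._tmp_dir is None:
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="corpus-"))
        return self._tmp_dir

    def close(self) -> None:
        # Safe to call repeatedly: a second call finds nothing to remove.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None


workspace = TempWorkspace()
first_dir = workspace.tmp_dir()
(first_dir / "download.bin").write_bytes(b"payload")
workspace.close()
workspace.close()  # no error on the second call
first_dir_removed = not first_dir.exists()
second_dir = workspace.tmp_dir()  # a fresh directory after close()
second_dir_exists = second_dir.exists()
workspace.close()
```

Resetting the stored path to None on close is what lets the same object be reused afterwards.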

export(path, *, format='parquet', **kwargs)

Export documents to a file.

Parameters:

path (str or Path): Output file path.
format (str, optional): Output format, one of "csv", "parquet", "jsonl", "json", or "pickle". Default: "parquet".
**kwargs: Additional keyword arguments passed to the export function.

Returns:

Path: The output file path.

Parameters:

path (str | Path)
format (str)
kwargs (Any)

Return type:

Path

search(query, *, top_k=10, match_mode='hybrid', **kwargs)

Search the built corpus.

Parameters:

query (str): Natural-language query.
top_k (int, optional): Maximum number of results to return. Default: 10.
match_mode (str, optional): One of "strict", "keyword", "semantic", or "hybrid". Default: "hybrid".
**kwargs: Additional SearchConfig parameters.

Returns:

list[SearchResult]: Ranked results.

Raises:

RuntimeError: If no index has been built.
RuntimeError: If match_mode is "semantic" or "hybrid" and no embedding engine is configured.

Parameters:

query (str)
top_k (int)
match_mode (str)
kwargs (Any)

Return type:

list[Any]

to_huggingface()

Export documents as a Hugging Face Dataset.

Returns:

datasets.Dataset or dict[str, list]

Return type:

Any

to_jsonl()

Export as JSONL lines.

Yields:

str

Return type:

Iterator[str]
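
Since to_jsonl() yields strings rather than returning one large string, it can be consumed lazily. The sketch below uses a stand-in generator so that it runs without the library: fake_to_jsonl, the record fields, and the assumption that each yielded line carries no trailing newline are all hypothetical.

```python
import json
from collections.abc import Iterator


def fake_to_jsonl() -> Iterator[str]:
    """Hypothetical stand-in for builder.to_jsonl(); field names are assumed."""
    records = [
        {"text": "To be, or not to be", "source": "hamlet.txt"},
        {"text": "Put out the light", "source": "othello.txt"},
    ]
    for record in records:
        # One compact JSON object per line, no trailing newline (assumption).
        yield json.dumps(record, ensure_ascii=False)


# Lines can be joined into JSONL text (or written to a file one by one) ...
lines = list(fake_to_jsonl())
jsonl_text = "\n".join(lines) + "\n"

# ... and parsed back one record at a time.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line]
sources = [record["source"] for record in parsed]
```

With the real builder, the same loop shape would iterate over builder.to_jsonl() instead of the stand-in.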

to_langchain()

Export documents as LangChain Document objects.

Returns:

list[langchain_core.documents.Document] or list[dict]

Return type:

list[Any]

to_langchain_retriever()

Create a LangChain-compatible retriever.

Returns:

LangChainCorpusRetriever

Return type:

Any

to_langgraph_state(query='', match_mode='')

Export as LangGraph-compatible state dict.

Returns:

dict[str, Any]

Parameters:

query (str)
match_mode (str)

Return type:

dict[str, Any]

to_mcp_resources(uri_prefix='corpus://')

Export as MCP resources.

Returns:

list[dict[str, Any]]

Parameters:

uri_prefix (str)

Return type:

list[dict[str, Any]]

to_mcp_server(server_name='corpus-search')

Create an MCP server adapter.

Parameters:

server_name (str, optional): MCP server name. Default: "corpus-search".

Returns:

MCPCorpusServer

Parameters:

server_name (str)

Return type:

Any

to_mcp_tool_result(query, *, top_k=10, match_mode='hybrid')

Search the corpus and format the results as an MCP tool response.

Parameters:

query (str): Search query.
top_k (int, optional): Maximum number of results. Default: 10.
match_mode (str, optional): Match mode. Default: "hybrid".

Returns:

dict[str, Any]: MCP tools/call response.

Parameters:

query (str)
top_k (int)
match_mode (str)

Return type:

dict[str, Any]
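
For readers unfamiliar with the shape of an MCP tools/call response, the sketch below wraps a list of search hits in the general structure the MCP specification describes (a content list of typed items plus an isError flag). It is an assumption that to_mcp_tool_result produces exactly this layout: format_tool_result is a hypothetical helper, and the hit fields shown may not match the real SearchResult.

```python
import json
from typing import Any


def format_tool_result(hits: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap search hits in an MCP-style tools/call result (hypothetical helper)."""
    return {
        "content": [
            # Each hit becomes one text item carrying its JSON serialisation.
            {"type": "text", "text": json.dumps(hit, ensure_ascii=False)}
            for hit in hits
        ],
        "isError": False,
    }


# Hypothetical hits; the real SearchResult fields may differ.
hits = [
    {"text": "To be, or not to be", "score": 0.92},
    {"text": "The rest is silence", "score": 0.41},
]
tool_result = format_tool_result(hits)
```

Serialising each hit as JSON inside a text item keeps the payload readable by any MCP client, at the cost of requiring the client to parse it again.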

to_rag_tuples()

Export as (text, metadata, embedding) tuples.

Returns:

list[tuple[str, dict, Any]]

Return type:

list[tuple[str, dict[str, Any], Any]]
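
The (text, metadata, embedding) layout is convenient for hand-rolled retrieval. The following sketch ranks such tuples by cosine similarity against a query vector. The data is made up for illustration, embeddings are assumed to be plain lists of floats, and in practice the query embedding would have to come from the same model that embedded the corpus.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical data in the documented (text, metadata, embedding) shape.
rag_tuples = [
    ("To be, or not to be", {"source": "hamlet.txt"}, [1.0, 0.0]),
    ("Put out the light", {"source": "othello.txt"}, [0.0, 1.0]),
    ("The rest is silence", {"source": "hamlet.txt"}, [0.8, 0.6]),
]

query_embedding = [1.0, 0.0]  # assumed to come from the same embedding model

# Sort by similarity to the query, most similar first.
ranked = sorted(
    rag_tuples,
    key=lambda item: cosine(query_embedding, item[2]),
    reverse=True,
)
top_text, top_meta, _ = ranked[0]
ranked_texts = [text for text, _, _ in ranked]
```

With the real builder, rag_tuples would be replaced by the return value of builder.to_rag_tuples().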