CorpusBuilder

class scikitplot.corpus.CorpusBuilder(config=None)

Unified corpus builder — end-to-end pipeline orchestrator.

Parameters:
config : BuilderConfig or None, optional

Pipeline configuration. None uses defaults.

See also

scikitplot.corpus._pipeline.CorpusPipeline

Lower-level pipeline (used internally by the builder).

scikitplot.corpus._adapters

Conversion functions for downstream consumers.

scikitplot.corpus._similarity.SimilarityIndex

Similarity search index that backs search().

Notes

User note: Typical usage:

from scikitplot.corpus import CorpusBuilder, BuilderConfig

# Simple: process a directory of PDFs
builder = CorpusBuilder()
result = builder.build("./papers/")

# Full pipeline: chunk → normalise → enrich → embed → index
config = BuilderConfig(
    chunker="paragraph",
    normalize=True,
    enrich=True,
    embed=True,
    build_index=True,
    collection_id="shakespeare-corpus",
)
builder = CorpusBuilder(config)
result = builder.build(["hamlet.txt", "othello.txt"])

# Search
results = builder.search("To be or not to be")

# Export to LangChain
lc_docs = builder.to_langchain()

# Export to MCP
mcp_result = builder.to_mcp_tool_result("death soliloquy")

Developer note: The builder is the single orchestration point. It lazily creates component instances on first use and caches them. Each build() call produces an independent BuildResult.
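The lazy-creation-and-caching behaviour described in the developer note can be pictured with a small stand-in class. This is only a sketch of the pattern, not CorpusBuilder's actual code: LazyOrchestrator, Embedder, and the created counter are hypothetical names invented for illustration.

```python
from functools import cached_property


class Embedder:
    """Hypothetical stand-in for an expensive pipeline component."""

    created = 0  # counts how many instances were constructed

    def __init__(self) -> None:
        Embedder.created += 1

    def embed(self, text: str) -> list[float]:
        # Toy embedding: just the text length, to keep the sketch runnable.
        return [float(len(text))]


class LazyOrchestrator:
    """Hypothetical orchestrator that builds components on first use."""

    @cached_property
    def embedder(self) -> Embedder:
        # Runs once; the result is stored on the instance afterwards.
        return Embedder()


orchestrator = LazyOrchestrator()
created_before_use = Embedder.created   # nothing is built at construction time
first = orchestrator.embedder           # first access creates the component
second = orchestrator.embedder          # later accesses reuse the cached one
```

Under this pattern, constructing the orchestrator stays cheap, and a component that is never needed (for example an embedder when embedding is disabled) is never instantiated.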

Examples

>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(BuilderConfig(embed=True))
>>> result = builder.build("./data/books/")
>>> print(result.summary())
>>> lc_docs = builder.to_langchain()

add(sources, *, source_title=None, source_author=None, source_type=None, collection_id=None, rebuild_index=True)

Add sources to an existing corpus without re-processing the documents already in it.

Incrementally ingests new sources and appends their documents to the existing BuildResult. Optionally rebuilds the similarity index to include the new documents.

Parameters:
sources : str, Path, or Sequence[str | Path]

File path(s), directory path(s), or URL(s) to add.

source_title : str or None, optional

Override title for new sources.

source_author : str or None, optional

Override author for new sources.

source_type : str or None, optional

Override source_type for new sources (e.g. "audio"). When None, the type is inferred from each file's extension. Default: None.

collection_id : str or None, optional

Override collection id for new sources.

rebuild_index : bool, optional

When True and config.build_index is enabled, rebuild the similarity index with all documents (existing + new). Default: True.

Returns:
BuildResult

The updated result containing all documents.

Raises:
RuntimeError

If build() has not been called yet.

ValueError

If no valid sources are found.

Return type:

BuildResult

Notes

User note: Use this to extend a corpus after the initial build():

builder = CorpusBuilder(config)
result = builder.build("./initial_data/")
result = builder.add("./new_data/")
result = builder.add("https://example.com/article")

Developer note: Normalisation, enrichment, and embedding are applied to the new documents only. The index is rebuilt from scratch over all documents because incremental index updates are not supported by all backends.
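The trade-off in the developer note can be sketched in plain Python. Everything here is hypothetical (process, build_index, add_documents, and the dict-based inverted index are stand-ins, not scikitplot internals): per-document work runs only on the new items, while the index is recomputed over the combined list.

```python
from collections import defaultdict
from typing import Optional


def process(text: str) -> str:
    """Stand-in for normalisation/enrichment/embedding of one document."""
    return text.strip().lower()


def build_index(documents: list[str]) -> dict[str, set[int]]:
    """Build a toy inverted index (token -> document positions) from scratch."""
    index: dict[str, set[int]] = defaultdict(set)
    for position, document in enumerate(documents):
        for token in document.split():
            index[token].add(position)
    return dict(index)


def add_documents(
    existing: list[str], new_raw: list[str], rebuild_index: bool = True
) -> tuple[list[str], Optional[dict[str, set[int]]]]:
    """Process only the new documents, then optionally re-index everything."""
    combined = existing + [process(raw) for raw in new_raw]
    index = build_index(combined) if rebuild_index else None
    return combined, index


corpus = [process("To be or not to be")]
corpus, index = add_documents(corpus, ["  Brevity is the soul of wit  "])
```

When several batches are added in a row, passing rebuild_index=False for all but the last call avoids repeated full rebuilds; in this sketch the index is then simply None until the final call.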

build(sources, *, source_title=None, source_author=None, collection_id=None)

Build a corpus from one or more sources.

Parameters:
sources : str, Path, or Sequence[str | Path]

File path(s), directory path(s), or URL(s). Accepts:

  • A single file path: "hamlet.txt"

  • A directory: "./papers/" (recursive)

  • A URL: "https://example.com/article"

  • A list of any mix: ["a.pdf", "b.mp4", "https://..."]

source_title : str or None, optional

Override config.source_title for this build.

source_author : str or None, optional

Override config.source_author.

collection_id : str or None, optional

Override config.collection_id.

Returns:
BuildResult

The build result with documents, counts, and index.

Raises:
ValueError

If no valid sources are found.

Return type:

BuildResult

close()

Clean up temporary files created during downloads/extraction.

Notes

Safe to call multiple times. After calling, the builder can still be used; a new temporary directory is created on the next download.

Return type:

None
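The "safe to call multiple times, reusable afterwards" contract can be illustrated with a small stand-in. This is a sketch of the idea only; TempWorkspace, its tmp_dir() method, and the directory prefix are hypothetical and do not come from scikitplot.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempWorkspace:
    """Hypothetical holder for a lazily created temporary directory."""

    def __init__(self) -> None:
        self._tmp_dir: Optional[Path] = None

    def tmp_dir(self) -> Path:
        # Create the directory on demand, e.g. right before a download.
        if self._tmp_dir is None:
            self._tmp_dir = Path(tempfile.mkdtemp(prefix="corpus-sketch-"))
        return self._tmp_dir

    def close(self) -> None:
        # Idempotent: a second call finds nothing left to remove.
        if self._tmp_dir is not None:
            shutil.rmtree(self._tmp_dir, ignore_errors=True)
            self._tmp_dir = None


workspace = TempWorkspace()
first = workspace.tmp_dir()
workspace.close()
workspace.close()                      # calling again is harmless
first_removed = not first.exists()
second = workspace.tmp_dir()           # a fresh directory appears after close()
second_created = second.is_dir()
workspace.close()
```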

export(path, *, format='parquet', **kwargs)

Export documents to a file.

Parameters:
path : str or Path

Output file path.

format : str, optional

Output format: one of "csv", "parquet", "jsonl", "json", or "pickle". Default: "parquet".

**kwargs

Additional keyword arguments passed to the export function.

Returns:
Path

The output file path.

Return type:

Path
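Because format defaults to "parquet" regardless of the file name, callers may want to derive it from the output path. The helper below is a hypothetical convenience, not part of scikitplot: the format names are the ones listed above, but the suffix mapping (including ".pkl") is an assumption made for this sketch.

```python
from pathlib import Path
from typing import Union

# Format names come from the export() documentation; the suffix-to-format
# mapping itself is an assumption made for this sketch.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".jsonl": "jsonl",
    ".json": "json",
    ".pkl": "pickle",
    ".pickle": "pickle",
}


def format_for(path: Union[str, Path]) -> str:
    """Return the export format name implied by the file suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"no known export format for suffix {suffix!r}") from None
```

A call would then look like builder.export(path, format=format_for(path)), assuming a builder that has already been built.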

search(query, *, top_k=10, match_mode='hybrid', **kwargs)

Search the built corpus.

Parameters:
query : str

Natural language query.

top_k : int, optional

Maximum number of results to return. Default: 10.

match_mode : str, optional

Matching strategy: one of "strict", "keyword", "semantic", or "hybrid". Default: "hybrid".

**kwargs

Additional SearchConfig parameters.

Returns:
list[SearchResult]

Ranked results.

Raises:
RuntimeError

If no index has been built.

RuntimeError

If match_mode is "semantic" or "hybrid" and no embedding engine is configured.

Return type:

list[Any]
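Since "semantic" and "hybrid" raise RuntimeError without an embedding engine, a caller can degrade to keyword matching. The wrapper below is a minimal sketch that relies only on the search() signature documented above; search_with_fallback and the _KeywordOnlyBuilder stand-in are hypothetical names used to keep the example runnable without a real corpus.

```python
from typing import Any


def search_with_fallback(builder: Any, query: str, top_k: int = 10) -> list[Any]:
    """Try hybrid matching first; retry with keyword matching on RuntimeError."""
    try:
        return builder.search(query, top_k=top_k, match_mode="hybrid")
    except RuntimeError:
        # Raised when no embedding engine is configured. If the cause is a
        # missing index instead, the retry below raises RuntimeError again.
        return builder.search(query, top_k=top_k, match_mode="keyword")


class _KeywordOnlyBuilder:
    """Hypothetical stand-in mimicking a builder without an embedding engine."""

    def search(self, query: str, *, top_k: int = 10, match_mode: str = "hybrid") -> list[str]:
        if match_mode in ("semantic", "hybrid"):
            raise RuntimeError("no embedding engine configured")
        return [f"{match_mode}:{query}"][:top_k]


hits = search_with_fallback(_KeywordOnlyBuilder(), "to be or not to be", top_k=3)
```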

to_huggingface()

Export as a Hugging Face Dataset.

Returns:
datasets.Dataset or dict[str, list]
Return type:

Any

to_jsonl()

Export as JSONL lines.

Yields:
str
Return type:

Iterator[str]
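Because to_jsonl() yields strings rather than writing a file, a caller can stream them to disk one at a time. The sketch below accepts any iterable of JSON strings; write_jsonl and fake_jsonl are hypothetical helpers, and whether the yielded lines already end with a newline is not stated here, so the sketch normalises both cases.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(lines: Iterable[str], path: Path) -> int:
    """Write one JSON document per line and return the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for line in lines:
            json.loads(line)  # fail fast if a line is not valid JSON
            handle.write(line.rstrip("\n") + "\n")
            count += 1
    return count


def fake_jsonl() -> Iterator[str]:
    """Stand-in for builder.to_jsonl(); the record fields are made up."""
    for i, text in enumerate(["alpha", "beta"]):
        yield json.dumps({"id": i, "text": text})


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "corpus.jsonl"
    written = write_jsonl(fake_jsonl(), target)
    round_trip = [
        json.loads(row) for row in target.read_text(encoding="utf-8").splitlines()
    ]
```

With a real builder, the call would be write_jsonl(builder.to_jsonl(), Path("corpus.jsonl")), which keeps memory use flat because no intermediate list is built.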

to_langchain()

Export documents as LangChain Document objects.

Returns:
list[langchain_core.documents.Document] or list[dict]
Return type:

list[Any]

to_langchain_retriever()

Create a LangChain-compatible retriever.

Returns:
LangChainCorpusRetriever
Return type:

Any

to_langgraph_state(query='', match_mode='')

Export as LangGraph-compatible state dict.

Returns:
dict[str, Any]
Parameters:
  • query (str)

  • match_mode (str)

Return type:

dict[str, Any]

to_mcp_resources(uri_prefix='corpus://')

Export as MCP resources.

Returns:
list[dict[str, Any]]
Parameters:

uri_prefix (str)

Return type:

list[dict[str, Any]]

to_mcp_server(server_name='corpus-search')

Create an MCP server adapter.

Parameters:
server_name : str, optional

MCP server name.

Returns:
MCPCorpusServer
Return type:

Any

to_mcp_tool_result(query, *, top_k=10, match_mode='hybrid')

Search the corpus and format the results as an MCP tool response.

Parameters:
query : str

Search query.

top_k : int, optional

Maximum number of results. Default: 10.

match_mode : str, optional

Match mode used for the search. Default: "hybrid".

Returns:
dict[str, Any]

MCP tools/call response.

Return type:

dict[str, Any]
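For orientation, an MCP tools/call result is generally a dict with a content list and an isError flag. The sketch below shows that generic shape only; the exact payload produced by to_mcp_tool_result() is not specified on this page, and as_tool_result and the hit fields ("text", "score") are hypothetical.

```python
import json
from typing import Any


def as_tool_result(hits: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap search hits in the generic MCP tools/call result shape."""
    return {
        # One text content item per hit, serialised as JSON.
        "content": [{"type": "text", "text": json.dumps(hit)} for hit in hits],
        "isError": False,
    }


example = as_tool_result([{"text": "To be, or not to be", "score": 0.9}])
```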

to_rag_tuples()

Export as (text, metadata, embedding) tuples.

Returns:
list[tuple[str, dict, Any]]
Return type:

list[tuple[str, dict[str, Any], Any]]
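Many vector stores expect texts, metadata, and embeddings as three parallel lists rather than a list of tuples. The helper below is a small sketch of that conversion; split_rag_tuples is a hypothetical name, and the sample rows (including the "source" metadata key and the embedding values) are invented for illustration.

```python
from typing import Any, Iterable


def split_rag_tuples(
    rows: Iterable[tuple[str, dict[str, Any], Any]],
) -> tuple[list[str], list[dict[str, Any]], list[Any]]:
    """Turn (text, metadata, embedding) tuples into three parallel lists."""
    texts: list[str] = []
    metadatas: list[dict[str, Any]] = []
    embeddings: list[Any] = []
    for text, metadata, embedding in rows:
        texts.append(text)
        metadatas.append(metadata)
        embeddings.append(embedding)
    return texts, metadatas, embeddings


# Invented sample rows standing in for builder.to_rag_tuples().
sample_rows = [
    ("To be, or not to be", {"source": "hamlet.txt"}, [0.1, 0.2]),
    ("Put out the light", {"source": "othello.txt"}, [0.3, 0.4]),
]
texts, metadatas, embeddings = split_rag_tuples(sample_rows)
```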