CorpusBuilder
- class scikitplot.corpus.CorpusBuilder(config=None)
Unified corpus builder — end-to-end pipeline orchestrator.
- Parameters:
- config : BuilderConfig or None, optional
Pipeline configuration. None uses defaults.
See also
scikitplot.corpus._pipeline.CorpusPipeline : Lower-level pipeline (used internally by the builder).
scikitplot.corpus._adapters : Conversion functions for downstream consumers.
scikitplot.corpus._similarity.SimilarityIndex : Search engine.
Notes
User note: Typical usage:
    from scikitplot.corpus import CorpusBuilder, BuilderConfig

    # Simple: process a directory of PDFs
    builder = CorpusBuilder()
    result = builder.build("./papers/")

    # Full pipeline: chunk → normalise → enrich → embed → index
    config = BuilderConfig(
        chunker="paragraph",
        normalize=True,
        enrich=True,
        embed=True,
        build_index=True,
        collection_id="shakespeare-corpus",
    )
    builder = CorpusBuilder(config)
    result = builder.build(["hamlet.txt", "othello.txt"])

    # Search
    results = builder.search("To be or not to be")

    # Export to LangChain
    lc_docs = builder.to_langchain()

    # Export to MCP
    mcp_result = builder.to_mcp_tool_result("death soliloquy")
Developer note: The builder is the single orchestration point. It lazily creates component instances on first use and caches them. Each build() call produces an independent BuildResult.
Examples
>>> from scikitplot.corpus import CorpusBuilder, BuilderConfig
>>> builder = CorpusBuilder(BuilderConfig(embed=True))
>>> result = builder.build("./data/books/")
>>> print(result.summary())
>>> lc_docs = builder.to_langchain()
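The developer note says that components are created lazily on first use and then cached. A minimal sketch of that general pattern follows, using functools.cached_property. The class LazyBuilderSketch, its _make helper, and the component names are hypothetical stand-ins chosen for illustration; the source does not show how CorpusBuilder implements this.

```python
from functools import cached_property


class LazyBuilderSketch:
    """Hypothetical stand-in; not the real CorpusBuilder."""

    def __init__(self):
        self.created = []  # records which components were instantiated

    def _make(self, name):
        # Stand-in for an expensive constructor (model load, index setup, ...).
        self.created.append(name)
        return object()

    @cached_property
    def embedder(self):
        # Runs once, on first attribute access; the result is then cached.
        return self._make("embedder")

    @cached_property
    def normalizer(self):
        return self._make("normalizer")


builder = LazyBuilderSketch()
assert builder.created == []             # nothing is built at construction time
first = builder.embedder                 # first use creates the component
assert builder.embedder is first         # later uses return the cached instance
assert builder.created == ["embedder"]   # the normalizer was never touched
```

With this pattern, a component that a given configuration never needs is never constructed.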
- add(sources, *, source_title=None, source_author=None, source_type=None, collection_id=None, rebuild_index=True)
Add sources to an existing corpus without re-processing the documents already in it.
Incrementally ingests new sources and appends their documents to the existing BuildResult. Optionally rebuilds the similarity index to include the new documents.
- Parameters:
- sources : str, Path, or Sequence[str | Path]
File path(s), directory path(s), or URL(s) to add.
- source_title : str or None, optional
Override title for new sources.
- source_author : str or None, optional
Override author for new sources.
- source_type : str or None, optional
Override source_type for new sources (e.g. "audio"). When None, the type is inferred from each file extension. Default: None.
- collection_id : str or None, optional
Override collection id for new sources.
- rebuild_index : bool, optional
When True and config.build_index is enabled, rebuild the similarity index with all documents (existing + new). Default: True.
- Returns:
- BuildResult
The updated result containing all documents.
- Raises:
- RuntimeError
If build has not been called yet.
- ValueError
If no valid sources are found.
Notes
User note: Use this to extend a corpus after the initial build():
    builder = CorpusBuilder(config)
    result = builder.build("./initial_data/")
    result = builder.add("./new_data/")
    result = builder.add("https://example.com/article")
Developer note: Normalisation, enrichment, and embedding are applied to the new documents only. The index is rebuilt from scratch over all documents because incremental index updates are not supported by all backends.
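The behaviour described in the developer note (process only the new documents, then rebuild the index over everything) can be sketched with a toy in-memory corpus. TinyCorpusSketch, its _process step, and its token-to-position index are hypothetical and exist only to illustrate the control flow and the documented error conditions. They are not the library's implementation.

```python
class TinyCorpusSketch:
    """Hypothetical stand-in for the add() flow; not the real CorpusBuilder."""

    def __init__(self):
        self.documents = None   # None until build() has run
        self.index = None
        self.processed = 0      # counts documents that went through processing

    def _process(self, texts):
        # Stand-in for normalise -> enrich -> embed on a batch of documents.
        self.processed += len(texts)
        return [t.strip().lower() for t in texts]

    def _rebuild_index(self):
        # Full rebuild over every document: token -> set of document positions.
        self.index = {}
        for pos, doc in enumerate(self.documents):
            for token in doc.split():
                self.index.setdefault(token, set()).add(pos)

    def build(self, texts):
        if not texts:
            raise ValueError("no valid sources")
        self.documents = self._process(texts)
        self._rebuild_index()
        return self.documents

    def add(self, texts, *, rebuild_index=True):
        if self.documents is None:
            raise RuntimeError("call build() before add()")
        if not texts:
            raise ValueError("no valid sources")
        self.documents.extend(self._process(texts))  # only the new documents
        if rebuild_index:
            self._rebuild_index()                     # existing + new
        return self.documents


corpus = TinyCorpusSketch()
corpus.build(["To be ", "or not"])
corpus.add(["To sleep"])
print(corpus.processed)            # → 3 (each document processed exactly once)
print(sorted(corpus.index["to"]))  # → [0, 2]
```

Passing rebuild_index=False in this sketch leaves the index stale until the next rebuild, which is the trade-off the parameter exposes.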
- build(sources, *, source_title=None, source_author=None, collection_id=None)
Build a corpus from one or more sources.
- Parameters:
- sources : str, Path, or Sequence[str | Path]
File path(s), directory path(s), or URL(s). Accepts:
  - A single file path: "hamlet.txt"
  - A directory: "./papers/" (recursive)
  - A URL: "https://example.com/article"
  - A list of any mix: ["a.pdf", "b.mp4", "https://..."]
- source_title : str or None, optional
Override config.source_title for this build.
- source_author : str or None, optional
Override config.source_author.
- collection_id : str or None, optional
Override config.collection_id.
- Returns:
- BuildResult
The build result with documents, counts, and index.
- Raises:
- ValueError
If no valid sources are found.
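The accepted forms of sources (a single path, a directory walked recursively, a URL, or a mixed list) could be normalised into one flat list along these lines. The helper expand_sources is hypothetical, and its rule of silently skipping missing paths is an assumption; the source only states that ValueError is raised when no valid sources are found.

```python
from pathlib import Path


def expand_sources(sources):
    """Hypothetical helper: flatten `sources` into a list of files and URLs."""
    # A single str or Path is treated as a one-element list.
    if isinstance(sources, (str, Path)):
        sources = [sources]

    expanded = []
    for src in sources:
        text = str(src)
        if text.startswith(("http://", "https://")):
            expanded.append(text)                # URLs pass through unchanged
            continue
        path = Path(src)
        if path.is_dir():
            # Directories are walked recursively; only files are kept.
            expanded.extend(str(p) for p in sorted(path.rglob("*")) if p.is_file())
        elif path.is_file():
            expanded.append(str(path))
        # Assumption: anything else (a missing path, for example) is skipped.

    if not expanded:
        raise ValueError("no valid sources found")
    return expanded


print(expand_sources("https://example.com/article"))
# → ['https://example.com/article']
```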
- close()
Clean up temporary files created during downloads/extraction.
Notes
Safe to call multiple times. After calling, the builder can still be used; a new temporary directory is created on the next download.
- Return type:
None
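The contract in the notes (repeated calls are safe, and the object stays usable afterwards) is a common idempotent-cleanup pattern. Below is a minimal sketch of it using tempfile and shutil from the standard library. TempDirOwnerSketch and its download method are hypothetical; how the real builder manages its temporary directory is not shown in the source.

```python
import os
import shutil
import tempfile


class TempDirOwnerSketch:
    """Hypothetical stand-in showing an idempotent close(); not the real builder."""

    def __init__(self):
        self._tmpdir = None  # created lazily, on the first download

    def _ensure_tmpdir(self):
        if self._tmpdir is None:
            self._tmpdir = tempfile.mkdtemp(prefix="corpus-sketch-")
        return self._tmpdir

    def download(self, name, payload):
        # Stand-in for a real download: write the payload into the temp dir.
        path = os.path.join(self._ensure_tmpdir(), name)
        with open(path, "wb") as fh:
            fh.write(payload)
        return path

    def close(self):
        # No-op when there is nothing to clean up, so repeated calls are safe.
        if self._tmpdir is not None:
            shutil.rmtree(self._tmpdir, ignore_errors=True)
            self._tmpdir = None


owner = TempDirOwnerSketch()
first = owner.download("a.bin", b"data")
owner.close()
owner.close()                                # second call does nothing
second = owner.download("b.bin", b"data")    # a fresh temp directory is created
assert not os.path.exists(first) and os.path.exists(second)
owner.close()
```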
- search(query, *, top_k=10, match_mode='hybrid', **kwargs)
Search the built corpus.
- Parameters:
- query : str
Natural language query.
- top_k : int, optional
Maximum number of results to return. Default: 10.
- match_mode : str, optional
One of "strict", "keyword", "semantic", or "hybrid". Default: "hybrid".
- **kwargs
Additional SearchConfig parameters.
- Returns:
- list[SearchResult]
Ranked results.
- Raises:
- RuntimeError
If no index has been built.
- RuntimeError
If match_mode is "semantic" or "hybrid" and no embedding engine is configured.
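The two RuntimeError conditions documented for search() can be written out as explicit checks. The helper check_search_preconditions and its boolean arguments are hypothetical. The ValueError for an unknown mode is an assumption, since the source does not say how unrecognised match_mode values are handled.

```python
VALID_MATCH_MODES = ("strict", "keyword", "semantic", "hybrid")


def check_search_preconditions(match_mode, *, index_built, has_embedder):
    """Hypothetical helper mirroring the documented search() error conditions."""
    if match_mode not in VALID_MATCH_MODES:
        # Assumption: the source does not say how unknown modes are handled.
        raise ValueError(f"unknown match_mode: {match_mode!r}")
    if not index_built:
        raise RuntimeError("no index has been built")
    if match_mode in ("semantic", "hybrid") and not has_embedder:
        raise RuntimeError(f"match_mode={match_mode!r} needs an embedding engine")


# Per the Raises section, only "semantic" and "hybrid" require embeddings.
check_search_preconditions("keyword", index_built=True, has_embedder=False)
try:
    check_search_preconditions("hybrid", index_built=True, has_embedder=False)
except RuntimeError as exc:
    print(exc)
```

Because the default match_mode is "hybrid", the Raises section implies that a corpus built without an embedding engine would need an explicit match_mode="keyword" or "strict" to be searchable.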
- to_huggingface()
Export the corpus as a HuggingFace Dataset.
- Returns:
- datasets.Dataset or dict[str, list]
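The return type "datasets.Dataset or dict[str, list]" suggests a plain column dictionary as a fallback when the optional datasets package is not installed. That reading is an assumption, as are the helper names and the "text" and "source" fields in the sketch below. Dataset.from_dict is the real constructor from the datasets library.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a column dict (dict[str, list])."""
    # Column names are taken from the first row; every row must have them.
    keys = list(rows[0]) if rows else []
    return {key: [row[key] for row in rows] for key in keys}


def to_dataset_or_columns(rows):
    """Hypothetical export: a Dataset if `datasets` is importable, else a dict."""
    columns = rows_to_columns(rows)
    try:
        from datasets import Dataset  # optional dependency
    except ImportError:
        return columns
    return Dataset.from_dict(columns)


rows = [
    {"text": "To be or not to be", "source": "hamlet.txt"},
    {"text": "Put out the light", "source": "othello.txt"},
]
print(rows_to_columns(rows)["source"])  # → ['hamlet.txt', 'othello.txt']
```

Callers that must handle both outcomes can branch on isinstance(result, dict).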
- to_langchain_retriever()
Create a LangChain-compatible retriever.
- Returns:
- LangChainCorpusRetriever