BuilderConfig#
- class scikitplot.corpus.BuilderConfig(chunker='sentence', chunker_kwargs=<factory>, normalize=True, normalizer_steps=<factory>, normalizer_kwargs=<factory>, enrich=False, enricher_kwargs=<factory>, embed=False, embedding_model='all-MiniLM-L6-v2', embedding_kwargs=<factory>, build_index=False, index_kwargs=<factory>, source_title=None, source_author=None, source_type=None, collection_id=None, default_language=None, filter_kwargs=<factory>, max_download_bytes=524288000, download_timeout=120, download_max_retries=3, download_retry_backoff=1.0, max_archive_files=10000, max_archive_bytes=2147483648, probe_url_content_type=True, probe_url_timeout=15, max_workers=1)[source]#
Configuration for CorpusBuilder.

- Parameters:
- chunker : str or object
One of "sentence", "paragraph", "fixed_window", "word"; or a pre-configured chunker instance (either a ChunkerBase subclass or a new-style chunker, which is auto-bridged).
- chunker_kwargs : dict[str, Any]
Keyword arguments passed to the chunker constructor (ignored if chunker is already an instance).
- normalize : bool
Run the normalisation pipeline after filtering.
- normalizer_steps : list[str]
Normaliser names: "unicode", "whitespace", "html_strip", "lowercase", "dedup_lines". Default: ["unicode", "whitespace"].
- normalizer_kwargs : dict[str, Any]
Kwargs for NormalizerConfig, used by the TextNormalizer run after filtering.
- enrich : bool
Run NLPEnricher after normalisation.
- enricher_kwargs : dict[str, Any]
Kwargs for EnricherConfig.
- embed : bool
Run EmbeddingEngine after enrichment.
- embedding_model : str
Model name for EmbeddingEngine.
- embedding_kwargs : dict[str, Any]
Kwargs for the EmbeddingEngine constructor.
- build_index : bool
Build a SimilarityIndex after embedding.
- index_kwargs : dict[str, Any]
Kwargs for SearchConfig.
- source_title : str or None
Default source_title for all documents.
- source_author : str or None
Default source_author for all documents.
- source_type : str or None
Default source_type (e.g., "book", "movie").
- collection_id : str or None
Group identifier for this corpus build.
- default_language : str or None
ISO 639-1 language code.
- filter_kwargs : dict[str, Any]
Kwargs for DefaultFilter.
- max_workers : int
Parallelism for multi-file ingestion.
- probe_url_content_type : bool
When True (default), extensionless URLs are probed with an HTTP HEAD request to infer the correct reader before downloading. Disable to save a round-trip when all URLs have file extensions.
- probe_url_timeout : int
HTTP timeout in seconds for probe_url_kind calls. Default: 15.
- Parameters:
normalize (bool)
enrich (bool)
embed (bool)
embedding_model (str)
build_index (bool)
source_title (str | None)
source_author (str | None)
source_type (str | None)
collection_id (str | None)
default_language (str | None)
max_download_bytes (int)
download_timeout (int)
download_max_retries (int)
download_retry_backoff (float)
max_archive_files (int)
max_archive_bytes (int)
probe_url_content_type (bool)
probe_url_timeout (int)
max_workers (int)
Notes
User note: Most users need only:
config = BuilderConfig(chunker="sentence", embed=True)
Everything else has sensible defaults.
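Building on the minimal example above, a fuller configuration that enables the whole pipeline might look like the following sketch. The attribute names come from the signature above; the specific values (step list, collection id, and so on) are illustrative choices, not recommendations:

```python
# Sketch: sentence chunking, extra normalisation steps, embeddings,
# a similarity index, and default metadata stamped on every document.
from scikitplot.corpus import BuilderConfig

config = BuilderConfig(
    chunker="sentence",
    normalize=True,
    normalizer_steps=["unicode", "whitespace", "html_strip"],
    embed=True,
    embedding_model="all-MiniLM-L6-v2",
    build_index=True,
    source_type="book",
    collection_id="my-corpus-v1",  # illustrative group identifier
    default_language="en",         # ISO 639-1 code
)
```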
- download_max_retries: int = 3#
Maximum retry attempts for transient HTTP errors (429, 500, 502, 503, 504) during URL downloads. Set to 0 to disable retries. Default: 3.
- download_retry_backoff: float = 1.0#
Base delay in seconds for exponential back-off between download retries. Actual wait = download_retry_backoff * 2 ** attempt. Default: 1.0.
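The retry schedule implied by these two fields can be sketched in plain Python. This mirrors the formula above only; it is not the library's internal implementation:

```python
def retry_waits(max_retries: int = 3, backoff: float = 1.0) -> list[float]:
    """Delays in seconds before each retry: backoff * 2 ** attempt."""
    return [backoff * 2 ** attempt for attempt in range(max_retries)]

# With the defaults, the three retries wait 1, 2, then 4 seconds.
print(retry_waits())  # [1.0, 2.0, 4.0]
```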
- max_archive_bytes: int = 2147483648#
Maximum cumulative extracted size per archive. Default: 2147483648 (2 GB).
- probe_url_content_type: bool = True#
Probe extensionless URLs with a HEAD request to determine the correct reader. When True (default), any URL that classify_url classifies as WEB_PAGE and has no file extension in its path is probed via probe_url_kind before routing. Set to False to skip the extra network round-trip (e.g. when all your URLs already carry file extensions or you want pure-offline operation). Default: True.
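The "does this URL need a probe?" decision can be sketched as below. classify_url and probe_url_kind are library internals not reproduced here, so this standalone sketch shows only the file-extension check that decides whether a HEAD probe would be issued:

```python
from urllib.parse import urlparse
from posixpath import splitext

def needs_probe(url: str, probe_url_content_type: bool = True) -> bool:
    """True when probing is enabled and the URL path has no file extension."""
    if not probe_url_content_type:
        return False
    path = urlparse(url).path
    return splitext(path)[1] == ""

print(needs_probe("https://example.com/articles/42"))      # True: no extension
print(needs_probe("https://example.com/data/corpus.txt"))  # False: .txt routes directly
```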