BuilderFactories
- class scikitplot.corpus.BuilderFactories(reader_factory=None, chunker_factory=None, filter_factory=None, normalizer_factory=None, enricher_factory=None, embedding_engine_factory=None)
Component factory callables for FactoryCorpusBuilder.
Each factory replaces the corresponding lazy-creation method in CorpusBuilder. None means “use the default from BuilderConfig”.
- Parameters:
- reader_factory : callable or None, optional
Factory for DocumentReader. Called once per source. Receives (source: str | Path, chunker, **reader_kwargs) -> DocumentReader. Signature:
def reader_factory(
    source: str | pathlib.Path,
    chunker: ChunkerBase | None,
    **reader_kwargs: Any,
) -> DocumentReader: ...
- chunker_factory : callable or None, optional
Factory for the chunker. Called once at build time. No arguments. Signature:
def chunker_factory() -> ChunkerBase | None: ...
- filter_factory : callable or None, optional
Factory for the FilterBase. Called once at build time. No arguments. Signature:
def filter_factory() -> FilterBase | None: ...
- normalizer_factory : callable or None, optional
Factory for the NormalizationPipeline. Called once at build time. No arguments. Signature:
def normalizer_factory() -> NormalizationPipeline | None: ...
- enricher_factory : callable or None, optional
Factory for the enricher. Called once at build time. No arguments. Signature:
def enricher_factory() -> NLPEnricher | None: ...
- embedding_engine_factory : callable or None, optional
Factory for the embedding engine. Called once at build time. No arguments. Signature:
def embedding_engine_factory() -> EmbeddingEngine | None: ...
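Because every factory except reader_factory is called with no arguments, any settings have to be bound before the factory is handed over. The sketch below shows one way to do that with functools.partial. StubChunker is a hypothetical stand-in written for this illustration, not a class from the library, and the fields max_chars and overlap are assumptions.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class StubChunker:
    """Hypothetical stand-in for a ChunkerBase subclass (not a library class)."""

    max_chars: int = 1000
    overlap: int = 0


# The factory must be callable with zero arguments, so the settings are
# bound up front instead of being passed at call time.
chunker_factory = partial(StubChunker, max_chars=500, overlap=50)


def disabled_factory():
    """The documented signatures also allow a factory to return None."""
    return None


# Each call builds a fresh, fully configured instance.
chunker = chunker_factory()
print(chunker)
```

A callable built this way could then be passed as BuilderFactories(chunker_factory=chunker_factory). How the builder treats a factory that returns None is not stated in this section, so that case is shown only because the signatures permit it.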
Notes
User note: Factories take precedence over the corresponding BuilderConfig settings. For example, if chunker_factory is set, BuilderConfig.chunker is ignored for chunker creation.
Examples
Use a custom reader factory that injects a per-source language code:
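from pathlib import Path

from langdetect import detect

# DocumentReader, BuilderFactories and FactoryCorpusBuilder are assumed to be
# imported from scikitplot.corpus.

def smart_reader_factory(source, chunker, **kw):
    lang = None
    if Path(source).is_file():
        # Sample the first 200 characters of the file to guess its language.
        with open(source, encoding="utf-8", errors="ignore") as fh:
            lang = detect(fh.read(200))
    return DocumentReader.create(source, chunker=chunker, default_language=lang)

factories = BuilderFactories(reader_factory=smart_reader_factory)
builder = FactoryCorpusBuilder(factories=factories)
result = builder.build("./data/")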
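The precedence rule from the Notes (a factory, when set, wins over the matching BuilderConfig setting, and None falls back to the default) can be pictured with a small stand-alone sketch. The helper resolve_component and the two placeholder builders are hypothetical names introduced here for illustration. This is not the library's actual implementation.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def resolve_component(
    factory: Optional[Callable[[], T]],
    build_default: Callable[[], T],
) -> T:
    """Hypothetical helper: prefer the user factory, else use the config default."""
    if factory is not None:
        return factory()
    return build_default()


def default_chunker() -> str:
    # Placeholder for "the default from BuilderConfig".
    return "chunker-from-config"


def custom_chunker() -> str:
    # Placeholder for a user-supplied chunker_factory.
    return "chunker-from-factory"


with_factory = resolve_component(custom_chunker, default_chunker)
without_factory = resolve_component(None, default_chunker)
print(with_factory, without_factory)
```

Strings stand in for real component objects so the sketch runs without the library installed.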