create_corpus

scikitplot.corpus.create_corpus(input_file, output_path, *, chunker=None, filter_=None, normalizer=None, enricher=None, filename_override=None, export_format=ExportFormat.CSV, default_language=None)
Create and export a corpus from a single source file.
Convenience wrapper around CorpusPipeline for the common single-file, single-output use case. Directly replaces remarx's create_corpus() function.

- Parameters:
  - input_file : pathlib.Path or str
    Path to the input file (local) or an http(s):// URL string.
  - output_path : pathlib.Path or str
    Path for the exported corpus file.
  - chunker : ChunkerBase or None, optional
    Text chunker. Default: None (one doc per raw chunk).
  - filter_ : FilterBase or None, optional
    Document filter. Default: None (DefaultFilter).
  - normalizer : TextNormalizer or None, optional
    When provided, normalized_text is populated on every document after chunking/filtering. Cleans OCR noise, ligatures, and whitespace artefacts before embedding. Default: None (skip).
  - enricher : NLPEnricher or None, optional
    When provided, NLP fields (tokens, lemmas, stems, keywords, and optional extended metadata) are populated on every document after normalisation. Supports 200+ world languages via language. Default: None (skip).
  - filename_override : str or None, optional
    Override the source_file label in generated documents.
  - export_format : ExportFormat, optional
    Output format. Default: ExportFormat.CSV.
  - default_language : str or list[str] or None, optional
    ISO 639-1 code(s) or NLTK language name(s) applied when the reader cannot detect the language. Accepts "en", "english", ["en", "ar"], or None (auto-detect). Default: None.
- Returns:
- PipelineResult
Immutable summary including the document list, counts, timing, and output path.
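The default_language parameter accepts a single code, an NLTK language name, a list of codes, or None. A hypothetical pre-processing helper (not part of scikitplot) sketching how such a value can be normalised into list form before use:

```python
def as_language_list(value):
    """Normalize a default_language-style value into a list of names, or None.

    Hypothetical helper for illustration only; mirrors the accepted forms
    documented above: "en", "english", ["en", "ar"], or None.
    """
    if value is None:
        return None  # auto-detect
    if isinstance(value, str):
        return [value]  # single ISO 639-1 code or NLTK language name
    return list(value)  # already an iterable of codes/names
```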
Examples
Basic single-file corpus:
>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )
>>> len(result.documents)
312
With normalisation and NLP enrichment:
>>> from scikitplot.corpus import TextNormalizer, NLPEnricher, EnricherConfig
>>> result = create_corpus(
...     input_file=Path("scan.png"),
...     output_path=Path("output/scan.csv"),
...     normalizer=TextNormalizer(),
...     enricher=NLPEnricher(
...         EnricherConfig(
...             language="en",
...             keyword_extractor="tfidf",
...             sentence_count=True,
...             char_count=True,
...         )
...     ),
... )
Multi-language corpus:
>>> result = create_corpus(
...     input_file=Path("multilang.txt"),
...     output_path=Path("output/multilang.csv"),
...     enricher=NLPEnricher(EnricherConfig(language=["en", "ar", "hi"])),
... )
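Because the default export format is CSV, the output file can be inspected with the standard library alone. A minimal sketch, assuming the exported column names match the documented field names (e.g. source_file):

```python
import csv
from pathlib import Path

def load_corpus_rows(csv_path):
    """Read an exported corpus CSV into a list of per-document dicts."""
    with Path(csv_path).open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```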