create_corpus

scikitplot.corpus.create_corpus(input_file, output_path, *, chunker=None, filter_=None, normalizer=None, enricher=None, filename_override=None, export_format=ExportFormat.CSV, default_language=None)

Create and export a corpus from a single source file.

Convenience wrapper around CorpusPipeline for the common single-file, single-output use case. Directly replaces remarx’s create_corpus() function.

Parameters:

input_file (pathlib.Path or str): Path to the input file (local) or an http(s):// URL string.
output_path (pathlib.Path or str): Path for the exported corpus file.
chunker (ChunkerBase or None, optional): Text chunker. Default: None (one doc per raw chunk).
filter_ (FilterBase or None, optional): Document filter. Default: None (DefaultFilter).
normalizer (TextNormalizer or None, optional): When provided, normalized_text is populated on every document after chunking/filtering. Cleans OCR noise, ligatures, and whitespace artefacts before embedding. Default: None (skip).
enricher (NLPEnricher or None, optional): When provided, NLP fields (tokens, lemmas, stems, keywords, and optional extended metadata) are populated on every document after normalisation. Supports 200+ world languages via the language setting. Default: None (skip).
filename_override (str or None, optional): Override the source_file label in generated documents.
export_format (ExportFormat, optional): Output format. Default: CSV.
default_language (str or list[str] or None, optional): ISO 639-1 code(s) or NLTK language name(s) applied when the reader cannot detect language. Accepts "en", "english", ["en", "ar"], or None (auto-detect). Default: None.
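
The three accepted shapes of default_language (a single string, a list of strings, or None) can be brought to one form before calling create_corpus. The following is a minimal sketch only: normalize_languages is a hypothetical helper, not part of scikitplot, and the lowercasing and whitespace stripping are assumptions rather than documented library behaviour.

```python
def normalize_languages(default_language):
    """Coerce str | list[str] | None into a list of lowercase names, or None.

    Hypothetical helper for illustration; not part of scikitplot.
    """
    if default_language is None:
        return None  # None means "leave it to auto-detection"
    if isinstance(default_language, str):
        default_language = [default_language]  # wrap a single code or name
    # Assumption: codes and names are compared case-insensitively.
    return [lang.strip().lower() for lang in default_language]


print(normalize_languages("English"))      # ['english']
print(normalize_languages(["en", "ar"]))   # ['en', 'ar']
print(normalize_languages(None))           # None
```

Keeping None distinct from an empty list preserves the difference between "auto-detect" and "no languages given".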

Returns:

PipelineResult: Immutable summary including the document list, counts, timing, and output path.
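
A quick sanity check after a CSV export is to compare the number of data rows in the output file with len(result.documents). The sketch below uses only the standard library. It assumes the export writes one header row followed by one row per document, which this page does not state, and count_csv_rows is a hypothetical helper name.

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path):
    """Count data rows in a CSV file, excluding the first (header) row.

    Hypothetical helper for illustration; not part of scikitplot.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row; assumes one is present
        return sum(1 for _ in reader)
```

Parsing with the csv module rather than counting lines matters here because a document's text may contain embedded newlines inside a quoted field.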

Parameters:

input_file (Path | str)
output_path (Path | str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
normalizer (TextNormalizer | None)
enricher (NLPEnricher | None)
filename_override (str | None)
export_format (ExportFormat)
default_language (str | list | None)
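
Because input_file accepts either a path or an http(s):// URL string, calling code sometimes needs to tell the two apart, for example to decide whether to check that a local file exists first. The following is a rough sketch using the standard library. classify_input is a hypothetical helper and does not describe how create_corpus itself makes this decision.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(input_file):
    """Return "url" for http(s) strings, otherwise "local".

    Hypothetical helper for illustration; not part of scikitplot.
    """
    if isinstance(input_file, Path):
        return "local"  # Path objects are always treated as local files
    scheme = urlparse(str(input_file)).scheme.lower()
    return "url" if scheme in ("http", "https") else "local"


print(classify_input("https://example.org/chapter01.txt"))  # url
print(classify_input(Path("chapter01.txt")))                # local
print(classify_input("chapter01.txt"))                      # local
```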

Return type:

PipelineResult

Examples

Basic single-file corpus:

>>> from pathlib import Path
>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )
>>> len(result.documents)
312

With normalisation and NLP enrichment:

>>> from scikitplot.corpus import TextNormalizer, NLPEnricher, EnricherConfig
>>> result = create_corpus(
...     input_file=Path("scan.png"),
...     output_path=Path("output/scan.csv"),
...     normalizer=TextNormalizer(),
...     enricher=NLPEnricher(
...         EnricherConfig(
...             language="en",
...             keyword_extractor="tfidf",
...             sentence_count=True,
...             char_count=True,
...         )
...     ),
... )

Multi-language corpus:

>>> result = create_corpus(
...     input_file=Path("multilang.txt"),
...     output_path=Path("output/multilang.csv"),
...     enricher=NLPEnricher(EnricherConfig(language=["en", "ar", "hi"])),
... )