create_corpus

scikitplot.corpus.create_corpus(input_file, output_path, *, chunker=None, filter_=None, normalizer=None, enricher=None, filename_override=None, export_format=ExportFormat.CSV, default_language=None)

Create and export a corpus from a single source file.

Convenience wrapper around CorpusPipeline for the common single-file, single-output use case. Directly replaces remarx’s create_corpus() function.

Parameters:
input_file : pathlib.Path or str

Path to the input file (local) or an http(s):// URL string.

output_path : pathlib.Path or str

Path for the exported corpus file.

chunker : ChunkerBase or None, optional

Text chunker. Default: None (one document per raw chunk).

filter_ : FilterBase or None, optional

Document filter. Default: None (uses DefaultFilter).

normalizer : TextNormalizer or None, optional

When provided, normalized_text is populated on every document after chunking/filtering. Cleans OCR noise, ligatures, and whitespace artefacts before embedding. Default: None (skip).

enricher : NLPEnricher or None, optional

When provided, NLP fields (tokens, lemmas, stems, keywords, and optional extended metadata) are populated on every document after normalisation. Supports 200+ world languages via the enricher's language setting. Default: None (skip).

filename_override : str or None, optional

Override the source_file label in generated documents.

export_format : ExportFormat, optional

Output format. Default: CSV.

default_language : str or list[str] or None, optional

ISO 639-1 code(s) or NLTK language name(s) applied when the reader cannot detect language. Accepts "en", "english", ["en", "ar"], or None (auto-detect). Default: None.
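
Because input_file may be either a local path or an http(s):// URL string, a caller may want to tell the two apart before invoking create_corpus. The following is a minimal sketch using only the standard library; the helper name is_remote_source is hypothetical and not part of scikitplot, and it does not claim to mirror how the library itself makes this decision.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_remote_source(input_file):
    """Return True for http(s):// URL strings, False for local paths.

    Hypothetical helper for illustration; not part of scikitplot.
    """
    # Path objects are always treated as local files.
    if isinstance(input_file, Path):
        return False
    # urlparse lowercases the scheme; a bare filename has an empty scheme.
    scheme = urlparse(str(input_file)).scheme
    return scheme in ("http", "https")
```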

Returns:
PipelineResult

Immutable summary including the document list, counts, timing, and output path.
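
To illustrate what an immutable summary means in practice, here is a small sketch built on a frozen dataclass. ResultSketch is a hypothetical stand-in, not the real PipelineResult. Only the documents attribute appears in the examples on this page; the other field names are assumptions chosen to match the description above.

```python
from dataclasses import FrozenInstanceError, dataclass
from pathlib import Path


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in for an immutable pipeline summary."""

    documents: tuple          # the document list
    n_documents: int          # assumed name for a count field
    elapsed_seconds: float    # assumed name for a timing field
    output_path: Path         # where the corpus was exported


summary = ResultSketch(
    documents=("first chunk", "second chunk"),
    n_documents=2,
    elapsed_seconds=0.01,
    output_path=Path("output/chapter01.csv"),
)

try:
    # Frozen dataclasses reject attribute assignment after construction.
    summary.n_documents = 3
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
```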

Return type:

PipelineResult

Examples

Basic single-file corpus:

>>> from pathlib import Path
>>> from scikitplot.corpus import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )
>>> len(result.documents)
312
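
With the default export_format, the file written to output_path is a CSV. One way to sanity-check such a file is to read it back with the standard library, as in the sketch below. It assumes the export is UTF-8 encoded and starts with a header row; the helper name count_csv_rows is hypothetical, and the actual column names depend on the exporter, so none are assumed here.

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path):
    """Return (column names, number of data rows) for a CSV with a header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Each iteration yields one data row; the header is consumed separately.
        n_rows = sum(1 for _ in reader)
        # fieldnames is taken from the first line; it is None for an empty file.
        columns = list(reader.fieldnames or [])
    return columns, n_rows
```

For example, count_csv_rows("output/chapter01.csv") would return the header names together with the number of exported rows.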

With normalisation and NLP enrichment:

>>> from scikitplot.corpus import TextNormalizer, NLPEnricher, EnricherConfig
>>> result = create_corpus(
...     input_file=Path("scan.png"),
...     output_path=Path("output/scan.csv"),
...     normalizer=TextNormalizer(),
...     enricher=NLPEnricher(
...         EnricherConfig(
...             language="en",
...             keyword_extractor="tfidf",
...             sentence_count=True,
...             char_count=True,
...         )
...     ),
... )
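
The normalizer description mentions cleaning ligatures and whitespace artefacts. As a rough illustration of that kind of cleanup, the sketch below uses only the standard library. It is not TextNormalizer's implementation, which this page does not show, and the function name normalize_sketch is hypothetical.

```python
import re
import unicodedata


def normalize_sketch(text):
    """Fold compatibility characters (e.g. ligatures) and collapse whitespace."""
    # NFKC maps compatibility characters such as the "fi" ligature (U+FB01)
    # and the no-break space (U+00A0) to their plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", folded).strip()
```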

Multi-language corpus:

>>> result = create_corpus(
...     input_file=Path("multilang.txt"),
...     output_path=Path("output/multilang.csv"),
...     enricher=NLPEnricher(EnricherConfig(language=["en", "ar", "hi"])),
... )
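
Language values in this API may be a single code, an NLTK-style name, a list, or None, as described for default_language. A minimal sketch of coercing such a value to a uniform list of codes follows. The helper coerce_languages and its deliberately tiny name-to-code table are hypothetical illustrations, not scikitplot's actual logic.

```python
# Hypothetical, deliberately tiny name-to-code table for illustration only.
_NAME_TO_CODE = {"english": "en", "arabic": "ar", "hindi": "hi"}


def coerce_languages(value):
    """Return a list of lowercase codes, or None to signal auto-detection."""
    if value is None:
        return None
    # A bare string is treated as a single language, not a sequence of letters.
    items = [value] if isinstance(value, str) else list(value)
    codes = []
    for item in items:
        key = item.strip().lower()
        # Known names map to codes; anything else is passed through unchanged.
        codes.append(_NAME_TO_CODE.get(key, key))
    return codes
```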