create_corpus

scikitplot.corpus.create_corpus(input_file, output_path, *, chunker=None, filter_=None, filename_override=None, export_format=ExportFormat.CSV, default_language=None)

Create and export a corpus from a single source file.

Convenience wrapper around CorpusPipeline for the common single-file, single-output use case. Directly replaces remarx’s create_corpus() function.

Parameters:
input_file : pathlib.Path or str

Path to the input file.

output_path : pathlib.Path or str

Path for the exported corpus file.

chunker : ChunkerBase or None, optional

Text chunker. Default: None (one doc per raw chunk).

filter_ : FilterBase or None, optional

Document filter. Default: None (DefaultFilter).

filename_override : str or None, optional

Override the source_file label.

export_format : ExportFormat, optional

Output format. Default: ExportFormat.CSV.

default_language : str or None, optional

ISO 639-1 language code. Default: None.

Returns:
PipelineResult

Return type:
PipelineResult

Examples

>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )
>>> len(result.documents)
312
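
The single-file, single-output flow described above (read one file, chunk it, filter the chunks, export them) can be illustrated with the standard library alone. The following is a minimal sketch, not scikit-plot's implementation: `create_corpus_sketch`, `split_paragraphs`, `keep_non_empty`, and the CSV column names are all hypothetical stand-ins chosen for illustration, and the real CorpusPipeline, ChunkerBase, FilterBase, and PipelineResult may behave differently.

```python
import csv
import tempfile
from pathlib import Path

# NOTE: everything in this block is a hypothetical stand-in, not scikit-plot code.


def split_paragraphs(text):
    """Hypothetical chunker: one chunk per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n")]


def keep_non_empty(chunk):
    """Hypothetical default filter: drop chunks that are empty."""
    return bool(chunk)


def create_corpus_sketch(input_file, output_path, *, chunker=None, filter_=None,
                         filename_override=None, default_language=None):
    """Read one file, chunk and filter it, then export the rows as CSV."""
    input_file, output_path = Path(input_file), Path(output_path)
    text = input_file.read_text(encoding="utf-8")

    # Without a chunker, the whole file is treated as a single raw chunk.
    chunks = chunker(text) if chunker is not None else [text]
    keep = filter_ if filter_ is not None else keep_non_empty
    kept = [chunk for chunk in chunks if keep(chunk)]

    # The label defaults to the input file name unless overridden.
    label = filename_override or input_file.name
    fieldnames = ["source_file", "chunk_index", "language", "text"]
    rows = [
        {"source_file": label, "chunk_index": index,
         "language": default_language or "", "text": chunk}
        for index, chunk in enumerate(kept)
    ]

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Usage: build a tiny input file in a temporary directory and export it.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "chapter01.txt"
    source.write_text("First paragraph.\n\nSecond paragraph.\n\n\n\nThird.",
                      encoding="utf-8")
    target = Path(tmp) / "output" / "chapter01.csv"
    documents = create_corpus_sketch(
        source, target,
        chunker=split_paragraphs,
        filename_override="ch01",
        default_language="en",
    )
    with target.open(newline="", encoding="utf-8") as handle:
        exported = list(csv.DictReader(handle))
```

In this sketch the keyword-only arguments mirror the names in the signature above, so the call site reads the same way as a call to create_corpus would. What each argument does inside the real pipeline should be checked against the library itself.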