create_corpus
scikitplot.corpus.create_corpus(input_file, output_path, *, chunker=None, filter_=None, filename_override=None, export_format=ExportFormat.CSV, default_language=None)
Create and export a corpus from a single source file.
Convenience wrapper around CorpusPipeline for the common single-file, single-output use case. Directly replaces remarx’s create_corpus() function.
- Parameters:
- input_file : pathlib.Path or str
  Path to the input file.
- output_path : pathlib.Path or str
  Path for the exported corpus file.
- chunker : ChunkerBase or None, optional
  Text chunker. Default: None (one doc per raw chunk).
- filter_ : FilterBase or None, optional
  Document filter. Default: None (DefaultFilter).
- filename_override : str or None, optional
  Override the source_file label.
- export_format : ExportFormat, optional
  Output format. Default: ExportFormat.CSV.
- default_language : str or None, optional
  ISO 639-1 language code. Default: None.
- Returns:
- PipelineResult
Examples
>>> from pathlib import Path
>>> from scikitplot.corpus._pipeline import create_corpus
>>> result = create_corpus(
...     input_file=Path("chapter01.txt"),
...     output_path=Path("output/chapter01.csv"),
... )
>>> len(result.documents)
312
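The optional arguments follow a bare `*` in the signature, so they must be passed by keyword. A minimal sketch of a wrapper that does this is shown here. The helper `export_text_file` and its `corpus_fn` parameter are hypothetical and not part of scikitplot. `corpus_fn` exists only so the call can be exercised without the library installed. Creating the output directory beforehand is an assumption; the documentation above does not say whether create_corpus does this itself.

```python
from pathlib import Path


def export_text_file(src, out_dir, *, language="en", corpus_fn=None):
    """Export one text file to CSV via create_corpus (hypothetical helper)."""
    if corpus_fn is None:
        # Import path taken from the documented signature.
        from scikitplot.corpus import create_corpus as corpus_fn

    src = Path(src)
    out_path = Path(out_dir) / f"{src.stem}.csv"

    # Assumption: the output directory may need to exist before exporting.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Everything after output_path is keyword-only in the signature.
    return corpus_fn(
        src,
        out_path,
        filename_override=src.name,   # label stored as source_file
        default_language=language,    # ISO 639-1 code, e.g. "en"
    )
```

Calling `export_text_file("chapter01.txt", "output")` would then write `output/chapter01.csv` and return whatever create_corpus returns, which the documentation above gives as a PipelineResult.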