export_documents
- scikitplot.corpus.export_documents(documents, output_path, fmt, *, include_embedding=True, json_indent=2, parquet_compression='snappy')
Export a list of documents to output_path in the given format.
- Parameters:
- documents : list of CorpusDocument
Documents to export. May be empty (produces an empty file/dataset).
- output_path : pathlib.Path
Destination file or directory path.
File formats (CSV, JSON, JSONL, Pickle, Joblib, NumPy, pandas, Parquet, Polars): path to the output file.
Directory formats (HuggingFace, MLflow): path to the root directory / artifact path.
- fmt : ExportFormat
Target export format.
- include_embedding : bool, optional
When True (default), embedding vectors are included in the output where the format supports them (JSONL, JSON, Pickle, Joblib, NumPy). Embeddings are always included for NumPy. For CSV and Parquet (tabular), embeddings are excluded regardless of this flag to avoid storing variable-length arrays in cells.
- json_indent : int or None, optional
Indentation for JSON output. None produces compact JSON. Default: 2.
- parquet_compression : str, optional
Compression codec for Parquet output ("snappy", "gzip", "brotli", "zstd", "none"). Default: "snappy".
- Returns:
- pathlib.Path
The path that was written to (same as output_path).
- Raises:
- ValueError
If fmt is ExportFormat.NUMPY and no documents have embeddings, or if the embedding dimensions are inconsistent.
- ImportError
If the required optional library for the format is not installed.
- OSError
If the output directory cannot be created or the file cannot be written.
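The two ValueError conditions listed for the NumPy format can be illustrated with a small stand-alone check. This is only a sketch of the idea, not the library's code: `check_embeddings` is a hypothetical helper, embeddings are modelled as plain Python sequences, and skipping documents whose embedding is None is an assumption made for the illustration.

```python
from typing import Optional, Sequence


def check_embeddings(embeddings: Sequence[Optional[Sequence[float]]]) -> int:
    """Return the shared embedding dimension, or raise ValueError.

    Hypothetical helper: mirrors the two documented failure cases
    (no embeddings at all, or embeddings of differing lengths).
    """
    # Assumption: documents without an embedding are represented by None
    # and are ignored when collecting vectors.
    present = [vec for vec in embeddings if vec is not None]
    if not present:
        raise ValueError("no documents have embeddings")

    # Every remaining vector must have the same length to form a matrix.
    dims = {len(vec) for vec in present}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()
```

Running a check of this kind before calling the exporter lets a caller report the offending documents itself instead of handling the exception afterwards.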
See also
scikitplot.corpus._schema.ExportFormat
Enumeration of all formats.
Notes
Atomic writes: All file-based formats are written to a .tmp sibling first, then renamed atomically. Interrupted exports leave no partial files at the final path.
Embedding in tabular formats: CSV and Parquet omit embeddings because storing a float32 vector per row in a tabular cell is impractical. Use PICKLE, JOBLIB, or NUMPY to preserve embeddings.
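The write-then-rename pattern described under "Atomic writes" can be sketched with the standard library alone. This is a generic illustration, not the library's implementation: `atomic_write_text` is a hypothetical helper, and appending ".tmp" to the file name is an assumed naming scheme for the sibling file.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> Path:
    """Write text to a .tmp sibling, then rename it over the final path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")  # assumed sibling name
    try:
        with open(tmp_path, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed write must not leave the temporary file behind.
        tmp_path.unlink(missing_ok=True)
        raise
    return path


# Usage: write one JSONL line and confirm no .tmp file remains.
with tempfile.TemporaryDirectory() as tmp_dir:
    written = atomic_write_text(Path(tmp_dir) / "corpus.jsonl", '{"text": "hello"}\n')
    content = written.read_text(encoding="utf-8")
    leftover = written.with_name(written.name + ".tmp").exists()
```

Because the rename is the last step, a reader of the final path sees either the previous file or the complete new one, never a half-written file.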
Examples
CSV export (zero dependencies):
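>>> from pathlib import Path
>>> from scikitplot.corpus import export_documents
>>> from scikitplot.corpus._schema import ExportFormat
>>> export_documents(docs, Path("corpus.csv"), ExportFormat.CSV)
PosixPath('corpus.csv')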
JSONL with embeddings:
>>> export_documents(
...     docs,
...     Path("corpus.jsonl"),
...     ExportFormat.JSONL,
...     include_embedding=True,
... )
NumPy embedding matrix:
>>> export_documents(docs, Path("embeddings.npy"), ExportFormat.NUMPY)
PosixPath('embeddings.npy')
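The effect of json_indent can be previewed with the standard json module. The sketch below assumes the parameter is passed through to the `indent` argument of `json.dumps`, which this page does not confirm; the record and its field names are hypothetical.

```python
import json

# Hypothetical record; the real document schema may use other field names.
record = {"doc_id": "a1", "text": "hello"}

# indent=2 spreads the object over several lines (like json_indent=2).
pretty = json.dumps(record, indent=2)

# indent=None keeps everything on a single line (like json_indent=None).
compact = json.dumps(record, indent=None)

print(pretty)
print(compact)
```

Both strings decode to the same object; the indented form is easier to read and diff, while the single-line form is smaller on disk.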