export_documents

scikitplot.corpus.export_documents(documents, output_path, fmt, *, include_embedding=True, json_indent=2, parquet_compression='snappy')

Export a list of documents to output_path in the given format.

Parameters:
documents : list of CorpusDocument

Documents to export. May be empty (produces an empty file/dataset).

output_path : pathlib.Path

Destination file or directory path.

  • File formats (CSV, JSON, JSONL, Pickle, Joblib, NumPy, pandas, Parquet, Polars): path to the output file.

  • Directory formats (HuggingFace, MLflow): path to the root directory / artifact path.

fmt : ExportFormat

Target export format.

include_embedding : bool, optional

When True (default), embedding vectors are included in the output for formats that support them (JSON, JSONL, Pickle, Joblib). NumPy output always includes embeddings. CSV and Parquet (tabular) always exclude them, regardless of this flag, to avoid storing variable-length arrays in table cells.

json_indent : int or None, optional

Indentation for JSON output. None produces compact JSON. Default: 2.

parquet_compression : str, optional

Compression codec for Parquet output ("snappy", "gzip", "brotli", "zstd", "none"). Default: "snappy".
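The effect of json_indent can be sketched with the standard library alone. This assumes the JSON exporter forwards the value to the indent argument of Python's json module, which the description suggests but does not confirm. The payload is a hypothetical record, not the actual CorpusDocument schema.

```python
import json

# Hypothetical record; field names are illustrative only.
payload = [{"doc_id": "a", "text": "first chunk"}]

# indent=2 (the documented default): multi-line output, two spaces per level.
pretty = json.dumps(payload, indent=2)

# indent=None: everything on a single line.
compact = json.dumps(payload, indent=None)

print(pretty)
print(compact)
```

Both strings decode to the same data. The indented form is easier to read and diff, while the single-line form is smaller on disk.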

Returns:
pathlib.Path

The path that was written to (same as output_path).

Raises:
ValueError

If fmt is ExportFormat.NUMPY and no documents have embeddings, or if the embedding dimensions are inconsistent.

ImportError

If the required optional library for the format is not installed.

OSError

If the output directory cannot be created or the file cannot be written.

See also

scikitplot.corpus._schema.ExportFormat

Enumeration of all formats.

Notes

Atomic writes: All file-based formats are written to a .tmp sibling first, then renamed atomically. Interrupted exports leave no partial files at the final path.
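The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration of the general technique, not the library's actual implementation. The helper name atomic_write_text and the exact temp-file naming are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> Path:
    """Write text to a .tmp sibling, then rename it over the final path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")  # hypothetical naming scheme
    try:
        with open(tmp_path, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        # os.replace is atomic when both paths live on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # leave no stray temp file behind
        raise
    return path


# Usage: write into a throwaway directory and confirm no .tmp file remains.
with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "out" / "corpus.csv"
    written = atomic_write_text(target, "doc_id,text\n")
    leftover_tmp = target.with_name("corpus.csv.tmp").exists()
    content = target.read_text(encoding="utf-8")
```

Because the rename is the last step, a reader of the final path sees either the previous file or the complete new one, never a half-written file.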

Embedding in tabular formats: CSV and Parquet omit embeddings because storing a float32 vector per row in a tabular cell is impractical. Use JSON, JSONL, PICKLE, JOBLIB, or NUMPY to preserve embeddings.
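To illustrate the distinction, the sketch below serializes the same records once as JSONL, where a list of floats nests naturally, and once as CSV, where only scalar columns are written. The records and their field names (doc_id, text, embedding) are hypothetical and do not reflect the real CorpusDocument schema or the library's writer.

```python
import csv
import io
import json

# Hypothetical records; field names are illustrative only.
records = [
    {"doc_id": "a", "text": "first chunk", "embedding": [0.1, 0.2, 0.3]},
    {"doc_id": "b", "text": "second chunk", "embedding": [0.4, 0.5, 0.6]},
]

# JSONL: one JSON object per line, embedding kept as a nested list.
jsonl_text = "\n".join(json.dumps(record) for record in records) + "\n"

# CSV: scalar columns only; extrasaction="ignore" drops the embedding field.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=["doc_id", "text"], extrasaction="ignore"
)
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

print(jsonl_text)
print(csv_text)
```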

Examples

CSV export (zero dependencies):

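>>> from pathlib import Path
>>> from scikitplot.corpus import export_documents
>>> from scikitplot.corpus._schema import ExportFormat
>>> # docs is a list of CorpusDocument objects built elsewhere
>>> export_documents(docs, Path("corpus.csv"), ExportFormat.CSV)
PosixPath('corpus.csv')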

JSONL with embeddings:

>>> export_documents(
...     docs,
...     Path("corpus.jsonl"),
...     ExportFormat.JSONL,
...     include_embedding=True,
... )
PosixPath('corpus.jsonl')

NumPy embedding matrix:

>>> export_documents(docs, Path("embeddings.npy"), ExportFormat.NUMPY)
PosixPath('embeddings.npy')
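The NumPy export raises ValueError when no documents have embeddings or when embedding dimensions are inconsistent. A pre-flight check along those lines might look like the sketch below. The helper embedding_dimension is hypothetical, and the choice to skip documents whose embedding is None is an assumption. The library's actual handling of such documents is not specified here.

```python
from typing import Optional, Sequence


def embedding_dimension(embeddings: Sequence[Optional[Sequence[float]]]) -> int:
    """Return the shared embedding length, or raise ValueError."""
    # Assumption: entries without an embedding are skipped, not rejected.
    present = [vec for vec in embeddings if vec is not None]
    if not present:
        raise ValueError("no documents have embeddings")
    dims = {len(vec) for vec in present}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()


# Usage: two 2-dimensional vectors and one missing embedding.
dim = embedding_dimension([[0.1, 0.2], None, [0.3, 0.4]])
```

Running a check like this before calling the export lets a caller report the problem with its own context instead of catching the exception afterwards.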