to_numpy_arrays

scikitplot.corpus.to_numpy_arrays(documents, *, include_text=True, include_raw_tensor=True, include_embedding=True, include_metadata=True, dtype_map=None)

Convert documents to a dict of NumPy arrays suitable for batch ML.

Parameters:

documents (list[CorpusDocument]): Documents to convert.
include_text (bool, optional): Include a texts column (list of str; a document whose text is None contributes an empty string). Default: True.
include_raw_tensor (bool, optional): Stack raw_tensor fields into a single array when all documents share the same shape. Skipped when shapes differ. Default: True.
include_embedding (bool, optional): Stack embedding fields into a single array when all documents have embeddings. Default: True.
include_metadata (bool, optional): Include the doc_ids, input_paths, and source_types columns. Default: True.
dtype_map (dict[str, Any] or None, optional): Per-column dtype overrides, e.g. {"raw_tensor": "float32"}. Default: None.
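
The stacking rule described for include_raw_tensor, together with a dtype override, can be sketched in plain NumPy. This illustrates the documented behavior only and is not the library's implementation; stack_if_uniform is a hypothetical helper, and returning None for a skipped column is an assumption made for the sketch.

```python
import numpy as np


def stack_if_uniform(tensors, dtype=None):
    """Hypothetical helper: stack arrays only when every shape matches.

    Returns None when the list is empty, contains None, or shapes differ.
    """
    if not tensors or any(t is None for t in tensors):
        return None
    arrays = [np.asarray(t) for t in tensors]
    if len({a.shape for a in arrays}) != 1:
        return None  # ragged batch: nothing to stack
    stacked = np.stack(arrays, axis=0)  # adds the leading batch axis N
    return stacked if dtype is None else stacked.astype(dtype)


# Two "images" with identical shape (H, W, C) = (4, 4, 3)
uniform = [np.zeros((4, 4, 3), dtype=np.uint8), np.ones((4, 4, 3), dtype=np.uint8)]
batch = stack_if_uniform(uniform, dtype="float32")
print(batch.shape, batch.dtype)  # (2, 4, 4, 3) float32

# Mixed shapes cannot be stacked
ragged = [np.zeros((4, 4, 3)), np.zeros((8, 8, 3))]
print(stack_if_uniform(ragged))  # None
```

Casting after stacking keeps the shape check independent of dtype, which is one plausible reading of how an override such as {"raw_tensor": "float32"} would apply.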

Returns:

dict[str, Any]

Column dict. Which keys are present depends on the include_* flags:

"texts" — list[str]
"raw_tensors" — ndarray shape (N, H, W, C) or (N, S) (only when all shapes match)
"embeddings" — ndarray shape (N, D)
"doc_ids" — list[str]
"input_paths" — list[str]
"source_types" — list[str]
"modalities" — list[str]
"content_hashes" — list[str | None]
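
Because some columns are conditional, downstream code may want to check for a key before using it. The sketch below uses a hand-built dict as a stand-in for the return value. The values are made up, and it assumes that a skipped column is simply absent from the dict, which the text above does not state explicitly.

```python
import numpy as np

# Stand-in for a returned column dict (values are illustrative only).
arrays = {
    "texts": ["first doc", "", "third doc"],
    "embeddings": np.zeros((3, 8), dtype=np.float32),
    "doc_ids": ["a", "b", "c"],
}

# Assumption: a column that was skipped is missing from the dict.
raw = arrays.get("raw_tensors")
if raw is None:
    print("no stacked raw tensors available")

emb = arrays.get("embeddings")
n_docs = len(arrays["doc_ids"])
if emb is not None:
    # Row i of an array column is expected to line up with entry i
    # of each list column, since both have length N.
    assert emb.shape[0] == n_docs
    features = emb  # shape (N, D)
    print(features.shape)
```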

Notes

Requires NumPy. Raises ImportError when NumPy is not installed.
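
One way to handle the optional dependency is to check for NumPy before attempting a conversion. This is a minimal sketch using only the standard library; numpy_available is a hypothetical helper, not part of scikitplot.

```python
import importlib.util


def numpy_available() -> bool:
    """Hypothetical helper: True when the numpy package can be located."""
    return importlib.util.find_spec("numpy") is not None


if numpy_available():
    print("numpy found: array conversion can be attempted")
else:
    print("numpy missing: install it, or catch ImportError around the call")
```

Wrapping the call itself in try/except ImportError is an equivalent alternative when a fallback path exists.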

Examples

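>>> from scikitplot.corpus import to_numpy_arrays
>>> # docs: a list of CorpusDocument objects prepared elsewhere
>>> arrays = to_numpy_arrays(docs, include_raw_tensor=True)
>>> arrays["raw_tensors"].shape  # (N, H, W, C) for image batch
(32, 224, 224, 3)
>>> arrays["embeddings"].shape  # (N, D)
(32, 384)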