to_numpy_arrays

scikitplot.corpus.to_numpy_arrays(documents, *, include_text=True, include_raw_tensor=True, include_embedding=True, include_metadata=True, dtype_map=None)

Convert documents to a dict of NumPy arrays suitable for batch ML.

Parameters:
documents : list[CorpusDocument]

Documents to convert.

include_text : bool, optional

Include the "texts" column (a list of str, with an empty string where a document's text is None). Default: True.

include_raw_tensor : bool, optional

Stack raw_tensor fields when all documents share the same shape. Skipped when shapes differ. Default: True.

include_embedding : bool, optional

Stack embedding fields when all documents have embeddings. Default: True.

include_metadata : bool, optional

Include doc_ids, input_paths, source_types columns. Default: True.

dtype_map : dict[str, Any] or None, optional

Override dtypes, e.g. {"raw_tensor": "float32"}. Default: None.
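The stacking rule described for include_raw_tensor, together with a dtype override, can be sketched in plain NumPy. This is only an illustration under assumptions: stack_if_uniform is a hypothetical helper, not part of scikitplot, and returning None for an empty or mixed-shape input is an assumed way of representing "skipped".

```python
import numpy as np


def stack_if_uniform(arrays, dtype=None):
    """Hypothetical helper: stack along a new leading axis.

    Returns None when the list is empty or the shapes differ,
    which stands in for the documented "skipped" behavior.
    """
    if not arrays:
        return None
    shapes = {np.shape(a) for a in arrays}
    if len(shapes) != 1:
        return None
    # dtype=None keeps each array's own dtype; a string such as
    # "float32" plays the role of a dtype_map entry.
    return np.stack([np.asarray(a, dtype=dtype) for a in arrays])


same = [np.zeros((4, 4, 3)), np.ones((4, 4, 3))]
mixed = [np.zeros((4, 4, 3)), np.zeros((2, 2, 3))]

batch = stack_if_uniform(same, dtype="float32")  # shape (2, 4, 4, 3)
skipped = stack_if_uniform(mixed)                # None: shapes differ
```

Checking shapes before calling np.stack avoids the ValueError that NumPy raises for mismatched inputs, which is consistent with a column that is omitted rather than failing the whole conversion.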

Returns:
dict[str, Any]

Column dict. Which keys are present depends on the include_* flags:

  • "texts" — list[str]

  • "raw_tensors" — ndarray shape (N, H, W, C) or (N, S) (only when all shapes match)

  • "embeddings" — ndarray shape (N, D)

  • "doc_ids" — list[str]

  • "input_paths" — list[str]

  • "source_types" — list[str]

  • "modalities" — list[str]

  • "content_hashes" — list[str | None]

Return type:

dict[str, Any]
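Because the keys vary with the flags and with whether stacking succeeded, downstream code may want to look columns up defensively. The sketch below uses a hand-built dict that only mimics the documented layout; it is not output from to_numpy_arrays, and summarize is a hypothetical helper.

```python
import numpy as np

# Hand-built stand-in for a returned column dict (not produced by the library).
columns = {
    "texts": ["first doc", "", "third doc"],
    "embeddings": np.zeros((3, 8), dtype="float32"),
    "doc_ids": ["d0", "d1", "d2"],
}


def summarize(cols):
    """Map each key to an array shape, or to (length,) for list columns."""
    return {
        key: value.shape if isinstance(value, np.ndarray) else (len(value),)
        for key, value in cols.items()
    }


summary = summarize(columns)

# "raw_tensors" may be absent, so use .get() instead of indexing.
raw = columns.get("raw_tensors")
```

Using .get() keeps the caller working whether a column was disabled by a flag or skipped because the shapes differed.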

Notes

Requires NumPy. Raises ImportError when NumPy is not installed.
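One common way to implement this kind of optional-dependency check is a lazy import that re-raises with a clearer message. The following is a generic sketch of that pattern, not the library's actual code; require is a hypothetical name.

```python
import importlib


def require(module_name):
    """Hypothetical helper: import a module on demand.

    Raises ImportError with an actionable message when the
    module is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this operation; "
            f"install it first"
        ) from exc


# A standard-library module is always importable.
json_module = require("json")
```

Deferring the import to call time lets the rest of a package load without the optional dependency.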

Examples

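>>> from scikitplot.corpus import to_numpy_arrays
>>> # docs is assumed to be a list of 32 CorpusDocument image records with embeddings
>>> arrays = to_numpy_arrays(docs, include_raw_tensor=True)
>>> arrays["raw_tensors"].shape  # (N, H, W, C) for image batch
(32, 224, 224, 3)
>>> arrays["embeddings"].shape  # (N, D)
(32, 384)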