to_numpy_arrays
- scikitplot.corpus.to_numpy_arrays(documents, *, include_text=True, include_raw_tensor=True, include_embedding=True, include_metadata=True, dtype_map=None)
Convert documents to a dict of NumPy arrays suitable for batch ML.
- Parameters:
- documents : list[CorpusDocument]
Documents to convert.
- include_text : bool, optional
Include texts column (list of str, empty str for None). Default: True.
- include_raw_tensor : bool, optional
Stack raw_tensor fields when all documents share the same shape. Skipped when shapes differ. Default: True.
- include_embedding : bool, optional
Stack embedding fields when all documents have embeddings. Default: True.
- include_metadata : bool, optional
Include doc_ids, input_paths, source_types columns. Default: True.
- dtype_map : dict[str, Any] or None, optional
Override dtypes, e.g. {"raw_tensor": "float32"}. Default: None.
- Returns:
- dict[str, Any]
Column dict. Keys depend on the include_* flags:
- "texts": list[str]
- "raw_tensors": ndarray of shape (N, H, W, C) or (N, S) (only when all shapes match)
- "embeddings": ndarray of shape (N, D)
- "doc_ids": list[str]
- "input_paths": list[str]
- "source_types": list[str]
- "modalities": list[str]
- "content_hashes": list[str | None]
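The stacking rule described for raw_tensor (stack only when every document has the same shape, otherwise skip the column) can be sketched with plain NumPy. This is an illustration of the documented behavior, not the library's implementation; the helper name stack_if_uniform and its dtype argument are hypothetical stand-ins for the function's internals and for a dtype_map entry.

```python
import numpy as np


def stack_if_uniform(arrays, dtype=None):
    """Hypothetical helper: stack arrays along a new batch axis.

    Returns None when the list is empty, any entry is None,
    or the shapes are not all identical.
    """
    if not arrays or any(a is None for a in arrays):
        return None
    arrays = [np.asarray(a) for a in arrays]
    if len({a.shape for a in arrays}) != 1:
        return None  # shapes differ, so the column would be skipped
    stacked = np.stack(arrays, axis=0)
    # Optional dtype override, analogous to {"raw_tensor": "float32"}.
    return stacked.astype(dtype) if dtype is not None else stacked


# Four image-like tensors of identical shape stack to (N, H, W, C).
images = [np.zeros((8, 8, 3), dtype="uint8") for _ in range(4)]
batch = stack_if_uniform(images, dtype="float32")
print(batch.shape, batch.dtype)  # (4, 8, 8, 3) float32

# Mixed shapes cannot be stacked into one ndarray.
mixed = [np.zeros((8, 8, 3)), np.zeros((4, 4, 3))]
print(stack_if_uniform(mixed))  # None
```

Because the column may be missing, callers are safer checking for the "raw_tensors" key than indexing it unconditionally.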
Notes
Requires numpy. Raises ImportError when not installed.
Examples
>>> arrays = to_numpy_arrays(docs, include_raw_tensor=True)
>>> arrays["raw_tensors"].shape  # (N, H, W, C) for image batch
(32, 224, 224, 3)
>>> arrays["embeddings"].shape  # (N, D)
(32, 384)
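Since the returned columns are aligned by row, list columns such as source_types can be used to select rows of the array columns. The sketch below works on a hand-built dict that mimics the documented keys rather than on real output; the texts, ids, and source type values ("image", "audio") are made up for illustration.

```python
import numpy as np

# Hand-built stand-in for the documented return value (assumed values).
arrays = {
    "texts": ["a cat", "", "a dog"],
    "embeddings": np.arange(12, dtype="float32").reshape(3, 4),
    "doc_ids": ["d0", "d1", "d2"],
    "source_types": ["image", "audio", "image"],
}

# "raw_tensors" is absent here, as it would be when shapes differ,
# so fall back to the embeddings as the feature matrix.
raw = arrays.get("raw_tensors")
features = raw.reshape(len(raw), -1) if raw is not None else arrays["embeddings"]

# A boolean mask built from a list column selects rows of an array column.
mask = np.array([s == "image" for s in arrays["source_types"]])
image_features = features[mask]
image_ids = [d for d, keep in zip(arrays["doc_ids"], mask) if keep]
print(image_features.shape, image_ids)  # (2, 4) ['d0', 'd2']
```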