MultimodalEmbeddingEngine#

class scikitplot.corpus.MultimodalEmbeddingEngine(text_backend='sentence_transformers', text_model='all-MiniLM-L6-v2', text_custom_fn=None, image_backend='clip', image_model='openai/clip-vit-base-patch32', image_custom_fn=None, audio_backend='whisper', audio_model='openai/whisper-base', audio_custom_fn=None, multimodal_fusion='mean', projection_dim=None, custom_projection_fn=None, normalize=True, batch_size=32, device=None, cache_dir=None, enable_cache=True)[source]#

Unified embedding engine for any CorpusDocument modality — text, image, audio, video, or multimodal.

Routes each document to the appropriate backend by inspecting doc.modality, optionally projects all vectors to a common projection_dim, and stores the result in doc.embedding.

Parameters:
text_backend : {“sentence_transformers”, “openai”, “custom”}, optional

Text embedding backend. Default: "sentence_transformers".

text_model : str, optional

Model name for the text backend. Default: "all-MiniLM-L6-v2".

text_custom_fn : callable or None, optional

Custom text embed function Callable[[list[str]], ndarray]. Required when text_backend="custom". Default: None.

image_backend : {“clip”, “open_clip”, “custom”}, optional

Image embedding backend. Default: "clip".

image_model : str, optional

CLIP/ViT model name. Default: "openai/clip-vit-base-patch32".

image_custom_fn : callable or None, optional

Custom image embed function Callable[[list[ndarray]], ndarray]. Required when image_backend="custom". Default: None.

audio_backend : {“whisper”, “wav2vec”, “custom”}, optional

Audio embedding backend. Default: "whisper".

audio_model : str, optional

Whisper model size or HuggingFace model id. Default: "openai/whisper-base".

audio_custom_fn : callable or None, optional

Custom audio embed function Callable[[list[ndarray]], ndarray]. Required when audio_backend="custom". Default: None.

multimodal_fusion : {“mean”, “concat”, “text_only”, “image_only”}, optional

How to combine text + image vectors for MULTIMODAL docs. "mean" averages the two vectors (requires the same dimension, or projection_dim set). "concat" concatenates them (output dim = text_dim + image_dim). Default: "mean".

projection_dim : int or None, optional

If set, project every embedding to this dimension via a linear map. Unifies incompatible backend dimensions so all modalities share one embedding space. Default: None (no projection).

custom_projection_fn : callable or None, optional

Override the auto-generated random projection with a learned one, e.g. a trained linear adapter. Callable[[ndarray (N, D)], ndarray (N, projection_dim)]. Default: None.

normalize : bool, optional

L2-normalise all output embeddings. Default: True.

batch_size : int, optional

Items per forward pass for each backend. Default: 32.

device : str or None, optional

Torch device ("cpu", "cuda", "mps"). None lets each backend auto-select. Default: None.

cache_dir : pathlib.Path or None, optional

Cache directory for embeddings. None uses the text engine’s default cache. Default: None.

enable_cache : bool, optional

Enable/disable embedding cache. Default: True.
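The interaction between multimodal_fusion and projection_dim can be sketched in plain NumPy. This is a minimal illustration of the documented behaviour, not the engine's internals; the vector dimensions (384 for MiniLM, 512 for CLIP ViT-B/32) and the random-projection scaling are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality vectors with incompatible dimensions
# (384 matches MiniLM's output size, 512 matches CLIP ViT-B/32's).
text_vec = rng.standard_normal(384).astype(np.float32)
image_vec = rng.standard_normal(512).astype(np.float32)

# "concat" fusion works on the raw vectors: output dim = text_dim + image_dim.
fused_concat = np.concatenate([text_vec, image_vec])   # shape (896,)

# "mean" fusion needs equal dims, so project both to projection_dim first.
projection_dim = 256

def project(vec, out_dim, rng):
    # Random linear map; the engine would auto-generate one per source dim.
    w = rng.standard_normal((vec.shape[0], out_dim)).astype(np.float32)
    w /= np.sqrt(vec.shape[0])  # scale to roughly preserve vector norms
    return vec @ w

fused_mean = (project(text_vec, projection_dim, rng)
              + project(image_vec, projection_dim, rng)) / 2.0

# With normalize=True the result is L2-normalised.
fused_mean /= np.linalg.norm(fused_mean)
print(fused_concat.shape, fused_mean.shape)  # (896,) (256,)
```

Note that "concat" doubles storage relative to "mean" at the same per-modality dimension, which matters when indexing large corpora.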


Notes

Projection dimension choice: Set projection_dim to the output dimension of the embedding space you want to match. For example, text-embedding-3-large produces 3072-dimensional vectors, text-embedding-3-small produces 1536, and all-MiniLM-L6-v2 produces 384.

Cache key includes modality + backend + model name + source path + mtime + n_items — changing any of these invalidates the cache.

Thread safety: Backends are lazily loaded and protected by a threading.Lock per backend. Safe for concurrent reads after warm-up.
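The lazy-loading pattern described above can be sketched as follows. This is an assumed shape of the mechanism (double-checked locking around a loader callable), not the library's actual code:

```python
import threading


class _LazyBackend:
    """Sketch of lazy, lock-protected backend loading."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the heavy model
        self._model = None
        self._lock = threading.Lock()  # one lock per backend

    def get(self):
        if self._model is None:          # fast path after warm-up, lock-free
            with self._lock:
                if self._model is None:  # double-checked: loser of the race
                    self._model = self._loader()  # reuses the winner's model
        return self._model


backend = _LazyBackend(lambda: "loaded-model")
print(backend.get())  # loaded-model
```

After the first `get()` completes, concurrent readers never touch the lock, which is what makes warm-started concurrent reads safe.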

Examples

Text + image documents in one call:

>>> engine = MultimodalEmbeddingEngine(
...     projection_dim=512,
...     image_backend="clip",
... )
>>> docs = engine.embed_documents(docs)
>>> docs[0].embedding.shape
(512,)

GPT fine-tuning pipeline:

>>> from pathlib import Path
>>> engine = MultimodalEmbeddingEngine(
...     text_backend="openai",
...     text_model="text-embedding-3-small",
...     projection_dim=1536,
... )
>>> docs = engine.embed_documents(text_docs)
>>> from scikitplot.corpus._embeddings._multimodal_embedding import (
...     LLMTrainingExporter,
... )
>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(docs, Path("train.jsonl"))
VALID_AUDIO_BACKENDS: ClassVar[tuple[str, ...]] = ('whisper', 'wav2vec', 'custom')#
VALID_FUSION: ClassVar[tuple[str, ...]] = ('mean', 'concat', 'text_only', 'image_only')#
VALID_IMAGE_BACKENDS: ClassVar[tuple[str, ...]] = ('clip', 'open_clip', 'custom')#
VALID_TEXT_BACKENDS: ClassVar[tuple[str, ...]] = ('sentence_transformers', 'openai', 'custom')#
audio_backend: str = 'whisper'#
audio_custom_fn: Callable | None = None#
audio_model: str = 'openai/whisper-base'#
batch_size: int = 32#
cache_dir: Path | None = None#
custom_projection_fn: Callable | None = None#
device: str | None = None#
embed_audio(waveforms)[source]#

Embed a list of audio waveforms via the configured audio backend.

Parameters:
waveforms : list[ndarray]

Each waveform: (samples,) float32, 16 kHz.

Returns:
numpy.ndarray

Shape (N, D) float32.


embed_documents(documents, source_path=None)[source]#

Embed all documents via doc.replace(embedding=...) and return the updated list.

Dispatches each document to the appropriate backend based on doc.modality.

Parameters:
documents : list[CorpusDocument]

Documents to embed.

source_path : pathlib.Path or None, optional

Used as the cache-key anchor. Pass the source file path for per-file caching. Default: None (no file cache).

Returns:
list[CorpusDocument]

Same list with embedding fields populated.


embed_documents_with_cache(documents, source_path)[source]#

Embed documents with SHA-256 cache keyed to source_path.

Cache key: SHA256(modality_tag + backend + model + path + mtime + N)[:24].
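The documented key can be reconstructed roughly as below. The components (modality tag, backend, model, path, mtime, item count) come from the docstring; the exact serialisation and separators are assumptions:

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(modality_tag, backend, model, path, n_items):
    # Illustrative reconstruction: SHA256 over the documented components,
    # truncated to 24 hex characters. The "|" separators are assumed.
    mtime = path.stat().st_mtime
    raw = f"{modality_tag}|{backend}|{model}|{path}|{mtime}|{n_items}"
    return hashlib.sha256(raw.encode()).hexdigest()[:24]


with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    p = Path(f.name)
key = cache_key("text", "sentence_transformers", "all-MiniLM-L6-v2", p, 100)
print(len(key))  # 24
```

Because mtime is part of the key, touching or rewriting the source file invalidates the cached embeddings automatically.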

Parameters:
documents : list[CorpusDocument]

Documents to embed.

source_path : pathlib.Path

Source file path. Used to build the cache key (path + mtime).

Returns:
list[CorpusDocument]

Documents with embeddings populated.


embed_images(arrays)[source]#

Embed a list of raw image arrays via the configured image backend.

Parameters:
arrays : list[ndarray]

Each array: (H, W, C) uint8 RGB.

Returns:
numpy.ndarray

Shape (N, D) float32.

Raises:
ImportError

If the required image backend library is not installed.


embed_texts(texts)[source]#

Embed a list of strings via the configured text backend.

Parameters:
texts : list[str]

Non-empty list of strings.

Returns:
numpy.ndarray

Shape (N, D) float32.


embed_video(frame_sequences, n_sample_frames=8)[source]#

Embed video by sampling frames and mean-pooling CLIP embeddings.

Parameters:
frame_sequences : list[ndarray]

Each array: (T, H, W, C) uint8 — T frames, channels-last.

n_sample_frames : int, optional

Frames to sample uniformly. Default: 8.

Returns:
numpy.ndarray

Shape (N, D) float32.
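The sample-then-pool scheme can be sketched in NumPy. The uniform index selection and mean-pooling follow the description above; the stand-in frame embeddings (a constant (8, 512) array in place of real CLIP outputs) are purely illustrative:

```python
import numpy as np


def sample_frames(frames, n_sample_frames=8):
    # Uniformly sample up to n_sample_frames indices from a (T, H, W, C) clip.
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, num=min(n_sample_frames, t)).round().astype(int)
    return frames[idx]


video = np.zeros((30, 4, 4, 3), dtype=np.uint8)  # 30 dummy frames
sampled = sample_frames(video)
print(sampled.shape)  # (8, 4, 4, 3)

# Per-frame CLIP embeddings would then be mean-pooled into one vector:
frame_embs = np.ones((8, 512), dtype=np.float32)  # stand-in for CLIP outputs
video_emb = frame_embs.mean(axis=0)               # shape (512,)
```

Mean-pooling discards temporal order, so this representation suits topical retrieval rather than action recognition.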


enable_cache: bool = True#
image_backend: str = 'clip'#
image_custom_fn: Callable | None = None#
image_model: str = 'openai/clip-vit-base-patch32'#
multimodal_fusion: str = 'mean'#
normalize: bool = True#
projection_dim: int | None = None#
text_backend: str = 'sentence_transformers'#
text_custom_fn: Callable | None = None#
text_model: str = 'all-MiniLM-L6-v2'#