MultimodalEmbeddingEngine#

class scikitplot.corpus.MultimodalEmbeddingEngine(text_backend='sentence_transformers', text_model='all-MiniLM-L6-v2', text_custom_fn=None, image_backend='clip', image_model='openai/clip-vit-base-patch32', image_custom_fn=None, audio_backend='whisper', audio_model='openai/whisper-base', audio_custom_fn=None, multimodal_fusion='mean', projection_dim=None, custom_projection_fn=None, normalize=True, batch_size=32, device=None, cache_dir=None, enable_cache=True)[source]#

Unified embedding engine for any CorpusDocument modality — text, image, audio, video, or multimodal.

Routes each document to the appropriate backend by inspecting doc.modality, optionally projects all vectors to a common projection_dim, and stores the result in doc.embedding.

Parameters:
text_backend : {“sentence_transformers”, “openai”, “custom”}, optional

Text embedding backend. Default: "sentence_transformers".

text_model : str, optional

Model name for the text backend. Default: "all-MiniLM-L6-v2".

text_custom_fn : callable or None, optional

Custom text embed function Callable[[list[str]], ndarray]. Required when text_backend="custom". Default: None.

image_backend : {“clip”, “open_clip”, “custom”}, optional

Image embedding backend. Default: "clip".

image_model : str, optional

CLIP/ViT model name. Default: "openai/clip-vit-base-patch32".

image_custom_fn : callable or None, optional

Custom image embed function Callable[[list[ndarray]], ndarray]. Required when image_backend="custom". Default: None.

audio_backend : {“whisper”, “wav2vec”, “custom”}, optional

Audio embedding backend. Default: "whisper".

audio_model : str, optional

Whisper model size or HuggingFace model id. Default: "openai/whisper-base".

audio_custom_fn : callable or None, optional

Custom audio embed function Callable[[list[ndarray]], ndarray]. Required when audio_backend="custom". Default: None.

multimodal_fusion : {“mean”, “concat”, “text_only”, “image_only”}, optional

How to combine text + image vectors for MULTIMODAL docs. "mean" averages the two vectors (requires the same dimension, or projection_dim set). "concat" concatenates them (output dim = text_dim + image_dim). Default: "mean".

projection_dim : int or None, optional

If set, project every embedding to this dimension via a linear map. Unifies incompatible backend dimensions so all modalities share one embedding space. Default: None (no projection).

custom_projection_fn : callable or None, optional

Override the auto-generated random projection with a learned one, e.g. a trained linear adapter. Callable[[ndarray (N, D)], ndarray (N, projection_dim)]. Default: None.

normalize : bool, optional

L2-normalise all output embeddings. Default: True.

batch_size : int, optional

Items per forward pass for each backend. Default: 32.

device : str or None, optional

Torch device ("cpu", "cuda", "mps"). None lets each backend auto-select. Default: None.

cache_dir : pathlib.Path or None, optional

Cache directory for embeddings. None uses the text engine’s default cache. Default: None.

enable_cache : bool, optional

Enable/disable embedding cache. Default: True.
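The interaction between multimodal_fusion and projection_dim can be sketched in plain NumPy. This is a minimal illustration of the documented behaviour, not the engine's internals; the vector dimensions (384 for MiniLM, 512 for CLIP ViT-B/32) and the random-projection scaling are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality vectors with incompatible dimensions
# (384 matches MiniLM's output size, 512 matches CLIP ViT-B/32's).
text_vec = rng.standard_normal(384).astype(np.float32)
image_vec = rng.standard_normal(512).astype(np.float32)

# "concat" fusion works on the raw vectors: output dim = text_dim + image_dim.
fused_concat = np.concatenate([text_vec, image_vec])   # shape (896,)

# "mean" fusion needs equal dims, so project both to projection_dim first.
projection_dim = 256

def project(vec, out_dim, rng):
    # Random linear map; the engine would auto-generate one per source dim.
    w = rng.standard_normal((vec.shape[0], out_dim)).astype(np.float32)
    w /= np.sqrt(vec.shape[0])  # scale to roughly preserve vector norms
    return vec @ w

fused_mean = (project(text_vec, projection_dim, rng)
              + project(image_vec, projection_dim, rng)) / 2.0

# With normalize=True the result is L2-normalised.
fused_mean /= np.linalg.norm(fused_mean)
print(fused_concat.shape, fused_mean.shape)  # (896,) (256,)
```

Note that "concat" doubles storage relative to "mean" at the same per-modality dimension, which matters when indexing large corpora.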


Notes

Projection dimension choice: Set projection_dim to the output dimension of the embedding space you want to match. For example, text-embedding-3-large produces 3072-dimensional vectors, text-embedding-3-small produces 1536, and all-MiniLM-L6-v2 produces 384.

Cache key includes modality + backend + model name + source path + mtime + n_items — changing any of these invalidates the cache.

Thread safety: Backends are lazily loaded and protected by a threading.Lock per backend. Safe for concurrent reads after warm-up.
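The lazy-loading pattern described above can be sketched as follows. This is an assumed shape of the mechanism (double-checked locking around a loader callable), not the library's actual code:

```python
import threading


class _LazyBackend:
    """Sketch of lazy, lock-protected backend loading."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the heavy model
        self._model = None
        self._lock = threading.Lock()  # one lock per backend

    def get(self):
        if self._model is None:          # fast path after warm-up, lock-free
            with self._lock:
                if self._model is None:  # double-checked: loser of the race
                    self._model = self._loader()  # reuses the winner's model
        return self._model


backend = _LazyBackend(lambda: "loaded-model")
print(backend.get())  # loaded-model
```

After the first `get()` completes, concurrent readers never touch the lock, which is what makes warm-started concurrent reads safe.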

Examples

Text + image documents in one call:

>>> engine = MultimodalEmbeddingEngine(
...     projection_dim=512,
...     image_backend="clip",
... )
>>> docs = engine.embed_documents(docs)
>>> docs[0].embedding.shape
(512,)

GPT fine-tuning pipeline:

>>> from pathlib import Path
>>> engine = MultimodalEmbeddingEngine(
...     text_backend="openai",
...     text_model="text-embedding-3-small",
...     projection_dim=1536,
... )
>>> docs = engine.embed_documents(text_docs)
>>> from scikitplot.corpus._embeddings._multimodal_embedding import (
...     LLMTrainingExporter,
... )
>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(docs, Path("train.jsonl"))
VALID_AUDIO_BACKENDS: ClassVar[tuple[str, ...]] = ('whisper', 'wav2vec', 'custom')#
VALID_FUSION: ClassVar[tuple[str, ...]] = ('mean', 'concat', 'text_only', 'image_only')#
VALID_IMAGE_BACKENDS: ClassVar[tuple[str, ...]] = ('clip', 'open_clip', 'custom')#
VALID_TEXT_BACKENDS: ClassVar[tuple[str, ...]] = ('sentence_transformers', 'openai', 'custom')#
audio_backend: str = 'whisper'#
audio_custom_fn: Callable | None = None#
audio_model: str = 'openai/whisper-base'#
batch_size: int = 32#
cache_dir: Path | None = None#
custom_projection_fn: Callable | None = None#
device: str | None = None#
embed_audio(waveforms)[source]#

Embed a list of audio waveforms via the configured audio backend.

Parameters:
waveforms : list[ndarray]

Each waveform: (samples,) float32, 16 kHz.

Returns:
numpy.ndarray

Shape (N, D) float32.


embed_documents(documents, source_path=None)[source]#

Embed all documents via doc.replace(embedding=...) and return the updated list.

Dispatches each document to the appropriate backend based on doc.modality.

Parameters:
documents : list[CorpusDocument]

Documents to embed.

source_path : pathlib.Path or None, optional

Used as the cache-key anchor. Pass the source file path for per-file caching. Default: None (no file cache).

Returns:
list[CorpusDocument]

Same list with embedding fields populated.


embed_documents_with_cache(documents, source_path)[source]#

Embed documents with SHA-256 cache keyed to source_path.

Cache key: SHA256(modality_tag + backend + model + path + mtime + N)[:24].
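The documented key can be reconstructed roughly as below. The components (modality tag, backend, model, path, mtime, item count) come from the docstring; the exact serialisation and separators are assumptions:

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(modality_tag, backend, model, path, n_items):
    # Illustrative reconstruction: SHA256 over the documented components,
    # truncated to 24 hex characters. The "|" separators are assumed.
    mtime = path.stat().st_mtime
    raw = f"{modality_tag}|{backend}|{model}|{path}|{mtime}|{n_items}"
    return hashlib.sha256(raw.encode()).hexdigest()[:24]


with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    p = Path(f.name)
key = cache_key("text", "sentence_transformers", "all-MiniLM-L6-v2", p, 100)
print(len(key))  # 24
```

Because mtime is part of the key, touching or rewriting the source file invalidates the cached embeddings automatically.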

Parameters:
documents : list[CorpusDocument]

Documents to embed.

source_path : pathlib.Path

Source file path. Used to build the cache key (path + mtime).

Returns:
list[CorpusDocument]

Documents with embeddings populated.


embed_images(arrays)[source]#

Embed a list of raw image arrays via the configured image backend.

Parameters:
arrays : list[ndarray]

Each array: (H, W, C) uint8 RGB.

Returns:
numpy.ndarray

Shape (N, D) float32.

Raises:
ImportError

If the required image backend library is not installed.


embed_texts(texts)[source]#

Embed a list of strings via the configured text backend.

Parameters:
texts : list[str]

Non-empty list of strings.

Returns:
numpy.ndarray

Shape (N, D) float32.


embed_video(frame_sequences, n_sample_frames=8)[source]#

Embed video by sampling frames and mean-pooling CLIP embeddings.

Parameters:
frame_sequences : list[ndarray]

Each array: (T, H, W, C) uint8 — T frames, channels-last.

n_sample_frames : int, optional

Frames to sample uniformly. Default: 8.

Returns:
numpy.ndarray

Shape (N, D) float32.
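The sample-then-pool scheme can be sketched in NumPy. The uniform index selection and mean-pooling follow the description above; the stand-in frame embeddings (a constant (8, 512) array in place of real CLIP outputs) are purely illustrative:

```python
import numpy as np


def sample_frames(frames, n_sample_frames=8):
    # Uniformly sample up to n_sample_frames indices from a (T, H, W, C) clip.
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, num=min(n_sample_frames, t)).round().astype(int)
    return frames[idx]


video = np.zeros((30, 4, 4, 3), dtype=np.uint8)  # 30 dummy frames
sampled = sample_frames(video)
print(sampled.shape)  # (8, 4, 4, 3)

# Per-frame CLIP embeddings would then be mean-pooled into one vector:
frame_embs = np.ones((8, 512), dtype=np.float32)  # stand-in for CLIP outputs
video_emb = frame_embs.mean(axis=0)               # shape (512,)
```

Mean-pooling discards temporal order, so this representation suits topical retrieval rather than action recognition.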


enable_cache: bool = True#
image_backend: str = 'clip'#
image_custom_fn: Callable | None = None#
image_model: str = 'openai/clip-vit-base-patch32'#
multimodal_fusion: str = 'mean'#
normalize: bool = True#
projection_dim: int | None = None#
text_backend: str = 'sentence_transformers'#
text_custom_fn: Callable | None = None#
text_model: str = 'all-MiniLM-L6-v2'#