MultimodalEmbeddingEngine#
- class scikitplot.corpus.MultimodalEmbeddingEngine(text_backend='sentence_transformers', text_model='all-MiniLM-L6-v2', text_custom_fn=None, image_backend='clip', image_model='openai/clip-vit-base-patch32', image_custom_fn=None, audio_backend='whisper', audio_model='openai/whisper-base', audio_custom_fn=None, multimodal_fusion='mean', projection_dim=None, custom_projection_fn=None, normalize=True, batch_size=32, device=None, cache_dir=None, enable_cache=True)[source]#
Unified embedding engine for any `CorpusDocument` modality — text, image, audio, video, or multimodal.

Routes each document to the appropriate backend by inspecting `doc.modality`, optionally projects all vectors to a common `projection_dim`, and stores the result in `doc.embedding`.

- Parameters:
  - text_backend : {"sentence_transformers", "openai", "custom"}, optional
    Text embedding backend. Default: `"sentence_transformers"`.
  - text_model : str, optional
    Model name for the text backend. Default: `"all-MiniLM-L6-v2"`.
  - text_custom_fn : callable or None, optional
    Custom text embed function `Callable[[list[str]], ndarray]`. Required when `text_backend="custom"`. Default: `None`.
  - image_backend : {"clip", "open_clip", "custom"}, optional
    Image embedding backend. Default: `"clip"`.
  - image_model : str, optional
    CLIP/ViT model name. Default: `"openai/clip-vit-base-patch32"`.
  - image_custom_fn : callable or None, optional
    Custom image embed function `Callable[[list[ndarray]], ndarray]`. Required when `image_backend="custom"`. Default: `None`.
  - audio_backend : {"whisper", "wav2vec", "custom"}, optional
    Audio embedding backend. Default: `"whisper"`.
  - audio_model : str, optional
    Whisper model size or HuggingFace model id. Default: `"openai/whisper-base"`.
  - audio_custom_fn : callable or None, optional
    Custom audio embed function `Callable[[list[ndarray]], ndarray]`. Required when `audio_backend="custom"`. Default: `None`.
  - multimodal_fusion : {"mean", "concat", "text_only", "image_only"}, optional
    How to combine text + image vectors for MULTIMODAL docs. `"mean"` averages the two vectors (requires matching dimensions or `projection_dim` set). `"concat"` concatenates them (output dim = text_dim + image_dim). Default: `"mean"`.
  - projection_dim : int or None, optional
    If set, project every embedding to this dimension via a linear map. Unifies incompatible backend dimensions so all modalities share one embedding space. Default: `None` (no projection).
  - custom_projection_fn : callable or None, optional
    Override the auto-generated random projection with a learned one, e.g. a trained linear adapter. `Callable[[ndarray (N, D)], ndarray (N, projection_dim)]`. Default: `None`.
  - normalize : bool, optional
    L2-normalise all output embeddings. Default: `True`.
  - batch_size : int, optional
    Items per forward pass for each backend. Default: `32`.
  - device : str or None, optional
    Torch device (`"cpu"`, `"cuda"`, `"mps"`). `None` lets each backend auto-select. Default: `None`.
  - cache_dir : pathlib.Path or None, optional
    Cache directory for embeddings. `None` uses the text engine's default cache. Default: `None`.
  - enable_cache : bool, optional
    Enable or disable the embedding cache. Default: `True`.
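The fusion semantics described above can be sketched with plain NumPy. The `fuse` helper below is a hypothetical illustration of the documented `multimodal_fusion` options, not the engine's internal code:

```python
import numpy as np

def fuse(text_vec: np.ndarray, image_vec: np.ndarray,
         mode: str = "mean") -> np.ndarray:
    """Illustrative stand-in for the documented multimodal_fusion modes."""
    if mode == "mean":
        # Requires both vectors to share one dimension (or projection_dim set).
        return (text_vec + image_vec) / 2.0
    if mode == "concat":
        # Output dim = text_dim + image_dim.
        return np.concatenate([text_vec, image_vec])
    if mode == "text_only":
        return text_vec
    if mode == "image_only":
        return image_vec
    raise ValueError(f"unknown fusion mode: {mode}")

text_vec = np.ones(384, dtype=np.float32)
image_vec = np.zeros(384, dtype=np.float32)
fused = fuse(text_vec, image_vec, "mean")    # shape (384,), every value 0.5
cat = fuse(text_vec, image_vec, "concat")    # shape (768,)
```

This makes the dimensional constraint concrete: `"mean"` only works when both vectors are the same length, which is why `projection_dim` is required when the text and image backends emit different sizes.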
Notes

Projection dimension choice: set `projection_dim` to the hidden size of your target LLM's embedding layer. For GPT-4 / `text-embedding-3-large` this is 3072; for `text-embedding-3-small` it is 1536, and for `all-MiniLM-L6-v2` it is 384.

Cache key includes modality + backend + model name + source path + mtime + n_items — changing any of these invalidates the cache.

Thread safety: backends are lazily loaded and protected by a `threading.Lock` per backend. Safe for concurrent reads after warm-up.

Examples
Text + image documents in one call:

>>> engine = MultimodalEmbeddingEngine(
...     projection_dim=512,
...     image_backend="clip",
... )
>>> docs = engine.embed_documents(docs)
>>> docs[0].embedding.shape
(512,)
GPT fine-tuning pipeline:

>>> engine = MultimodalEmbeddingEngine(
...     text_backend="openai",
...     text_model="text-embedding-3-small",
...     projection_dim=1536,
... )
>>> docs = engine.embed_documents(text_docs)
>>> from scikitplot.corpus._embeddings._multimodal_embedding import (
...     LLMTrainingExporter,
... )
>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(docs, Path("train.jsonl"))
- embed_documents(documents, source_path=None)[source]#
Embed all documents in-place (via `doc.replace(embedding=...)`) and return the updated list.

Dispatches by `doc.modality`:

  - TEXT → `embed_texts`
  - IMAGE → `embed_images`
  - AUDIO → `embed_audio`
  - VIDEO → `embed_video`
  - MULTIMODAL → fused text + image (see `multimodal_fusion`)
  - fallback: treats as TEXT using `doc.text` or `""`
- Parameters:
  - documents : list[CorpusDocument]
    Documents to embed.
  - source_path : pathlib.Path or None, optional
    Used as the cache-key anchor. Pass the source file path for per-file caching. Default: `None` (no file cache).
- Returns:
  - list[CorpusDocument]
    Same list with `embedding` fields populated.
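The dispatch table above can be sketched as a plain lookup. `Doc`, `route`, and the handler name `fuse_text_image` are illustrative stand-ins, not the library's internals:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    modality: str
    text: str = ""

def route(doc: Doc) -> str:
    # Returns the name of the handler that embed_documents would invoke.
    # Any modality not in the table falls back to TEXT handling, as documented.
    table = {
        "TEXT": "embed_texts",
        "IMAGE": "embed_images",
        "AUDIO": "embed_audio",
        "VIDEO": "embed_video",
        "MULTIMODAL": "fuse_text_image",  # hypothetical name for the fusion path
    }
    return table.get(doc.modality, "embed_texts")

route(Doc("IMAGE"))     # "embed_images"
route(Doc("unknown"))   # "embed_texts", via the documented fallback
```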
- embed_documents_with_cache(documents, source_path)[source]#
Embed documents with a SHA-256 cache keyed to `source_path`.

Cache key: `SHA256(modality_tag + backend + model + path + mtime + N)[:24]`.
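The documented key scheme can be reproduced with `hashlib`. `cache_key` below is a hypothetical reconstruction of the formula above, not the library's exact serialisation:

```python
import hashlib

def cache_key(modality_tag: str, backend: str, model: str,
              path: str, mtime: float, n_items: int) -> str:
    # SHA256 over the concatenated key fields, truncated to 24 hex chars,
    # mirroring the documented scheme. The exact field separators are an
    # assumption here.
    raw = f"{modality_tag}{backend}{model}{path}{mtime}{n_items}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:24]

key = cache_key("text", "sentence_transformers", "all-MiniLM-L6-v2",
                "/data/corpus.txt", 1700000000.0, 128)
```

Because the file's mtime and item count are part of the key, editing or appending to the source file produces a different digest, so stale cache entries are never served.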
- embed_images(arrays)[source]#
Embed a list of raw image arrays via the configured image backend.
- Parameters:
  - arrays : list[ndarray]
    Each array: `(H, W, C)` uint8 RGB.
- Returns:
  - numpy.ndarray
    Shape `(N, D)`, float32.
- Raises:
  - ImportError
    If the required image backend library is not installed.
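A minimal sketch of preparing valid inputs for `embed_images`, assuming NumPy only (synthetic pixel data stands in for real image loading):

```python
import numpy as np

# embed_images expects channels-last (H, W, C) uint8 RGB arrays; images in a
# batch may have different spatial sizes.
rng = np.random.default_rng(0)
images = [
    rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8),
    rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8),
]

# Loaders that return floats in [0, 1] (e.g. matplotlib's imread on PNGs)
# need rescaling to uint8 before being passed in:
float_img = rng.random((64, 64, 3))              # float64 in [0, 1]
as_uint8 = (float_img * 255).astype(np.uint8)    # valid embed_images input
```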
- embed_video(frame_sequences, n_sample_frames=8)[source]#
Embed video by sampling frames and mean-pooling CLIP embeddings.
- Parameters:
  - frame_sequences : list[ndarray]
    Each array: `(T, H, W, C)` uint8 — T frames, channels-last.
  - n_sample_frames : int, optional
    Frames to sample uniformly. Default: `8`.
- Returns:
  - numpy.ndarray
    Shape `(N, D)`, float32.
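The sample-and-pool strategy can be sketched as follows. `sample_and_pool` and `toy_embed` are illustrative stand-ins under the stated assumptions (uniform sampling, mean pooling, L2 normalisation), not the library's implementation:

```python
import numpy as np

def sample_and_pool(frames: np.ndarray, n_sample_frames: int = 8,
                    embed_fn=None) -> np.ndarray:
    """Sample frames uniformly from a (T, H, W, C) clip, embed each frame,
    and mean-pool into one clip-level vector. embed_fn stands in for the
    CLIP image backend."""
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, num=min(n_sample_frames, t)).astype(int)
    sampled = frames[idx]                      # (n, H, W, C)
    embs = embed_fn(sampled)                   # (n, D) per-frame embeddings
    pooled = embs.mean(axis=0)                 # (D,) one vector per clip
    return pooled / np.linalg.norm(pooled)     # L2-normalise (normalize=True)

# Toy embed_fn: flatten pixels through a fixed random projection (CLIP stand-in).
rng = np.random.default_rng(0)
proj = rng.standard_normal((64 * 64 * 3, 128)).astype(np.float32)

def toy_embed(x):
    return x.reshape(len(x), -1).astype(np.float32) @ proj

clip = rng.integers(0, 256, size=(30, 64, 64, 3), dtype=np.uint8)  # 30 frames
vec = sample_and_pool(clip, n_sample_frames=8, embed_fn=toy_embed)
```

Mean-pooling sampled frames keeps the cost bounded regardless of clip length, at the price of discarding temporal ordering.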