EmbeddingEngine#
- class scikitplot.corpus.EmbeddingEngine(model_name='paraphrase-multilingual-mpnet-base-v2', backend='sentence_transformers', custom_fn=None, cache_dir=None, enable_cache=True, batch_size=64, normalize=True, dtype=<class 'numpy.float32'>, show_progress_bar=False, device=None)[source]#
Multi-backend sentence embedding engine with SHA-256 file caching.
Produces a 2-D float32 numpy array of shape (n_texts, dim) for a list of input strings. Embeddings are cached to .npy files keyed by (model_name, source_path, mtime, n_texts) so that unchanged corpora are served from disk in O(1).

- Parameters:
- model_name : str, optional
Embedding model identifier. Interpretation depends on backend. For sentence_transformers, any HuggingFace model name. For openai, any OpenAI embedding model name. Ignored when backend="custom". Default: "paraphrase-multilingual-mpnet-base-v2".
- backend : {"sentence_transformers", "openai", "custom"}, optional
Which embedding backend to use. Default: "sentence_transformers".
- custom_fn : callable or None, optional
User-supplied Callable[[list[str]], np.ndarray]. Required when backend="custom". Ignored otherwise.
- cache_dir : pathlib.Path or None, optional
Directory for .npy cache files. Created if absent. None uses ~/.cache/scikitplot/embeddings. Pass pathlib.Path(os.devnull) to disable caching.
- enable_cache : bool, optional
Set to False to completely disable file caching (always re-computes). Default: True.
- batch_size : int, optional
Number of texts per encoding batch. Relevant for the sentence_transformers and openai backends. Default: 64.
- normalize : bool, optional
L2-normalise output vectors to unit norm (required for cosine / inner-product similarity search). Default: True.
- dtype : numpy.dtype, optional
Output dtype. Default: numpy.float32.
- show_progress_bar : bool, optional
Show a tqdm progress bar during encoding (sentence_transformers only). Default: False.
- device : str or None, optional
PyTorch device for sentence_transformers ("cpu", "cuda", "mps"). None lets the library choose. Default: None.
- Attributes:
- VALID_BACKENDS : tuple of str
Class variable. All accepted backend names.
- Raises:
- ValueError
If backend="custom" but custom_fn is None.
- ValueError
If batch_size or dtype are invalid.
- ImportError
At call time if the required backend library is not installed.
See also
scikitplot.corpus.pipeline.CorpusPipeline : Integrates this engine.
scikitplot.corpus._embeddings._multimodal_embedding.MultimodalEmbeddingEngine : Extends this engine with image, audio, and video modalities plus a projection layer and LLM training export.
Notes
Thread safety: The internal model cache (_embed_fn) is initialised lazily and protected by a threading.Lock.

Cache invalidation: The cache key includes the source file's modification time. Any write to the source file (even a metadata update via touch) invalidates the cache. If this is undesirable, pass a stable source_path (e.g. a logical identifier rather than the real path).

Normalisation: When normalize=True, zero-norm vectors (e.g. empty-string inputs) are left as zero vectors rather than producing NaN. The normalisation guard in SimilarityIndex will warn if any zero vectors are detected at search time.

Examples
Default usage (sentence_transformers):
>>> engine = EmbeddingEngine()
>>> texts = ["Hello world.", "Second sentence."]
>>> vecs = engine.embed(texts)
>>> vecs.shape
(2, 768)
Custom callable backend:
>>> import numpy as np
>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda texts: np.zeros((len(texts), 64), dtype=np.float32),
... )
>>> engine.embed(["Hello."]).shape
(1, 64)
With source-file cache:
>>> from pathlib import Path
>>> vecs, from_cache = engine.embed_with_cache(
...     texts,
...     source_path=Path("corpus.txt"),
... )
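The cache-key scheme described above (model name, source path, mtime, and text count hashed with SHA-256) can be sketched as follows. The exact field composition and separator are assumptions based on the documented key, not the library's internals:

```python
import hashlib
from pathlib import Path


def cache_key(model_name: str, source_path: Path, n_texts: int) -> str:
    """Sketch of a SHA-256 cache key over the documented fields.

    Any write to source_path changes its mtime and therefore the key,
    which is how stale cache entries are invalidated.
    """
    mtime = source_path.stat().st_mtime_ns
    raw = f"{model_name}|{source_path}|{mtime}|{n_texts}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Under this scheme a cached array would live at something like cache_dir / f"{key}.npy"; the key stays stable across runs as long as the source file is untouched.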
- VALID_BACKENDS: ClassVar[tuple[str, ...]] = ('sentence_transformers', 'openai', 'custom')#
Accepted backend values.

For image/audio/video embeddings use MultimodalEmbeddingEngine, which additionally supports "clip", "whisper", and "wav2vec".
- embed(texts)[source]#
Compute embeddings for texts without caching.

- Parameters:
- texts : list of str
Non-empty list of text strings. Empty strings produce zero vectors; they are not filtered here (filtering belongs in the pipeline).
- Returns:
- numpy.ndarray
Array of shape (len(texts), dim) with dtype self.dtype.
- Raises:
- ValueError
If texts is empty.
- ImportError
If the required backend library is not installed.
Examples
>>> import numpy as np
>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda t: np.zeros((len(t), 32), dtype=np.float32),
... )
>>> engine.embed(["hello"]).shape
(1, 32)
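The zero-norm guard described in the Notes (empty-string inputs yield zero vectors rather than NaN) can be sketched as a small normalisation helper; this is an illustrative re-implementation, not the library's code:

```python
import numpy as np


def safe_l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalise rows, leaving zero-norm rows as zero vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid 0 / 0 -> NaN: divide by 1 wherever the norm is zero,
    # so zero vectors pass through unchanged.
    return vectors / np.where(norms == 0.0, 1.0, norms)
```

Without the guard, a plain `vectors / norms` would emit NaN rows for empty inputs, which would silently poison any downstream cosine-similarity search.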
- embed_documents(documents, source_path=None)[source]#
Embed a list of CorpusDocument instances, returning copies with doc.embedding populated on each.

- Parameters:
- documents : list of CorpusDocument
Documents to embed. Each must have a non-empty text field.
- source_path : pathlib.Path or None, optional
Source path for the cache key. None disables caching.
- Returns:
- list of CorpusDocument
The corresponding documents with embedding fields populated (built via replace(); the originals are not mutated).
Examples
>>> docs = list(reader.get_documents())
>>> docs = engine.embed_documents(docs)
>>> docs[0].has_embedding
True
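The non-mutating update in Returns follows the standard dataclasses.replace pattern. The CorpusDocument below is a simplified stand-in for the real class, used only to illustrate the pattern:

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class CorpusDocument:
    """Simplified stand-in for scikitplot.corpus's document type."""

    text: str
    embedding: "np.ndarray | None" = None

    @property
    def has_embedding(self) -> bool:
        return self.embedding is not None


def embed_documents(docs, embed_fn):
    """Return copies of docs with embeddings attached; originals untouched."""
    vectors = embed_fn([d.text for d in docs])
    # replace() builds a new frozen instance per document instead of
    # mutating the input, so callers can keep the original list intact.
    return [replace(d, embedding=v) for d, v in zip(docs, vectors)]
```

Rebinding the name (`docs = engine.embed_documents(docs)`), as in the doctest above, is the idiomatic way to consume such an API.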
- embed_with_cache(texts, source_path)[source]#
Compute embeddings with file caching keyed to source_path.

- Parameters:
- texts : list of str
Text strings to embed.
- source_path : pathlib.Path
Path to the source file that generated texts. Used to build the cache key (path + mtime + len(texts)).
- Returns:
- embeddings : numpy.ndarray
Array of shape (len(texts), dim).
- from_cache : bool
True if the result was loaded from the disk cache.
- Raises:
- ValueError
If texts is empty.
- OSError
If the cache directory cannot be created.
Examples
>>> vecs, cached = engine.embed_with_cache(texts, Path("corpus.txt"))
>>> cached  # True on second call with same inputs
False
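The False-then-True behaviour above can be reproduced with a toy cache. Everything below is an illustrative re-implementation of the documented contract (compute on miss, save as .npy, serve from disk on hit), not the library's internals:

```python
import hashlib
from pathlib import Path

import numpy as np


def embed_with_cache(texts, source_path, cache_dir, compute):
    """Toy version of the documented caching contract (sketch)."""
    mtime = source_path.stat().st_mtime_ns
    key = hashlib.sha256(
        f"toy-model|{source_path}|{mtime}|{len(texts)}".encode()
    ).hexdigest()
    cache_file = cache_dir / f"{key}.npy"
    if cache_file.exists():
        return np.load(cache_file), True  # served from disk: O(1) in model work
    vecs = compute(texts)
    np.save(cache_file, vecs)
    return vecs, False  # freshly computed, now cached for next time
```

Because the key embeds the source file's mtime, rewriting (or merely touching) the source file yields a new key, so the next call recomputes rather than serving stale vectors.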