EmbeddingEngine#

class scikitplot.corpus.EmbeddingEngine(model_name='paraphrase-multilingual-mpnet-base-v2', backend='sentence_transformers', custom_fn=None, cache_dir=None, enable_cache=True, batch_size=64, normalize=True, dtype=<class 'numpy.float32'>, show_progress_bar=False, device=None)[source]#

Multi-backend sentence embedding engine with SHA-256 file caching.

Produces a 2-D float32 numpy array of shape (n_texts, dim) for a list of input strings. Embeddings are cached to .npy files keyed by (model_name, source_path, mtime, n_texts), so unchanged corpora are served from disk without re-encoding.

Parameters:
model_name : str, optional

Embedding model identifier. Interpretation depends on backend. For sentence_transformers, any HuggingFace model name. For openai, any OpenAI embedding model name. Ignored when backend="custom". Default: "paraphrase-multilingual-mpnet-base-v2".

backend : {"sentence_transformers", "openai", "custom"}, optional

Which embedding backend to use. Default: "sentence_transformers".

custom_fn : callable or None, optional

User-supplied Callable[[list[str]], np.ndarray]. Required when backend="custom". Ignored otherwise.

cache_dir : pathlib.Path or None, optional

Directory for .npy cache files. Created if absent. None uses ~/.cache/scikitplot/embeddings. Pass pathlib.Path(os.devnull) to disable caching.

enable_cache : bool, optional

Set to False to disable file caching entirely (embeddings are always recomputed). Default: True.

batch_size : int, optional

Number of texts per encoding batch. Relevant for sentence_transformers and openai backends. Default: 64.

normalize : bool, optional

L2-normalise output vectors to unit norm (required for cosine / inner-product similarity search). Default: True.

dtype : numpy.dtype, optional

Output dtype. Default: numpy.float32.

show_progress_bar : bool, optional

Show a tqdm progress bar during encoding (sentence_transformers only). Default: False.

device : str or None, optional

PyTorch device for sentence_transformers ("cpu", "cuda", "mps"). None lets the library choose. Default: None.

Attributes:
VALID_BACKENDS : tuple of str

Class variable. All accepted backend names.

Raises:
ValueError

If backend="custom" but custom_fn is None.

ValueError

If batch_size or dtype is invalid.

ImportError

At call time if the required backend library is not installed.

See also

scikitplot.corpus.pipeline.CorpusPipeline

Integrates this engine.

scikitplot.corpus._embeddings._multimodal_embedding.MultimodalEmbeddingEngine

Extends this engine with image, audio, and video modalities plus projection layer and LLM training export.

Notes

Thread safety: The internal model cache (_embed_fn) is initialised lazily and protected by a threading.Lock.
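
The lazy, lock-protected initialisation described above follows a common pattern that can be sketched roughly as follows. This is an illustrative sketch, not the library's actual code; LazyEmbedder and _load_model are hypothetical names:

```python
import threading


class LazyEmbedder:
    """Illustrative sketch of lazy, lock-protected model initialisation."""

    def __init__(self):
        self._embed_fn = None          # built on first use
        self._lock = threading.Lock()  # guards the one-time initialisation

    def _get_embed_fn(self):
        # Double-checked locking: the fast path skips the lock entirely
        # once the embedding function has been built.
        if self._embed_fn is None:
            with self._lock:
                if self._embed_fn is None:
                    self._embed_fn = self._load_model()
        return self._embed_fn

    def _load_model(self):
        # Hypothetical expensive model load, stubbed out here.
        return lambda texts: [t.lower() for t in texts]
```

The double check (before and after acquiring the lock) ensures the model is loaded at most once even when several threads race on first use.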

Cache invalidation: The cache key includes the source file’s modification time. Any write to the source file (even a metadata update via touch) invalidates the cache. If this is undesirable, pass a stable source_path (e.g. a logical identifier rather than the real path).
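
An mtime-sensitive key of the kind described above might be built like this. This is a sketch of the documented scheme (model name + path + mtime + text count hashed with SHA-256); the actual key layout is internal to the library:

```python
import hashlib
from pathlib import Path


def cache_key(model_name: str, source_path: Path, n_texts: int) -> str:
    """Sketch of an mtime-sensitive SHA-256 cache key."""
    # Any write to the file (even a metadata-only `touch`) changes mtime,
    # which changes the key and thus invalidates the cached .npy file.
    mtime = source_path.stat().st_mtime_ns
    payload = f"{model_name}|{source_path.resolve()}|{mtime}|{n_texts}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key hashes the resolved path, passing a stable logical identifier instead of the real file path keeps the key constant across touches.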

Normalisation: When normalize=True, zero-norm vectors (e.g. empty-string inputs) are left as zero vectors rather than producing NaN. The normalisation guard in SimilarityIndex will warn if any zero vectors are detected at search time.
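
The zero-norm guard described above amounts to dividing by the row norm only where that norm is positive. A minimal numpy sketch of this behaviour (illustrative, not the library's internal code):

```python
import numpy as np


def safe_l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """L2-normalise rows; zero-norm rows stay zero instead of becoming NaN."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # dividing a zero row by 1 leaves it zero
    return (vecs / norms).astype(vecs.dtype)
```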

Examples

Default usage (sentence_transformers):

>>> engine = EmbeddingEngine()
>>> texts = ["Hello world.", "Second sentence."]
>>> vecs = engine.embed(texts)
>>> vecs.shape
(2, 768)

Custom callable backend:

>>> import numpy as np
>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda texts: np.zeros((len(texts), 64), dtype=np.float32),
... )
>>> engine.embed(["Hello."]).shape
(1, 64)

With source-file cache:

>>> from pathlib import Path
>>> vecs, from_cache = engine.embed_with_cache(
...     texts,
...     source_path=Path("corpus.txt"),
... )
VALID_BACKENDS: ClassVar[tuple[str, ...]] = ('sentence_transformers', 'openai', 'custom')#

Accepted backend values.

For image/audio/video embeddings, use MultimodalEmbeddingEngine, which additionally supports the "clip", "whisper", and "wav2vec" backends.

backend: str = 'sentence_transformers'#
batch_size: int = 64#
cache_dir: Path | None = None#
custom_fn: Callable[[List[str]], ndarray[tuple[Any, ...], dtype[float32]]] | None = None#
device: str | None = None#
dtype[source]#

alias of float32

embed(texts)[source]#

Compute embeddings for texts without caching.

Parameters:
texts : list of str

Non-empty list of text strings. Empty strings produce zero vectors; they are not filtered here (filtering belongs in the pipeline).

Returns:
numpy.ndarray

Array of shape (len(texts), dim) with dtype self.dtype.

Raises:
ValueError

If texts is empty.

ImportError

If the required backend library is not installed.

Examples

>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda t: np.zeros((len(t), 32), dtype=np.float32),
... )
>>> engine.embed(["hello"]).shape
(1, 32)
embed_documents(documents, source_path=None)[source]#

Embed a list of CorpusDocument instances, populating doc.embedding on each returned document.

Parameters:
documents : list of CorpusDocument

Documents to embed. Each must have a non-empty text field.

source_path : pathlib.Path or None, optional

Source path for cache key. None disables caching.

Returns:
list of CorpusDocument

A list of documents with embedding fields populated (each is rebuilt via replace(); the input documents are not mutated).

Examples

>>> docs = list(reader.get_documents())
>>> docs = engine.embed_documents(docs)
>>> docs[0].has_embedding
True
embed_with_cache(texts, source_path)[source]#

Compute embeddings with file caching keyed to source_path.

Parameters:
texts : list of str

Text strings to embed.

source_path : pathlib.Path

Path to the source file that generated texts. Used to build the cache key (model name + path + mtime + len(texts)).

Returns:
embeddings : numpy.ndarray

Array of shape (len(texts), dim).

from_cache : bool

True if the result was loaded from disk cache.

Raises:
ValueError

If texts is empty.

OSError

If the cache directory cannot be created.

Examples

>>> vecs, cached = engine.embed_with_cache(texts, Path("corpus.txt"))
>>> cached  # True on second call with same inputs
False
enable_cache: bool = True#
model_name: str = 'paraphrase-multilingual-mpnet-base-v2'#
normalize: bool = True#
show_progress_bar: bool = False#