EmbeddingEngine#

class scikitplot.corpus.EmbeddingEngine(model_name='paraphrase-multilingual-mpnet-base-v2', backend='sentence_transformers', custom_fn=None, cache_dir=None, enable_cache=True, batch_size=64, normalize=True, dtype=<class 'numpy.float32'>, show_progress_bar=False, device=None)[source]#

Multi-backend sentence embedding engine with SHA-256 file caching.

Produces a 2-D float32 numpy array of shape (n_texts, dim) for a list of input strings. Embeddings are cached to .npy files keyed by (model_name, source_path, mtime, n_texts), so unchanged corpora are served from disk without re-encoding.

Parameters:
model_name : str, optional

Embedding model identifier. Interpretation depends on backend. For sentence_transformers, any HuggingFace model name. For openai, any OpenAI embedding model name. Ignored when backend="custom". Default: "paraphrase-multilingual-mpnet-base-v2".

backend : {"sentence_transformers", "openai", "custom"}, optional

Which embedding backend to use. Default: "sentence_transformers".

custom_fn : callable or None, optional

User-supplied Callable[[list[str]], np.ndarray]. Required when backend="custom". Ignored otherwise.

cache_dir : pathlib.Path or None, optional

Directory for .npy cache files. Created if absent. None uses ~/.cache/scikitplot/embeddings. Pass pathlib.Path(os.devnull) to disable caching.

enable_cache : bool, optional

Set to False to disable file caching entirely (embeddings are always recomputed). Default: True.

batch_size : int, optional

Number of texts per encoding batch. Relevant for sentence_transformers and openai backends. Default: 64.

normalize : bool, optional

L2-normalise output vectors to unit norm (required for cosine / inner-product similarity search). Default: True.

dtype : numpy.dtype, optional

Output dtype. Default: numpy.float32.

show_progress_bar : bool, optional

Show a tqdm progress bar during encoding (sentence_transformers only). Default: False.

device : str or None, optional

PyTorch device for sentence_transformers ("cpu", "cuda", "mps"). None lets the library choose. Default: None.

Attributes:
VALID_BACKENDS : tuple of str

Class variable. All accepted backend names.

Raises:
ValueError

If backend="custom" but custom_fn is None.

ValueError

If batch_size or dtype is invalid.

ImportError

At call time if the required backend library is not installed.

See also

scikitplot.corpus.pipeline.CorpusPipeline

Integrates this engine.

scikitplot.corpus._embeddings._multimodal_embedding.MultimodalEmbeddingEngine

Extends this engine with image, audio, and video modalities plus projection layer and LLM training export.

Notes

Thread safety: The internal model cache (_embed_fn) is initialised lazily and protected by a threading.Lock.
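
The lazy, lock-protected initialisation described above follows a common pattern that can be sketched roughly as follows. This is an illustrative sketch, not the library's actual code; LazyEmbedder and _load_model are hypothetical names:

```python
import threading


class LazyEmbedder:
    """Illustrative sketch of lazy, lock-protected model initialisation."""

    def __init__(self):
        self._embed_fn = None          # built on first use
        self._lock = threading.Lock()  # guards the one-time initialisation

    def _get_embed_fn(self):
        # Double-checked locking: the fast path skips the lock entirely
        # once the embedding function has been built.
        if self._embed_fn is None:
            with self._lock:
                if self._embed_fn is None:
                    self._embed_fn = self._load_model()
        return self._embed_fn

    def _load_model(self):
        # Hypothetical expensive model load, stubbed out here.
        return lambda texts: [t.lower() for t in texts]
```

The double check (before and after acquiring the lock) ensures the model is loaded at most once even when several threads race on first use.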

Cache invalidation: The cache key includes the source file’s modification time. Any write to the source file (even a metadata update via touch) invalidates the cache. If this is undesirable, pass a stable source_path (e.g. a logical identifier rather than the real path).
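
An mtime-sensitive key of the kind described above might be built like this. This is a sketch of the documented scheme (model name + path + mtime + text count hashed with SHA-256); the actual key layout is internal to the library:

```python
import hashlib
from pathlib import Path


def cache_key(model_name: str, source_path: Path, n_texts: int) -> str:
    """Sketch of an mtime-sensitive SHA-256 cache key."""
    # Any write to the file (even a metadata-only `touch`) changes mtime,
    # which changes the key and thus invalidates the cached .npy file.
    mtime = source_path.stat().st_mtime_ns
    payload = f"{model_name}|{source_path.resolve()}|{mtime}|{n_texts}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key hashes the resolved path, passing a stable logical identifier instead of the real file path keeps the key constant across touches.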

Normalisation: When normalize=True, zero-norm vectors (e.g. empty-string inputs) are left as zero vectors rather than producing NaN. The normalisation guard in SimilarityIndex will warn if any zero vectors are detected at search time.
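
The zero-norm guard described above amounts to dividing by the row norm only where that norm is positive. A minimal numpy sketch of this behaviour (illustrative, not the library's internal code):

```python
import numpy as np


def safe_l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """L2-normalise rows; zero-norm rows stay zero instead of becoming NaN."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # dividing a zero row by 1 leaves it zero
    return (vecs / norms).astype(vecs.dtype)
```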

Examples

Default usage (sentence_transformers):

>>> engine = EmbeddingEngine()
>>> texts = ["Hello world.", "Second sentence."]
>>> vecs = engine.embed(texts)
>>> vecs.shape
(2, 768)

Custom callable backend:

>>> import numpy as np
>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda texts: np.zeros((len(texts), 64), dtype=np.float32),
... )
>>> engine.embed(["Hello."]).shape
(1, 64)

With source-file cache:

>>> from pathlib import Path
>>> vecs, from_cache = engine.embed_with_cache(
...     texts,
...     source_path=Path("corpus.txt"),
... )
VALID_BACKENDS: ClassVar[tuple[str, ...]] = ('sentence_transformers', 'openai', 'custom')#

Accepted backend values.

For image/audio/video embeddings, use MultimodalEmbeddingEngine, which additionally supports the "clip", "whisper", and "wav2vec" backends.

backend: str = 'sentence_transformers'#
batch_size: int = 64#
cache_dir: Path | None = None#
custom_fn: Callable[[List[str]], ndarray[tuple[Any, ...], dtype[float32]]] | None = None#
device: str | None = None#
dtype[source]#

alias of float32

embed(texts)[source]#

Compute embeddings for texts without caching.

Parameters:
texts : list of str

Non-empty list of text strings. Empty strings produce zero vectors; they are not filtered here (filtering belongs in the pipeline).

Returns:
numpy.ndarray

Array of shape (len(texts), dim) with dtype self.dtype.

Raises:
ValueError

If texts is empty.

ImportError

If the required backend library is not installed.

Examples

>>> engine = EmbeddingEngine(
...     backend="custom",
...     custom_fn=lambda t: np.zeros((len(t), 32), dtype=np.float32),
... )
>>> engine.embed(["hello"]).shape
(1, 32)
embed_documents(documents, source_path=None)[source]#

Embed a list of CorpusDocument instances, populating doc.embedding on each returned document.

Parameters:
documents : list of CorpusDocument

Documents to embed. Each must have a non-empty text field.

source_path : pathlib.Path or None, optional

Source path for cache key. None disables caching.

Returns:
list of CorpusDocument

A list of documents with embedding fields populated (each is rebuilt via replace(); the input documents are not mutated).

Examples

>>> docs = list(reader.get_documents())
>>> docs = engine.embed_documents(docs)
>>> docs[0].has_embedding
True
embed_with_cache(texts, source_path)[source]#

Compute embeddings with file caching keyed to source_path.

Parameters:
texts : list of str

Text strings to embed.

source_path : pathlib.Path

Path to the source file that generated texts. Used to build the cache key (model name + path + mtime + len(texts)).

Returns:
embeddings : numpy.ndarray

Array of shape (len(texts), dim).

from_cache : bool

True if the result was loaded from disk cache.

Raises:
ValueError

If texts is empty.

OSError

If the cache directory cannot be created.

Examples

>>> vecs, cached = engine.embed_with_cache(texts, Path("corpus.txt"))
>>> cached  # True on second call with same inputs
False
enable_cache: bool = True#
model_name: str = 'paraphrase-multilingual-mpnet-base-v2'#
normalize: bool = True#
show_progress_bar: bool = False#