Annoy

class scikitplot.cexternals._annoy.Annoy

Compiled with GCC/Clang. Using 512-bit AVX instructions.

High-performance approximate nearest neighbours (Annoy) C++ core.

This module is a low-level backend (annoylib). It exposes the C++-powered Annoy type. For day-to-day work, prefer the high-level Python API in the annoy package:

from annoy import Annoy, AnnoyIndex

add_item(i, vector)

Add a single embedding vector to the index.

Parameters:

i (int): Integer identifier (index) for this row. Must be non-negative. Annoy internally allocates space for max(i) + 1 items.
vector (sequence of float): 1D embedding of length f. If f == 0 on the first call, the dimension is inferred from vector and fixed for the lifetime of the index.

Returns:

self (Annoy): The index itself, allowing chained calls, e.g.:

index.add_item(0, v0).add_item(1, v1)

Notes

Items must be added before calling build. Once the forest is built, add_item is not supported until the forest is discarded with unbuild.

Examples

>>> index.add_item(0, [1.0, 0.0, 0.0])
>>> index.add_item(1, [0.0, 1.0, 0.0])
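
The add_item contract described above (non-negative ids, dimension inferred on the first call when f == 0, and self returned for chaining) can be sketched with a small pure-Python stand-in. TinyIndex and its n_slots helper are hypothetical names used only for illustration; this is not the Annoy implementation and it stores nothing in a tree.

```python
class TinyIndex:
    """Hypothetical pure-Python stand-in; NOT the Annoy C++ type."""

    def __init__(self, f=0):
        self.f = f        # 0 means "infer the dimension from the first vector"
        self._rows = {}   # item id -> stored vector

    def add_item(self, i, vector):
        if i < 0:
            raise ValueError("item id must be non-negative")
        vector = [float(x) for x in vector]
        if self.f == 0:
            self.f = len(vector)  # dimension is fixed from here on
        elif len(vector) != self.f:
            raise ValueError(f"expected length {self.f}, got {len(vector)}")
        self._rows[i] = vector
        return self  # returning self is what makes chained calls possible

    def n_slots(self):
        # Ids are addressed up to max(i) + 1, even when they are sparse.
        return max(self._rows) + 1 if self._rows else 0


index = TinyIndex().add_item(0, [1.0, 0.0, 0.0]).add_item(5, [0.0, 1.0, 0.0])
print(index.f, index.n_slots())
```

Because ids are addressed up to max(i) + 1, sparse ids such as 0 and 5 still reserve six slots in this sketch, which is one reason to prefer dense, consecutive ids.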

build(n_trees, n_jobs=-1)

Build a forest of random projection trees.

Parameters:

n_trees (int): Number of trees in the forest. Larger values typically improve recall at the cost of slower build and query time.
n_jobs (int, optional, default=-1): Number of threads to use while building. -1 means “use all available CPU cores”.

Returns:

self (Annoy): The index itself, allowing chained calls.

Notes

After build completes, the index is read-only and ready for queries. To add more items, call unbuild, add the new items, and then call build again.

References

Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.
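
To give some intuition for what one node of a random projection tree does, the following is a simplified sketch of a single split: two items are picked at random and every item is assigned to a side of the hyperplane that lies halfway between them. The function name split_once is hypothetical, and the C++ core may choose its split points differently in detail; this only illustrates the idea.

```python
import random


def split_once(points, rng):
    """Split item ids by the hyperplane halfway between two random items."""
    a, b = rng.sample(sorted(points), 2)
    pa, pb = points[a], points[b]
    normal = [x - y for x, y in zip(pa, pb)]            # direction from b to a
    midpoint = [(x + y) / 2.0 for x, y in zip(pa, pb)]  # a point on the plane
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left, right = [], []
    for i, p in points.items():
        side = sum(n * x for n, x in zip(normal, p)) - offset
        (left if side > 0 else right).append(i)
    return left, right


rng = random.Random(0)
points = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(100)}
left, right = split_once(points, rng)
print(len(left), len(right))
```

A tree applies such splits recursively, and each tree uses different random choices. Items that one tree separates are often kept together by another, which is why a larger n_trees tends to improve recall.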

deserialize(byte, prefault=False)

Restore the index from a serialized byte string.

Parameters:

byte (bytes): Byte string produced by serialize.
prefault (bool, optional, default=False): If True, fault pages into memory while restoring.

Returns:

self (Annoy): The index itself, allowing chained calls.

f: Number of features (vector dimension).

get_distance(i, j) → float

Return the distance between two stored items.

Parameters:

i, j (int): Row identifiers (indices) of two stored samples.

Returns:

d (float): Distance between items i and j under the current metric (angular, euclidean, manhattan, dot or hamming).
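
As a reference point for the metric names, here are plain-Python versions of four of them. They assume the conventions of upstream Annoy, where the angular distance is sqrt(2 * (1 - cos(u, v))); whether this backend matches them exactly is an assumption. The dot metric is left out because its sign convention is not stated here, and the angular helper assumes non-zero vectors.

```python
import math


def angular(u, v):
    """sqrt(2 * (1 - cos(u, v))); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


u, v = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(angular(u, v), euclidean(u, v), manhattan(u, v), hamming(u, v))
```

For these two orthogonal unit vectors the angular and Euclidean values coincide at sqrt(2), which is expected because the angular form equals the Euclidean distance between normalised vectors.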

get_item_vector(i) → list[float]

Return the stored embedding vector for a given index.

Parameters:

i (int): Row identifier (index) previously passed to add_item.

Returns:

vector (list[float]): Stored embedding of length f.

get_n_items() → int

Return the number of stored samples (rows) in the index.

Returns:

n_items (int): Number of items that have been added and are currently addressable by nearest-neighbour queries.

get_n_trees() → int

Return the number of trees in the current forest.

Returns:

n_trees (int): Number of trees that have been built.

get_nns_by_item(i, n, search_k=-1, include_distances=True)

Return the n nearest neighbours for a stored sample.

Parameters:

i (int): Row identifier (index) previously passed to add_item(i, embedding).
n (int): Number of nearest neighbours to return.
search_k (int, optional, default=-1): Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
include_distances (bool, optional, default=True): If True, return an (indices, distances) tuple; otherwise return only the list of indices.

Returns:

indices (list[int]): Nearest neighbour indices (item ids).
distances (list[float]): Corresponding distances (only if include_distances=True).

See also

get_nns_by_item: Query by stored sample id.
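
Because the results are approximate, it helps to have an exact baseline when tuning n_trees and search_k. The sketch below is a brute-force scan over a plain list of vectors using Euclidean distance, plus a recall helper. Both function names are hypothetical and nothing here calls the Annoy backend; the ids returned by get_nns_by_item would be passed in as approx_ids.

```python
import math
import random


def exact_neighbours(vectors, i, n):
    """Exact n nearest neighbours of item i by a full linear scan."""
    query = vectors[i]
    scored = sorted((math.dist(query, v), j) for j, v in enumerate(vectors))
    top = scored[:n]
    return [j for _, j in top], [d for d, _ in top]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(42)
vectors = [[rng.random() for _ in range(16)] for _ in range(200)]
ids, dists = exact_neighbours(vectors, 0, 5)
print(ids[0], dists[0])  # the query item itself comes first, at distance 0.0
```

If recall against this baseline is too low, raising search_k at query time is usually cheaper to try than rebuilding with more trees, since it does not require a new build.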

info() → dict

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Returns:

info (dict): Dictionary describing the current index state.

Examples

>>> info = idx.info()
>>> info['dimension']
100
>>> info['n_items']
1000

load(fn, prefault=False)

Load (mmap) an index from disk into the current object.

Parameters:

fn (str): Path to a file previously created by save or on_disk_build.
prefault (bool, optional, default=False): If True, fault pages into memory when the file is mapped.

Returns:

self (Annoy): The index itself, allowing chained calls.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
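
The reason the dimension has to match can be seen in a small memory-mapping sketch that uses only the standard library. The file layout here (consecutive little-endian float32 rows) is invented for illustration and is not Annoy's on-disk format; the point is that a mapped file is raw bytes, so the reader must already know f to locate and decode a row.

```python
import mmap
import os
import struct
import tempfile

f = 3  # vector dimension; must be known to interpret the bytes
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Write the rows as consecutive little-endian float32 values.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as fh:
    for row in rows:
        fh.write(struct.pack(f"<{f}f", *row))


def read_row(mapped, i, f):
    """Decode row i directly from the mapped bytes (4 bytes per float32)."""
    return list(struct.unpack_from(f"<{f}f", mapped, i * 4 * f))


with open(path, "rb") as fh:
    mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    second = read_row(mapped, 1, f)
    mapped.close()

print(second)  # [0.0, 1.0, 0.0]
```

Mapping defers the actual disk reads until a page is first touched, which is the cost that a prefault option is meant to pay up front.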

memory_usage() → int

Approximate memory usage of the index in bytes.

Returns:

n_byte (int): Approximate number of bytes used by the index. When native support is unavailable, this is estimated via serialize.

metric: Distance metric (angular, euclidean, manhattan, dot, hamming).

on_disk_build(fn)

Configure the index to build using an on-disk backing file.

Parameters:

fn (str): Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:

self (Annoy): The index itself, allowing chained calls.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

save(fn, prefault=False)

Persist the index to a binary file on disk.

Parameters:

fn (str): Path to the output file. Existing files will be overwritten.
prefault (bool, optional, default=False): If True, aggressively fault pages into memory during save. Primarily useful on some platforms for large indexes.

Returns:

self (Annoy): The index itself, allowing chained calls.

serialize() → bytes

Serialize the full in-memory index into a byte string.

Returns:

byte (bytes): Opaque binary blob containing the entire Annoy index.
