Annoy#
- class scikitplot.cexternals._annoy.Annoy#
Compiled with GCC/Clang. Using 512-bit AVX instructions.
High-performance approximate nearest neighbours (Annoy) C++ core.
This module is a low-level backend (annoylib). It exposes the C++-powered Annoy type. For day-to-day work, prefer the high-level Python API in the annoy package:
from annoy import Annoy, AnnoyIndex
- add_item(i, vector)#
Add a single embedding vector to the index.
- Parameters:
- i : int
Integer identifier (index) for this row. Must be non-negative. Annoy internally allocates space for max(i) + 1 items.
- vector : sequence of float
1D embedding of length f. If f == 0 on the first call, the dimension is inferred from vector and fixed for the lifetime of the index.
- Returns:
- self : Annoy
The index itself, allowing chained calls, e.g.:
index.add_item(0, v0).add_item(1, v1)
Notes
Items must be added before calling build. After the forest is built, further calls to add_item are not supported.
Examples
>>> index.add_item(0, [1.0, 0.0, 0.0])
>>> index.add_item(1, [0.0, 1.0, 0.0])
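The two calls above generalise to loading many rows. The following is a minimal sketch rather than library code: add_rows is a hypothetical helper, and a unittest.mock.MagicMock stands in for a real index so the snippet runs without the compiled extension. It relies only on the documented add_item(i, vector) signature and on the stated constraints (non-negative ids, a fixed vector length).

```python
from unittest.mock import MagicMock


def add_rows(index, rows):
    """Add (i, vector) pairs to `index`, checking the documented constraints.

    `add_rows` is a hypothetical helper, not part of the Annoy API.
    """
    dim = None
    for i, vector in rows:
        if i < 0:
            raise ValueError(f"item id must be non-negative, got {i}")
        if dim is None:
            dim = len(vector)  # the first vector fixes the expected length
        elif len(vector) != dim:
            raise ValueError(f"expected length {dim}, got {len(vector)}")
        index.add_item(i, vector)
    return index


# Stand-in for a real index object; swap in an actual Annoy instance.
index = MagicMock()
add_rows(index, [(0, [1.0, 0.0, 0.0]), (1, [0.0, 1.0, 0.0])])
print(index.add_item.call_count)
```

Validating in Python before the call keeps error messages readable; the C++ core may well perform its own checks.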
- build(n_trees, n_jobs=-1)#
Build a forest of random projection trees.
- Parameters:
- n_trees : int
Number of trees in the forest. Larger values typically improve recall at the cost of slower build and query time.
- n_jobs : int, optional (default=-1)
Number of threads to use while building. -1 means “use all available CPU cores”.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
After build completes, the index becomes read-only for queries. To add more items, call unbuild, add items again, and then rebuild.
References
Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.
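To give a feel for what one node of a random projection tree does, here is a simplified pure-Python sketch. It splits a point set by the hyperplane equidistant from two randomly sampled points, which is the idea described in the upstream Annoy README. The function name split_by_random_hyperplane is hypothetical, and the real C++ core is assumed to be more elaborate (recursion, leaf sizes, metric-specific handling).

```python
import random


def split_by_random_hyperplane(points, rng):
    """Split `points` by the hyperplane equidistant from two sampled points."""
    a, b = rng.sample(points, 2)
    # The normal points from a to b; the midpoint of a and b lies on the plane.
    normal = [bj - aj for aj, bj in zip(a, b)]
    midpoint = [(aj + bj) / 2.0 for aj, bj in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))

    left, right = [], []
    for p in points:
        side = sum(n * x for n, x in zip(normal, p)) - offset
        (right if side > 0 else left).append(p)
    return left, right


rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20)]
left, right = split_by_random_hyperplane(points, rng)
print(len(left), len(right))
```

A tree applies such splits recursively, and a forest repeats the process n_trees times with different random choices, which is why more trees tend to improve recall.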
- deserialize(byte, prefault=False)#
Restore the index from a serialized byte string.
- Parameters:
- byte : bytes
Byte string produced by serialize.
- prefault : bool, optional (default=False)
If True, fault pages into memory while restoring.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
- f#
Number of features (vector dimension).
- get_distance(i, j) → float#
Return the distance between two stored items.
- Parameters:
- i, j : int
Row identifiers (indices) of two stored samples.
- Returns:
- d : float
Distance between items i and j under the current metric (angular, euclidean, manhattan, dot or hamming).
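For reference, three of the listed metrics can be written out in a few lines of plain Python. This is a sketch for checking values by hand, not the backend's implementation. The angular formula, sqrt(2 * (1 - cos(u, v))), is the one given in the upstream Annoy README; that this backend matches it exactly is an assumption. The dot and hamming metrics are left out.

```python
import math


def angular(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance of the normalized vectors.

    Raises ZeroDivisionError if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


print(angular([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors: sqrt(2)
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
```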
- get_item_vector(i) → list[float]#
Return the stored embedding vector for a given index.
- Parameters:
- i : int
Row identifier (index) previously passed to add_item.
- Returns:
- vector : list[float]
Stored embedding of length f.
- get_n_items() → int#
Return the number of stored samples (rows) in the index.
- Returns:
- n_items : int
Number of items that have been added and are currently addressable by nearest-neighbour queries.
- get_n_trees() → int#
Return the number of trees in the current forest.
- Returns:
- n_trees : int
Number of trees that have been built.
- get_nns_by_item(i, n, search_k=-1, include_distances=True)#
Return the n nearest neighbours for a stored sample.
- Parameters:
- i : int
Row identifier (index) previously passed to add_item(i, embedding).
- n : int
Number of nearest neighbours to return.
- search_k : int, optional (default=-1)
Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional (default=True)
If True, return a (indices, distances) tuple, otherwise only the list of indices.
- Returns:
- indices : list[int]
Nearest neighbour indices (item ids).
- distances : list[float]
Corresponding distances (only if include_distances=True).
See also
get_nns_by_item: Alias for this method for backward compatibility.
get_nns_by_vector: Query with an explicit query embedding.
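The search_k default can be made concrete with a little arithmetic. The helper below, effective_search_k, is hypothetical and only mirrors the documented rule; since the text says the default is approximately n_trees * n, the result should be read as an estimate of the node budget, not an exact figure.

```python
def effective_search_k(search_k, n_trees, n):
    """Return the node budget implied by the documented default.

    Hypothetical helper: -1 means "use roughly n_trees * n".
    """
    if search_k == -1:
        return n_trees * n
    return search_k


print(effective_search_k(-1, n_trees=10, n=5))
print(effective_search_k(200, n_trees=10, n=5))
```

With 10 trees and n=5 the default budget is about 50 nodes, so passing search_k=200 asks for roughly four times as much work per query in exchange for better recall.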
- get_nns_by_vector(vector, n, search_k=-1, include_distances=True)#
Return the n nearest neighbours for a query embedding.
- Parameters:
- vector : sequence of float
Query embedding of length f.
- n : int
Number of nearest neighbours to return.
- search_k : int, optional (default=-1)
Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional (default=True)
If True, return a (indices, distances) tuple, otherwise only the list of indices.
- Returns:
- indices : list[int]
Nearest neighbour indices (item ids).
- distances : list[float]
Corresponding distances (only if include_distances=True).
See also
get_nns_by_item: Query by stored sample id.
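Because the results are approximate, it can help to compare them against an exact linear scan on a small sample. The sketch below is independent of the library: exact_nns and recall are hypothetical helpers, Euclidean distance is assumed, and the approximate ids in the last line are made up for illustration.

```python
import math


def exact_nns(vectors, query, n):
    """Exact n nearest neighbours by Euclidean distance (linear scan)."""
    scored = sorted((math.dist(v, query), i) for i, v in enumerate(vectors))
    ids = [i for _, i in scored[:n]]
    dists = [d for d, _ in scored[:n]]
    return ids, dists


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
exact_ids, exact_dists = exact_nns(vectors, [0.9, 0.0], n=2)
print(exact_ids)                  # [1, 0]
print(recall([1, 2], exact_ids))  # 0.5
```

In practice the first argument to recall would be the indices returned by get_nns_by_vector for the same query and n; raising n_trees or search_k should move the value towards 1.0.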
- info() → dict#
Return a structured summary of the index.
This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.
- Returns:
- info : dict
Dictionary describing the current index state.
Examples
>>> info = idx.info()
>>> info['dimension']
100
>>> info['n_items']
1000
- load(fn, prefault=False)#
Load (mmap) an index from disk into the current object.
- Parameters:
- fn : str
Path to a file previously created by save or on_disk_build.
- prefault : bool, optional (default=False)
If True, fault pages into memory when the file is mapped.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
- memory_usage() → int#
Approximate memory usage of the index in bytes.
- Returns:
- n_byte : int
Approximate number of bytes used by the index. When native support is unavailable, this is estimated via serialize.
- metric#
Distance metric (angular, euclidean, manhattan, dot, hamming).
- on_disk_build(fn)#
Configure the index to build using an on-disk backing file.
- Parameters:
- fn : str
Path to a file that will hold the index during build. The file is created or overwritten as needed.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
This mode is useful for very large datasets that do not fit comfortably in RAM during construction.
- save(fn, prefault=False)#
Persist the index to a binary file on disk.
- Parameters:
- fn : str
Path to the output file. Existing files will be overwritten.
- prefault : bool, optional (default=False)
If True, aggressively fault pages into memory during save. Primarily useful on some platforms for large indexes.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
- serialize() → bytes#
Serialize the full in-memory index into a byte string.
- Returns:
- byte : bytes
Opaque binary blob containing the entire Annoy index.
See also
deserialize: Restore an index from bytes.
Notes
The serialized form is a snapshot of the internal C++ data structures. It can be stored, transmitted or used with joblib without rebuilding trees.
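One way to store the blob is to write it to a file and hand it back to deserialize later. The sketch below assumes only the documented serialize() and deserialize(byte, prefault=False) signatures, and that prefault may be passed by keyword. dump_index and restore_index are hypothetical helpers, and MagicMock objects with a made-up blob stand in for real indexes so the snippet runs anywhere.

```python
import os
import tempfile
from unittest.mock import MagicMock


def dump_index(index, path):
    """Write the bytes returned by index.serialize() to `path`."""
    with open(path, "wb") as fh:
        fh.write(index.serialize())


def restore_index(empty_index, path, prefault=False):
    """Read bytes from `path` and pass them to empty_index.deserialize()."""
    with open(path, "rb") as fh:
        blob = fh.read()
    return empty_index.deserialize(blob, prefault=prefault)


# Stand-ins for real index objects; the blob content here is made up.
source = MagicMock()
source.serialize.return_value = b"opaque-index-bytes"
target = MagicMock()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.bin")
    dump_index(source, path)
    restore_index(target, path)

target.deserialize.assert_called_once_with(b"opaque-index-bytes", prefault=False)
```

For indexes that should be memory-mapped rather than copied into RAM, save and load are the documented alternative.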
- set_seed(seed=None)#
Set the random seed used for tree construction.
- Parameters:
- seed : int, optional
Non-negative integer seed. If omitted, a library-specific default is used. For strict reproducibility, always call this method explicitly before build.
- Returns:
- None
This method returns None.
Notes
Using the same seed, data and n_trees usually produces bitwise-identical forests (subject to CPU / threading details).
- unbuild()#
Discard the current forest, allowing new items to be added.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
After calling unbuild, you must call build again before running nearest-neighbour queries.
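The unbuild, add, rebuild cycle can be wrapped in a small helper. This is a sketch under the assumption that the methods behave as documented on this page; extend_index is a hypothetical name, and a MagicMock stands in for a built index so that the call order can be shown without the compiled extension.

```python
from unittest.mock import MagicMock


def extend_index(index, new_rows, n_trees):
    """Discard the forest, add (i, vector) rows, then rebuild."""
    index.unbuild()
    for i, vector in new_rows:
        index.add_item(i, vector)
    index.build(n_trees)
    return index


index = MagicMock()  # stand-in for a built Annoy index
extend_index(index, [(2, [0.0, 0.0, 1.0])], n_trees=10)
print([call[0] for call in index.mock_calls])
```

Rebuilding constructs the whole forest again, so batching new rows into one cycle is likely cheaper than repeating it per item.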
- unload()#
Unmap any memory-mapped file backing this index.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
This releases OS-level resources associated with the mmap, but keeps the Python object alive.
- verbose(level=1)#
Control verbosity of the underlying C++ index.
- Parameters:
- level : int, optional (default=1)
Logging level inspired by gradient-boosting libraries:
<= 0: quiet (warnings only)
1: info (Annoy’s verbose=True)
>= 2: debug (currently same as info, reserved for future use)
- Returns:
- self : Annoy
The index itself, allowing chained calls.