ANNoy Vector Database#
ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure are random projection trees,
a set of binary trees where each non-leaf node represents a hyperplane splitting the input space into half
and each leaf stores one data point. Trees are built independently and at random, so to some extent,
it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half
that is closest to the query and then aggregates the results.
The idea is quite related to KD tree but a lot more scalable.
LLM Powered Autonomous Agents <https://lilianweng.github.io/posts/2023-06-23-agent/>
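The description above can be sketched in pure Python. This is a simplified illustration, not Annoy's actual implementation: the helper names (`build_tree`, `query_tree`, `build_forest`, `search`) and the `leaf_size` parameter are assumptions made for this example, leaves hold a few points rather than exactly one, and each tree is followed down a single branch instead of being explored with a priority queue.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_tree(ids, points, rng, leaf_size=8):
    """Recursively split `ids` with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    # Pick two random points; the hyperplane is equidistant from both.
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2 for x, y in zip(a, b)]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, points[i]) < offset]
    right = [i for i in ids if dot(normal, points[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return ("leaf", ids)
    return (
        "node", normal, offset,
        build_tree(left, points, rng, leaf_size),
        build_tree(right, points, rng, leaf_size),
    )


def query_tree(tree, q):
    """Descend to the leaf on the same side of every hyperplane as `q`."""
    while tree[0] == "node":
        _, normal, offset, left, right = tree
        tree = left if dot(normal, q) < offset else right
    return tree[1]


def build_forest(points, n_trees, seed=0, leaf_size=8):
    """Build `n_trees` independent random trees over all points."""
    rng = random.Random(seed)
    ids = list(range(len(points)))
    return [build_tree(ids, points, rng, leaf_size) for _ in range(n_trees)]


def search(forest, points, q, k):
    """Union the candidate leaves from all trees, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(query_tree(tree, q))
    return sorted(candidates, key=lambda i: sq_dist(points[i], q))[:k]


rng = random.Random(42)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
forest = build_forest(points, n_trees=10)
approx = search(forest, points, points[0], k=5)
```

More trees enlarge the candidate set and so tend to improve recall, at the cost of more memory and a slower query.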
ANNoy helps you find similar items fast.
You give your data as vectors (arrays of numbers). Then you can search for the nearest neighbors (the most similar vectors).
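To make "nearest neighbors" concrete, here is a minimal exact (brute-force) search in pure Python. It is the baseline that an approximate index trades accuracy against, not how Annoy itself searches. The function names are hypothetical, and the distance assumes the usual definition of Annoy's angular metric as the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos)).

```python
import math


def angular_distance(u, v):
    """Euclidean distance between the normalized vectors: sqrt(2 * (1 - cos))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)  # assumes neither vector is all zeros
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def nearest_neighbors(items, query, n):
    """Return the ids of the `n` items closest to `query`, closest first."""
    return sorted(items, key=lambda i: angular_distance(items[i], query))[:n]


# Toy data: item id -> vector.
items = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
result = nearest_neighbors(items, [1.0, 0.05], n=2)
```

This scans every item on each query, so its cost grows linearly with the number of items; an approximate index avoids that scan.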
This page documents the Annoy [1] user guide integration shipped with scikit-plots.
Public Python API#
This module exports:
- Annoy: Low-level C-extension type (stable, picklable).
- AnnoyIndex: Public alias of the low-level Annoy index.
- Index: High-level Python wrapper subclass (stable, picklable).
Note
For backend and C-extension details, see spotify/ANNoy Vector Database (Approximate Nearest Neighbors).
High-level Python interface for the C++ Annoy backend.
This page documents annoy. It provides a stable import path
and a small, user-facing API built on the low-level bindings in
_annoy.
Workflow#
1. Create an AnnoyIndex with a fixed vector length f and a metric.
2. Add items with add_item.
3. Build the forest with build.
4. Save and load with save and load.
Quick start#
Examples
import random; random.seed(0)
# from annoy import Annoy, AnnoyIndex
# from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
from scikitplot.annoy import Annoy, AnnoyIndex, Index
f = 40 # Length of item vector that will be indexed
t = AnnoyIndex(f, 'angular')
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10) # 10 trees
t.save('test.ann')
u = AnnoyIndex(f, 'angular')
u.load('test.ann') # memory-mapped
print(u.get_nns_by_item(0, 1000))
Notes#
- Every added vector must have length f.
- Add items before calling build.
- Item ids are integers. Storage is allocated up to max(id) + 1.
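Because storage is allocated up to max(id) + 1, sparse ids (for example database keys) can reserve far more slots than there are items. One common workaround, sketched below, is to remap the raw ids to contiguous ids before adding items and to translate back after a query. The helper `densify_ids` is hypothetical and not part of this module.

```python
def densify_ids(raw_ids):
    """Map arbitrary non-negative integer ids to contiguous ids 0..n-1."""
    to_dense = {raw: dense for dense, raw in enumerate(sorted(set(raw_ids)))}
    to_raw = {dense: raw for raw, dense in to_dense.items()}
    return to_dense, to_raw


raw_ids = [7, 1_000_000, 42]

# Slots reserved if the raw ids were used directly as item ids.
slots_sparse = max(raw_ids) + 1

# Slots reserved after remapping to dense ids.
to_dense, to_raw = densify_ids(raw_ids)
slots_dense = max(to_dense.values()) + 1
```

Keep `to_raw` alongside the saved index so that neighbor ids returned by a query can be mapped back to the original keys.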
High-level wrapper: Index#
Index is a Pythonic wrapper for Annoy-like objects.
It is designed for higher-level workflows where you want a Python object that is safe to serialize and move between processes.
Mixins used by the high-level wrapper#
The wrapper is composed from mixins (see _mixins) so that each feature stays separate and explicit.
Further reading#
See also
Nearest neighbor search (background): https://en.wikipedia.org/wiki/Nearest_neighbor_search
https://www.researchgate.net/publication/363234433_Analysis_of_Image_Similarity_Using_CNN_and_ANNOY
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf
https://link.springer.com/chapter/10.1007/978-981-97-7831-7_2