ANNoy Vector Database#

ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure are random projection trees, a set of binary trees where each non-leaf node represents a hyperplane splitting the input space into half and each leaf stores one data point. Trees are built independently and at random, so to some extent, it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half that is closest to the query and then aggregates the results. The idea is quite related to KD tree but a lot more scalable.

Source: LLM Powered Autonomous Agents <https://lilianweng.github.io/posts/2023-06-23-agent/>
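The random projection tree idea described above can be illustrated with a small pure-Python sketch. This is not Annoy's implementation: the function names (build_tree, query_tree, nearest), the leaf bucket size, and the choice of a hyperplane midway between two sampled points are assumptions made for illustration, and the search simply follows one branch per tree instead of using a priority queue across trees.

```python
import random


def build_tree(ids, points, rng, leaf_size=8):
    """Recursively split item ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ids  # leaf: a small bucket of item ids
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    # The hyperplane passes through the midpoint of the two sampled points.
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    left, right = [], []
    for i in ids:
        side = sum(n * x for n, x in zip(normal, points[i])) - offset
        (left if side <= 0 else right).append(i)
    if not left or not right:
        return ids  # degenerate split (for example duplicate points): stop here
    return (normal, offset,
            build_tree(left, points, rng, leaf_size),
            build_tree(right, points, rng, leaf_size))


def query_tree(node, q):
    """Descend to the leaf on the query's side of every hyperplane."""
    while isinstance(node, tuple):
        normal, offset, left, right = node
        side = sum(n * x for n, x in zip(normal, q)) - offset
        node = left if side <= 0 else right
    return node


def nearest(forest, points, q, k):
    """Union the candidate leaves from all trees, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(query_tree(tree, q))

    def squared_distance(i):
        return sum((x - y) ** 2 for x, y in zip(points[i], q))

    return sorted(candidates, key=squared_distance)[:k]


rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
forest = [build_tree(list(range(len(points))), points, rng) for _ in range(10)]
result = nearest(forest, points, points[0], 5)
print(result)
```

Each tree only inspects one leaf, so a single tree can miss true neighbors that fall on the other side of a split. Taking the union over several independently built trees is what recovers most of them, which is why the number of trees trades index size for accuracy.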

ANNoy helps you find similar items fast.

You give your data as vectors (arrays of numbers). Then you can search for the nearest neighbors (the most similar vectors).

This page documents the Annoy [1] user guide integration shipped with scikit-plots.

  • Low-level bindings C-API: _annoy

  • High-level Python-API: annoy


Public Python API#

This module exports:

  • Annoy: Low-level C-extension type (stable, picklable).

  • AnnoyIndex: Public alias of the low-level Annoy index.

  • Index: High-level Python wrapper subclass (stable, picklable).

Note

For backend and C-extension details, see spotify/ANNoy Vector Database (Approximate Nearest Neighbors).

High-level Python interface for the C++ Annoy backend.

This page documents annoy. It provides a stable import path and a small, user-facing API built on the low-level bindings in _annoy.

Workflow#

  1. Create an AnnoyIndex with a fixed vector length f and a metric.

  2. Add items with add_item.

  3. Build the forest with build.

  4. Save and load with save and load.

Quick start#

Examples

import random

# The upstream package or the low-level bindings can be imported instead:
# from annoy import Annoy, AnnoyIndex
# from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
from scikitplot.annoy import Annoy, AnnoyIndex, Index

random.seed(0)  # reproducible example data

f = 40  # Length of each item vector to be indexed
t = AnnoyIndex(f, 'angular')

# Add 1000 random vectors with integer ids 0..999
for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees
t.save('test.ann')

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # memory-mapped

# Ids of up to 1000 nearest neighbors of item 0, closest first
print(u.get_nns_by_item(0, 1000))

Notes#

  • Every added vector must have length f.

  • Add items before calling build.

  • Item ids are integers. Storage is allocated up to max(id) + 1.

High-level wrapper: Index#

Index is a Pythonic wrapper for Annoy-like objects.

It is designed for higher-level workflows where you want a Python object that is safe to serialize and move between processes.

Mixins used by the high-level wrapper#

The wrapper is composed from mixins (_mixins), which keeps each feature separate and explicit.

Further reading#

References#