spotify/ANNoy Vector Database (Approximate Nearest Neighbors)#
ANNoy helps you find similar items fast.
You give your data as vectors (arrays of numbers). Then you can search for the nearest neighbors (the most similar vectors).
This page documents the Annoy [1] user guide integration shipped with scikit-plots.
Python API#
ANNoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for approximate nearest-neighbor search in high-dimensional vector spaces. [1]
A vector is a list of numbers.
“Similarity search” means: find items that are close to each other.
Annoy builds an index so searching is fast.
Many AI apps convert data into vectors:
text → embeddings
images → embeddings
audio → embeddings
After that, you can do vector similarity search: you search with a query vector and get back the closest vectors. This is useful for:
semantic search (meaning-based search)
recommendations (“users like you also liked…”)
de-duplication (find near-duplicates)
Annoy is an approximate nearest neighbor (ANN) method.
It is fast for search.
It is approximate (it may not return the exact best answer every time).
You often tune index/search parameters to trade off speed vs accuracy.
If you need a full service (storage, filtering, replication, access control), you may want a vector database instead of a single in-process index.
When to use Annoy#
Use Annoy when you want:
a lightweight, local index (often inside one Python process)
fast similarity search for a static or slowly changing dataset
simple deployment (no separate database service)
If you need distributed storage, many writers, metadata filtering, or a managed service, see Vector Similarity Search and Vector Database.
TL;DR#
What it is: C++ library with Python bindings for approximate nearest-neighbor (ANN) search. [1]
Origin: Developed at Spotify (Hack Week). [1]
Index type: Forest of random projection trees. [1]
Storage: File-based indexes can be memory-mapped (mmap) and shared across processes. [1]
Tuning: Use
n_trees(build) andsearch_k(query) to trade accuracy for speed. [1]Metrics: Euclidean, Manhattan, cosine (angular), Hamming, dot (inner product). [1]
Quick start#
import random
random.seed(0)
# from annoy import AnnoyIndex
# from scikitplot.cexternals._annoy import AnnoyIndex
from scikitplot.annoy import AnnoyIndex
f = 40
t = AnnoyIndex(f, "angular")
for i in range(1000):
v = [random.gauss(0, 1) for _ in range(f)]
t.add_item(i, v)
t.build(10) # n_trees
t.save("test.ann")
u = AnnoyIndex(f, "angular")
u.load("test.ann") # memory-mapped (mmap)
print(u.get_nns_by_item(0, 10))
Workflow#
Create an
AnnoyIndexwith vector lengthfand a metric. [1]Add items with
add_item. [1]Build the forest with
build. [1]Query with
get_nns_by_itemorget_nns_by_vector. [1]Persist with
saveand load withload. [1]
Important rules#
Every added vector must have length
f.Add items before calling
build. [1]After
build, the index is used for queries. To add more items, discard the forest withunbuild, add items, and build again.
Persistence and sharing#
Save and load#
savewrites the index to a file.loadmemory-maps (mmap) the file for fast loading and sharing across processes. [1]
Prefault (optional)#
Some builds expose a prefault option for load. When enabled,
the loader may aggressively fault pages into memory. This is platform dependent. [1]
On-disk build (large datasets)#
Annoy can build an index directly into a file on disk. This is intended for datasets that are too large to fit into memory during index construction. [1]
Workflow#
Important rules#
Example#
import random
from scikitplot.annoy import AnnoyIndex
random.seed(0)
f = 40
a = AnnoyIndex(f, "angular")
a.on_disk_build("big.ann")
for i in range(1000):
v = [random.gauss(0, 1) for _ in range(f)]
a.add_item(i, v)
a.build(10)
# In another process, load the same file (mmap):
b = AnnoyIndex(f, "angular")
b.load("big.ann")
print(b.get_nns_by_item(0, 10))
Tuning#
n_trees(build time): Larger values usually improve recall but increase build time and memory usage. [1]search_k(query time): Larger values usually improve recall but make queries slower. [1]
If search_k is not provided, Annoy uses a default based on the number of trees
and the requested neighbor count. [1]
Practical tips#
Choose stable item ids#
Annoy uses non-negative integer item ids and allocates storage up to max(id) + 1. [1]
If your external ids are sparse or non-numeric, keep a separate mapping to compact ids.
Multi-process serving#
A common serving pattern is:
Build once and write an index file.
In each worker process, load the same file with
load(mmap) and query. [1]
Developer notes (C++)#
AnnoyIndex is not copyable (it contains atomic state). In C++14, avoid copy
initialization from temporaries. Prefer direct initialization:
using Index = Annoy::AnnoyIndex<int, double, Annoy::Angular, Annoy::Kiss32Random,
Annoy::AnnoyIndexSingleThreadedBuildPolicy>;
Index t(f); // C++14 compatible
See also#
See also
pickle(Python standard library)Alternative ANN libraries: - nmslib/hnswlib - spotify/voyager
References#
Glossary#
- Vector
An array of numbers (often an embedding).
- Embedding
A vector representation created by a model (text/image/audio → numbers).
- Vector similarity search
Find vectors in a dataset that are most similar to a query vector.
- Vector database
A system that stores, manages, and indexes high-dimensional vector data for low-latency similarity queries.