spotify/ANNoy (experimental)#
This page documents the Annoy [0] integration shipped with scikit-plots.
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for approximate nearest-neighbor search in high-dimensional vector spaces. [1]
TL;DR#
What it is: C++ library with Python bindings for approximate nearest-neighbor (ANN) search. [1]
Origin: Developed at Spotify (Hack Week). [1]
Since: Open sourced in 2013. [3]
Index type: Forest of random projection trees. [1]
Storage: File-based indexes can be memory-mapped (mmap) and shared across processes. [1]
Tuning: Use
n_trees(build) andsearch_k(query) to trade accuracy for speed. [1]Metrics: Euclidean, Manhattan, cosine (angular), Hamming, dot (inner product). [1]
Quick start#
import random
random.seed(0)
# from annoy import AnnoyIndex
# from scikitplot.cexternals._annoy import AnnoyIndex
from scikitplot.annoy import AnnoyIndex
f = 40
t = AnnoyIndex(f, "angular")
for i in range(1000):
v = [random.gauss(0, 1) for _ in range(f)]
t.add_item(i, v)
t.build(10) # n_trees
t.save("test.ann")
u = AnnoyIndex(f, "angular")
u.load("test.ann") # memory-mapped (mmap)
print(u.get_nns_by_item(0, 10))
Workflow#
Create an
AnnoyIndexwith vector lengthfand a metric. [1]Add items with
add_item. [1]Build the forest with
build. [1]Query with
get_nns_by_itemorget_nns_by_vector. [1]Persist with
saveand load withload. [1]
Important rules#
Every added vector must have length
f.Add items before calling
build. [1]After
build, the index is used for queries. To add more items, discard the forest withunbuild, add items, and build again.
Persistence and sharing#
Save and load#
savewrites the index to a file.loadmemory-maps (mmap) the file for fast loading and sharing across processes. [1]
Prefault (optional)#
Some builds expose a prefault option for load. When enabled,
the loader may aggressively fault pages into memory. This is platform dependent. [1]
On-disk build (large datasets)#
Annoy can build an index directly into a file on disk. This is intended for datasets that are too large to fit into memory during index construction. [1]
Workflow#
Important rules#
Example#
import random
from scikitplot.annoy import AnnoyIndex
random.seed(0)
f = 40
a = AnnoyIndex(f, "angular")
a.on_disk_build("big.ann")
for i in range(1000):
v = [random.gauss(0, 1) for _ in range(f)]
a.add_item(i, v)
a.build(10)
# In another process, load the same file (mmap):
b = AnnoyIndex(f, "angular")
b.load("big.ann")
print(b.get_nns_by_item(0, 10))
Tuning#
n_trees(build time): Larger values usually improve recall but increase build time and memory usage. [1]search_k(query time): Larger values usually improve recall but make queries slower. [1]
If search_k is not provided, Annoy uses a default based on the number of trees
and the requested neighbor count. [1]
Practical tips#
Choose stable item ids#
Annoy uses non-negative integer item ids and allocates storage up to max(id) + 1. [1]
If your external ids are sparse or non-numeric, keep a separate mapping to compact ids.
Multi-process serving#
A common serving pattern is:
Build once and write an index file.
In each worker process, load the same file with
load(mmap) and query. [1]
Developer notes (C++)#
AnnoyIndex is not copyable (it contains atomic state). In C++14, avoid copy
initialization from temporaries. Prefer direct initialization:
using Index = Annoy::AnnoyIndex<int, double, Annoy::Angular, Annoy::Kiss32Random,
Annoy::AnnoyIndexSingleThreadedBuildPolicy>;
Index t(f); // C++14 compatible
See also#
See also
pickle(Python standard library)Alternative ANN libraries: - nmslib/hnswlib - spotify/voyager