spotify/ANNoy (experimental)#

This page documents the Annoy [0] integration shipped with scikit-plots.

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for approximate nearest-neighbor search in high-dimensional vector spaces. [1]

TL;DR#

  • What it is: C++ library with Python bindings for approximate nearest-neighbor (ANN) search. [1]

  • Origin: Developed at Spotify (Hack Week). [1]

  • Since: Open sourced in 2013. [3]

  • Index type: Forest of random projection trees. [1]

  • Storage: File-based indexes can be memory-mapped (mmap) and shared across processes. [1]

  • Tuning: Use n_trees (build) and search_k (query) to trade accuracy for speed. [1]

  • Metrics: Euclidean, Manhattan, cosine (angular), Hamming, dot (inner product). [1]

Quick start#

import random
random.seed(0)

# from annoy import AnnoyIndex
# from scikitplot.cexternals._annoy import AnnoyIndex
from scikitplot.annoy import AnnoyIndex

f = 40
t = AnnoyIndex(f, "angular")

for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    t.add_item(i, v)

t.build(10)            # n_trees
t.save("test.ann")

u = AnnoyIndex(f, "angular")
u.load("test.ann")     # memory-mapped (mmap)
print(u.get_nns_by_item(0, 10))

Workflow#

  1. Create an AnnoyIndex with vector length f and a metric. [1]

  2. Add items with add_item. [1]

  3. Build the forest with build. [1]

  4. Query with get_nns_by_item or get_nns_by_vector. [1]

  5. Persist with save and load with load. [1]

Important rules#

  • Every added vector must have length f.

  • Add items before calling build. [1]

  • After build, the index is used for queries. To add more items, discard the forest with unbuild, add items, and build again.

Persistence and sharing#

Save and load#

  • save writes the index to a file.

  • load memory-maps (mmap) the file for fast loading and sharing across processes. [1]

Prefault (optional)#

Some builds expose a prefault option for load. When enabled, the loader may aggressively fault pages into memory. This is platform dependent. [1]

On-disk build (large datasets)#

Annoy can build an index directly into a file on disk. This is intended for datasets that are too large to fit into memory during index construction. [1]

Workflow#

  1. Create the index.

  2. Call on_disk_build before adding any items. [1]

  3. Add items.

  4. Call build.

  5. Query the index, or load it from other processes with load. [1]

Important rules#

  • Call on_disk_build before add_item. [1]

  • After building in this mode, there is no need to call save because the file is already the backing store. [1]

Example#

import random
from scikitplot.annoy import AnnoyIndex

random.seed(0)
f = 40

a = AnnoyIndex(f, "angular")
a.on_disk_build("big.ann")

for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    a.add_item(i, v)

a.build(10)

# In another process, load the same file (mmap):
b = AnnoyIndex(f, "angular")
b.load("big.ann")
print(b.get_nns_by_item(0, 10))

Tuning#

  • n_trees (build time): Larger values usually improve recall but increase build time and memory usage. [1]

  • search_k (query time): Larger values usually improve recall but make queries slower. [1]

If search_k is not provided, Annoy uses a default based on the number of trees and the requested neighbor count. [1]

Practical tips#

Choose stable item ids#

Annoy uses non-negative integer item ids and allocates storage up to max(id) + 1. [1] If your external ids are sparse or non-numeric, keep a separate mapping to compact ids.

Multi-process serving#

A common serving pattern is:

  1. Build once and write an index file.

  2. In each worker process, load the same file with load (mmap) and query. [1]

Developer notes (C++)#

AnnoyIndex is not copyable (it contains atomic state). In C++14, avoid copy initialization from temporaries. Prefer direct initialization:

using Index = Annoy::AnnoyIndex<int, double, Annoy::Angular, Annoy::Kiss32Random,
                                Annoy::AnnoyIndexSingleThreadedBuildPolicy>;
Index t(f);  // C++14 compatible

See also#

See also

References#