Index
- class scikitplot.annoy.Index(f=0, metric='angular')
High-level Pythonic Annoy wrapper that supports pickling.
Minimally modifies the low-level spotify/annoy C API to extend the Python API.
- add_item(i, vector)
Add a single embedding vector to the index.
- Parameters:
- i : int
Integer identifier (index) for this row. Must be non-negative. Annoy will internally allocate space for max(i) + 1 items.
- vector : sequence of float
1D embedding of length f. If f == 0 on the first call, the dimension is inferred from vector and fixed for the lifetime of the index.
- Returns:
- self : Annoy
The index itself, allowing chained calls, e.g.:
index.add_item(0, v0).add_item(1, v1)
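The chaining and dimension-inference behaviour described above can be illustrated without the library. The sketch below is a pure-Python stand-in (the class name TinyIndex is hypothetical and is not part of scikitplot.annoy); it only mimics the documented contract: add_item returns self, and f == 0 means the dimension is taken from the first vector. Raising ValueError on a length mismatch is an assumption.

```python
class TinyIndex:
    """Hypothetical stand-in that mimics the documented add_item contract."""

    def __init__(self, f=0):
        self.f = f          # 0 means "infer from the first vector"
        self.items = {}     # item id -> stored vector

    def add_item(self, i, vector):
        if i < 0:
            raise ValueError("item id must be non-negative")
        vector = [float(x) for x in vector]
        if self.f == 0:
            self.f = len(vector)  # dimension is fixed from here on
        elif len(vector) != self.f:
            # Assumed error type; the real index may signal this differently
            raise ValueError(f"expected length {self.f}, got {len(vector)}")
        self.items[i] = vector
        return self  # returning self is what makes chaining possible


index = TinyIndex()  # f=0: dimension not yet known
index.add_item(0, [1.0, 0.0, 0.0]).add_item(1, [0.0, 1.0, 0.0])
print(index.f)  # dimension inferred from the first vector
```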
Notes
Items must be added before calling build. After building the forest, further calls to add_item are not supported.
Examples
>>> index.add_item(0, [1.0, 0.0, 0.0])
>>> index.add_item(1, [0.0, 1.0, 0.0])
- build(n_trees, n_jobs=-1)
Build a forest of random projection trees.
- Parameters:
- n_trees : int
Number of trees in the forest. Larger values typically improve recall at the cost of slower build and query time.
- n_jobs : int, optional (default=-1)
Number of threads to use while building. -1 means “use all available CPU cores”.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
After build completes, the index becomes read-only for queries. To add more items, call unbuild, add the new items, and then call build again.
References
Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.
- deserialize(byte, prefault=False)
Restore the index from a serialized byte string.
- Parameters:
- byte : bytes
Byte string produced by serialize.
- prefault : bool, optional (default=False)
If True, fault pages into memory while restoring.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
- f
Number of features (vector dimension).
- classmethod from_low_level(obj, prefault=False)
Convert a low-level Annoy instance into a high-level Index instance by round-tripping through serialize/deserialize.
- get_distance(i, j) -> float
Return the distance between two stored items.
- Parameters:
- i, j : int
Row identifiers (indices) of two stored samples.
- Returns:
- d : float
Distance between items i and j under the current metric (angular, euclidean, manhattan, dot or hamming).
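For orientation, most of the metrics named above can be written out in plain Python. The formulas below follow the upstream Annoy README, where angular distance is defined as sqrt(2 * (1 - cos(u, v))); whether this wrapper returns exactly these values is an assumption. The dot metric is omitted because its return convention is not stated on this page, and the function names are illustrative only.

```python
import math


def angular(u, v):
    # Euclidean distance between the normalised vectors: sqrt(2 * (1 - cos(u, v)))
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = dot / norm  # undefined for zero vectors; not handled in this sketch
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    # Number of positions where the two binary vectors differ
    return sum(1 for a, b in zip(u, v) if a != b)


u, v = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(angular(u, v), euclidean(u, v), manhattan(u, v), hamming(u, v))
```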
- get_item_vector(i) -> list[float]
Return the stored embedding vector for a given index.
- Parameters:
- i : int
Row identifier (index) previously passed to add_item.
- Returns:
- vector : list[float]
Stored embedding of length f.
- get_n_items() -> int
Return the number of stored samples (rows) in the index.
- Returns:
- n_items : int
Number of items that have been added and are currently addressable by nearest-neighbour queries.
- get_n_trees() -> int
Return the number of trees in the current forest.
- Returns:
- n_trees : int
Number of trees that have been built.
- get_neighbor_ids_by_item(item, n, *, search_k=-1, include_self=False, include_distances=False)
- get_neighbor_ids_by_vector(vector, n, *, search_k=-1, include_distances=False, include_self=False, exclude_item=None, exclude_item_ids=None)
- get_neighbor_vectors_by_item(item, n, *, search_k=-1, include_self=False, include_distances=False, as_numpy=False, dtype='float32')
- get_neighbor_vectors_by_vector(vector, n, *, search_k=-1, include_distances=False, include_self=False, exclude_item=None, exclude_item_ids=None, as_numpy=False, dtype='float32')
- get_nns_by_item(i, n, search_k=-1, include_distances=True)
Return the n nearest neighbours for a stored sample.
- Parameters:
- i : int
Row identifier (index) previously passed to add_item(i, embedding).
- n : int
Number of nearest neighbours to return.
- search_k : int, optional (default=-1)
Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional (default=True)
If True, return an (indices, distances) tuple; otherwise return only the list of indices.
- Returns:
- indices : list[int]
Nearest neighbour indices (item ids).
- distances : list[float]
Corresponding distances (only if include_distances=True).
See also
get_nns_by_item : Alias for this method for backward compatibility.
get_nns_by_vector : Query with an explicit query embedding.
- get_nns_by_vector(vector, n, search_k=-1, include_distances=True)
Return the n nearest neighbours for a query embedding.
- Parameters:
- vector : sequence of float
Query embedding of length f.
- n : int
Number of nearest neighbours to return.
- search_k : int, optional (default=-1)
Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional (default=True)
If True, return an (indices, distances) tuple; otherwise return only the list of indices.
- Returns:
- indices : list[int]
Nearest neighbour indices (item ids).
- distances : list[float]
Corresponding distances (only if include_distances=True).
See also
get_nns_by_item : Query by stored sample id.
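Because the index is approximate, a common sanity check is to compare its results against an exact brute-force search on a small sample. The sketch below is such a baseline in pure Python; exact_nns and recall are hypothetical helper names (not part of this API), Euclidean distance is assumed, and the return shape mirrors the (indices, distances) tuple described above.

```python
import math


def exact_nns(vectors, query, n, include_distances=True):
    """Exact n nearest neighbours of `query` among `vectors` (id -> vector)."""
    scored = sorted(
        (math.dist(vec, query), item_id) for item_id, vec in vectors.items()
    )[:n]
    indices = [item_id for _, item_id in scored]
    if include_distances:
        return indices, [d for d, _ in scored]
    return indices


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 3.0], 3: [5.0, 5.0]}
ids, dists = exact_nns(vectors, [0.9, 0.1], n=2)
print(ids, dists)
```

In practice the approximate ids would come from get_nns_by_vector with include_distances=False; raising search_k or n_trees should move the measured recall towards 1.0.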
- info() -> dict
Return a structured summary of the index.
This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.
- Returns:
- info : dict
Dictionary describing the current index state.
Examples
>>> info = idx.info()
>>> info['dimension']
100
>>> info['n_items']
1000
- iter_item_vectors(ids=None, *, start=0, stop=None, with_ids=True)
Iterate over item vectors in a memory-safe way.
- Parameters:
- ids:
Explicit item ids. Must be a sized Sequence for strictness.
- start, stop:
Used only when ids is None.
- with_ids:
If True, yield (id, vector); otherwise yield only the vector.
- Yields:
- (id, vector) or vector
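The streaming pattern can be sketched with a plain generator. The function below illustrates the documented parameters and is not the library's implementation; it assumes only an object exposing get_item_vector and get_n_items as documented on this page, and FakeIndex is a hypothetical stand-in so the example runs on its own.

```python
def iter_vectors(index, ids=None, *, start=0, stop=None, with_ids=True):
    """Yield one vector at a time instead of building a full list."""
    if ids is None:
        if stop is None:
            stop = index.get_n_items()
        ids = range(start, stop)
    for i in ids:
        vec = index.get_item_vector(i)
        yield (i, vec) if with_ids else vec


class FakeIndex:
    """Hypothetical stand-in exposing the two methods the generator needs."""

    def __init__(self, rows):
        self._rows = rows

    def get_n_items(self):
        return len(self._rows)

    def get_item_vector(self, i):
        return self._rows[i]


idx = FakeIndex([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
for item_id, vec in iter_vectors(idx, start=1):
    print(item_id, vec)
```

Only one vector is held at a time, which is why this shape suits exports that would not fit in memory as a single array.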
- iter_neighbor_vectors_by_vector(vector, n, *, search_k=-1, include_self=False, exclude_item=None, exclude_item_ids=None)
- load(fn, prefault=False)
Load (mmap) an index from disk into the current object.
- Parameters:
- fn : str
Path to a file previously created by save or on_disk_build.
- prefault : bool, optional (default=False)
If True, fault pages into memory when the file is mapped.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
- memory_usage() -> int
Approximate memory usage of the index in bytes.
- Returns:
- n_byte : int
Approximate number of bytes used by the index. When native support is unavailable, this is estimated via serialize.
- metric
Distance metric (angular, euclidean, manhattan, dot, hamming).
- on_disk_build(fn)
Configure the index to build using an on-disk backing file.
- Parameters:
- fn : str
Path to a file that will hold the index during build. The file is created or overwritten as needed.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
This mode is useful for very large datasets that do not fit comfortably in RAM during construction.
- save(fn, prefault=False)
Persist the index to a binary file on disk.
- Parameters:
- fn : str
Path to the output file. Existing files will be overwritten.
- prefault : bool, optional (default=False)
If True, aggressively fault pages into memory during save. Primarily useful on some platforms for large indexes.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
- save_vectors_npy(path, ids=None, *, start=0, stop=None, dtype='float32', overwrite=True)
Save vectors into a .npy file using NumPy open_memmap.
This is the recommended path for very large indexes.
- serialize() -> bytes
Serialize the full in-memory index into a byte string.
- Returns:
- byte : bytes
Opaque binary blob containing the entire Annoy index.
See also
deserialize : Restore an index from a byte string.
Notes
The serialized form is a snapshot of the internal C++ data structures. It can be stored, transmitted or used with joblib without rebuilding trees.
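One way a byte-level snapshot can back pickling is through Python's __getstate__ and __setstate__ hooks. The sketch below shows that pattern on a hypothetical stand-in class (Blob, with a toy struct-based format); it is not the wrapper's actual implementation, only an illustration of how a serialize/deserialize pair can serve pickle or joblib. The demo calls the hooks directly, in the same order pickle would.

```python
import struct


class Blob:
    """Hypothetical stand-in: a list of floats with a byte-level snapshot."""

    def __init__(self, values=()):
        self.values = list(values)

    def serialize(self):
        # Toy format: item count, then that many little-endian doubles
        n = len(self.values)
        return struct.pack(f"<I{n}d", n, *self.values)

    def deserialize(self, byte):
        (n,) = struct.unpack_from("<I", byte, 0)
        self.values = list(struct.unpack_from(f"<{n}d", byte, 4))
        return self  # chained-call style, as on this page

    # pickle (and joblib, which builds on it) calls these two hooks
    def __getstate__(self):
        return self.serialize()

    def __setstate__(self, state):
        self.deserialize(state)


original = Blob([1.0, 2.5, -3.0])
state = original.__getstate__()   # what pickle would store
clone = Blob.__new__(Blob)        # pickle creates the object without __init__
clone.__setstate__(state)         # and then restores it from the stored state
print(clone.values)
```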
- set_seed(seed=None)
Set the random seed used for tree construction.
- Parameters:
- seed : int, optional
Non-negative integer seed. If omitted, a library-specific default is used. For strict reproducibility, always call this method explicitly before build.
- Returns:
- None
This method returns None.
Notes
Using the same seed, data and n_trees usually produces bitwise-identical forests (subject to CPU / threading details).
- to_csv(path, ids=None, *, start=0, stop=None, include_id=True, header=True, delimiter=',', float_format=None, columns=None, dtype='float32')
Stream vectors to CSV without building a full DataFrame.
This is safer than df.to_csv for large exports.
Notes
CSV for 1B rows will be extremely large and slow. Consider Parquet in the future.
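Streaming to CSV means writing one row at a time from a lazy source rather than materializing a table first. The sketch below shows the idea with the standard csv module; stream_csv is a hypothetical helper, its parameters only echo a few of those listed above, and the f0, f1, ... column names are an assumption.

```python
import csv
import io


def stream_csv(rows, out, *, include_id=True, header=True, delimiter=",", float_format=None):
    """Write (id, vector) pairs to `out` one row at a time."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    wrote_header = not header
    for item_id, vec in rows:
        if not wrote_header:
            cols = [f"f{k}" for k in range(len(vec))]  # assumed column naming
            writer.writerow((["id"] if include_id else []) + cols)
            wrote_header = True
        cells = [format(x, float_format) if float_format else x for x in vec]
        writer.writerow(([item_id] if include_id else []) + cells)


rows = iter([(0, [1.0, 0.0]), (1, [0.25, 0.5])])  # any lazy iterable works
buf = io.StringIO()
stream_csv(rows, buf, float_format=".2f")
print(buf.getvalue())
```

As a rough size check (assuming 100 dimensions and about 10 characters per value), one billion rows come to roughly 10^9 x 100 x 10 bytes, on the order of 1 TB of text.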
- to_dataframe(ids=None, *, start=0, stop=None, include_id=True, columns=None, dtype='float32')
Materialize vectors into a Pandas DataFrame.
WARNING: Not suitable for huge indexes.
- to_numpy(ids=None, *, start=0, stop=None, dtype='float32')
Materialize vectors into an in-memory NumPy array.
WARNING: Not suitable for huge indexes.
- unbuild()
Discard the current forest, allowing new items to be added.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
After calling unbuild, you must call build again before running nearest-neighbour queries.
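The add, build, unbuild cycle described in these notes behaves like a small state machine. The class below is a hypothetical pure-Python model of that contract (it builds no trees and is not the library's code); the choice of RuntimeError is an assumption, since this page does not say which exception the real index raises.

```python
class LifecycleModel:
    """Hypothetical model of the documented add/build/unbuild rules."""

    def __init__(self):
        self.items = {}
        self.built = False

    def add_item(self, i, vector):
        if self.built:
            # Assumed error type; adding after build is documented as unsupported
            raise RuntimeError("index is built; call unbuild() before adding items")
        self.items[i] = list(vector)
        return self

    def build(self, n_trees):
        self.n_trees = n_trees
        self.built = True
        return self

    def unbuild(self):
        self.built = False
        return self

    def query_allowed(self):
        # Queries need a built forest
        return self.built


m = LifecycleModel()
m.add_item(0, [1.0, 0.0]).build(10)
m.unbuild().add_item(1, [0.0, 1.0]).build(10)  # add more items, then rebuild
print(len(m.items), m.query_allowed())
```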
- unload()
Unmap any memory-mapped file backing this index.
- Returns:
- self : Annoy
The index itself, allowing chained calls.
Notes
This releases OS-level resources associated with the mmap, but keeps the Python object alive.
- verbose(level=1)
Control verbosity of the underlying C++ index.
- Parameters:
- level : int, optional (default=1)
Logging level inspired by gradient-boosting libraries:
- <= 0: quiet (warnings only)
- 1: info (Annoy’s verbose=True)
- >= 2: debug (currently same as info, reserved for future use)
- Returns:
- self : Annoy
The index itself, allowing chained calls.