Index#

class scikitplot.annoy.Index(f=0, metric='angular')[source]#

High-level, Pythonic Annoy wrapper that is picklable (pickle-able).

A minimal modification of the spotify/annoy low-level C API that extends its Python API.

Parameters:
f : int, optional (default=0)

Number of features (vector dimension). If 0, the dimension is inferred from the first vector passed to add_item and fixed for the lifetime of the index.

metric : str, optional (default='angular')

Distance metric: one of angular, euclidean, manhattan, dot or hamming.
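Examples

A minimal end-to-end sketch (vectors and tree count are illustrative; every call used here is documented below):

>>> from scikitplot.annoy import Index
>>> index = Index(f=3, metric='angular')
>>> index.add_item(0, [1.0, 0.0, 0.0]).add_item(1, [0.0, 1.0, 0.0])
>>> index.build(n_trees=10)
>>> ids, dists = index.get_nns_by_item(0, 2)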
add_item(i, vector)#

Add a single embedding vector to the index.

Parameters:
i : int

Integer identifier (index) for this row. Must be non-negative. Annoy will internally allocate space up to max(i) + 1.

vector : sequence of float

1D embedding of length f. If f == 0 on the first call, the dimension is inferred from vector and fixed for the lifetime of the index.

Returns:
self : Annoy

The index itself, allowing chained calls, e.g.:

index.add_item(0, v0).add_item(1, v1)

Notes

Items must be added before calling build. After building the forest, further calls to add_item are not supported.

Examples

>>> from scikitplot.annoy import Index
>>> index = Index(f=3)
>>> index.add_item(0, [1.0, 0.0, 0.0])
>>> index.add_item(1, [0.0, 1.0, 0.0])
build(n_trees, n_jobs=-1)#

Build a forest of random projection trees.

Parameters:
n_trees : int

Number of trees in the forest. Larger values typically improve recall at the cost of slower build and query time.

n_jobs : int, optional (default=-1)

Number of threads to use while building. -1 means “use all available CPU cores”.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

After build completes, the index becomes read-only for queries. To add more items, call unbuild, add items again, and then rebuild.

References

Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.
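Examples

A sketch of a typical build; the tree count is illustrative and the recall/speed trade-off depends on the data:

>>> index.build(n_trees=50, n_jobs=-1)  # parallel build on all cores
>>> index.get_n_trees()
50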

property compress_mode: Literal['zlib', 'gzip'] | None#


deserialize(byte, prefault=False)#

Restore the index from a serialized byte string.

Parameters:
byte : bytes

Byte string produced by serialize.

prefault : bool, optional (default=False)

If True, fault pages into memory while restoring.

Returns:
self : Annoy

The index itself, allowing chained calls.

dump_binary(path, *, backend='pickle')[source]#

Persist the index to a binary file using the selected serialization backend.
Parameters:
  • path (str)

  • backend (Literal['pickle', 'joblib', 'cloudpickle'])

Return type:

None

f#

Number of features (vector dimension).

classmethod from_json(path)[source]#

Reconstruct an index from a JSON file written by to_json.
Parameters:

path (str)

classmethod from_low_level(obj, prefault=False)[source]#

Convert a low-level Annoy instance into a high-level Index by round-tripping through serialize/deserialize.

Parameters:
obj : Annoy

Low-level Annoy instance to convert.

prefault : bool, optional (default=False)

If True, fault pages into memory while restoring.

Return type:

Index

classmethod from_manifest(manifest)[source]#

Reconstruct an index from a manifest dictionary produced by to_manifest.
Parameters:

manifest (dict[str, Any])

classmethod from_yaml(path)[source]#

Reconstruct an index from a YAML file written by to_yaml.
Parameters:

path (str)

get_distance(i, j) float#

Return the distance between two stored items.

Parameters:
i, j : int

Row identifiers (indices) of two stored samples.

Returns:
d : float

Distance between items i and j under the current metric (angular, euclidean, manhattan, dot or hamming).
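Examples

Continuing the add_item example (two orthogonal unit vectors): Annoy's angular metric is sqrt(2 - 2·cos θ), so the expected output shown here is sqrt(2) ≈ 1.414 (illustrative):

>>> index.get_distance(0, 1)
1.414...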

get_item_vector(i) list[float]#

Return the stored embedding vector for a given index.

Parameters:
i : int

Row identifier (index) previously passed to add_item.

Returns:
vector : list[float]

Stored embedding of length f.

get_n_items() int#

Return the number of stored samples (rows) in the index.

Returns:
n_items : int

Number of items that have been added and are currently addressable by nearest-neighbour queries.

get_n_trees() int#

Return the number of trees in the current forest.

Returns:
n_trees : int

Number of trees that have been built.

get_neighbor_ids_by_item(item, n, *, search_k=-1, include_self=False, include_distances=False)[source]#

Return the ids of the n nearest neighbours of a stored item (with distances when include_distances=True).

Return type:

List[int] | Tuple[List[int], List[float]]

get_neighbor_ids_by_vector(vector, n, *, search_k=-1, include_distances=False, include_self=False, exclude_item=None, exclude_item_ids=None)[source]#

Return the ids of the n nearest neighbours of a query vector, optionally excluding given item ids.

Return type:

List[int] | Tuple[List[int], List[float]]

get_neighbor_vectors_by_item(item, n, *, search_k=-1, include_self=False, include_distances=False, as_numpy=False, dtype='float32')[source]#

Return the stored vectors of the n nearest neighbours of a stored item, as lists or a NumPy array.

Return type:

List[Sequence[float]] | ndarray | Tuple[List[Sequence[float]] | ndarray, List[float]]

get_neighbor_vectors_by_vector(vector, n, *, search_k=-1, include_distances=False, include_self=False, exclude_item=None, exclude_item_ids=None, as_numpy=False, dtype='float32')[source]#

Return the stored vectors of the n nearest neighbours of a query vector, as lists or a NumPy array.

Return type:

List[Sequence[float]] | ndarray | Tuple[List[Sequence[float]] | ndarray, List[float]]
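The following sketch exercises both helper families; the keyword semantics (include_self, as_numpy) are inferred from the names and signatures above:

>>> ids = index.get_neighbor_ids_by_item(0, 5, include_self=False)
>>> vecs, dists = index.get_neighbor_vectors_by_vector(
...     [1.0, 0.0, 0.0], 5, include_distances=True, as_numpy=True
... )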

get_nns_by_item(i, n, search_k=-1, include_distances=True)#

Return the n nearest neighbours for a stored sample.

Parameters:
i : int

Row identifier (index) previously passed to add_item(i, embedding).

n : int

Number of nearest neighbours to return.

search_k : int, optional (default=-1)

Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional (default=True)

If True, return an (indices, distances) tuple; otherwise return only the list of indices.

Returns:
indices : list[int]

Nearest neighbour indices (item ids).

distances : list[float]

Corresponding distances (only if include_distances=True).

See also

get_neighbor_ids_by_item

Extended variant of this query with keyword-only filtering options.

get_nns_by_vector

Query with an explicit query embedding.

get_nns_by_vector(vector, n, search_k=-1, include_distances=True)#

Return the n nearest neighbours for a query embedding.

Parameters:
vector : sequence of float

Query embedding of length f.

n : int

Number of nearest neighbours to return.

search_k : int, optional (default=-1)

Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional (default=True)

If True, return an (indices, distances) tuple; otherwise return only the list of indices.

Returns:
indices : list[int]

Nearest neighbour indices (item ids).

distances : list[float]

Corresponding distances (only if include_distances=True).

See also

get_nns_by_item

Query by stored sample id.
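Examples

Query with an explicit embedding; a larger search_k (value illustrative) trades query speed for recall:

>>> ids = index.get_nns_by_vector([0.9, 0.1, 0.0], 2, include_distances=False)
>>> ids, dists = index.get_nns_by_vector([0.9, 0.1, 0.0], 2, search_k=1000)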

info() dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Returns:
info : dict

Dictionary describing the current index state.

Examples

>>> info = idx.info()
>>> info['dimension']
100
>>> info['n_items']
1000
iter_item_vectors(ids=None, *, start=0, stop=None, with_ids=True)[source]#

Iterate item vectors in a memory-safe way.

Parameters:
ids : Sequence[int], optional

Explicit item ids. Must be a sized Sequence for strictness.

start, stop : int, optional

Used only when ids is None.

with_ids : bool, optional (default=True)

If True, yield (id, vector) pairs; otherwise yield vectors only.

Yields:
(id, vector) or vector

Return type:

Iterator[Sequence[float] | tuple[int, Sequence[float]]]
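Examples

Stream a slice of the index without materializing it; process is a hypothetical per-vector callback:

>>> for item_id, vec in index.iter_item_vectors(start=0, stop=100):
...     process(vec)  # hypothetical callback, replace with real work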

iter_neighbor_vectors_by_item(item, n, *, search_k=-1, include_self=False)[source]#

Iterate over the stored vectors of the n nearest neighbours of a stored item.

iter_neighbor_vectors_by_vector(vector, n, *, search_k=-1, include_self=False, exclude_item=None, exclude_item_ids=None)[source]#

Iterate over the stored vectors of the n nearest neighbours of a query vector.
load(fn, prefault=False)[source]#

Load (mmap) an index from disk into the current object.

Parameters:
fn : str

Path to a file previously created by save or on_disk_build.

prefault : bool, optional (default=False)

If True, fault pages into memory when the file is mapped.

Returns:
self : Annoy

The index itself, allowing chained calls.


Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
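Examples

A save/load round trip; the loading index must be constructed with a matching dimension and metric (path is illustrative):

>>> index.save('index.ann')
>>> restored = Index(f=3, metric='angular').load('index.ann')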

classmethod load_binary(path, *, backend='pickle')[source]#

Restore an index from a binary file written by dump_binary.
Parameters:
  • path (str)

  • backend (Literal['pickle', 'joblib', 'cloudpickle'])

Return type:

Self

classmethod load_from_file(path)[source]#

Load an index from a file previously written by save_to_file.
Parameters:

path (str)

Return type:

Self

memory_usage() int#

Approximate memory usage of the index in bytes.

Returns:
n_byte : int

Approximate number of bytes used by the index. When native support is unavailable, this is estimated via serialize.

metric#

Distance metric (angular, euclidean, manhattan, dot or hamming).

on_disk_build(fn)[source]#

Configure the index to build using an on-disk backing file.

Parameters:
fn : str

Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:
self : Annoy

The index itself, allowing chained calls.


Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.
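Examples

A sketch of an on-disk build for datasets larger than RAM; as in spotify/annoy, on_disk_build is called before items are added (path is illustrative):

>>> big = Index(f=100, metric='euclidean')
>>> big.on_disk_build('/tmp/big_index.ann')
>>> # ... add_item calls ...
>>> big.build(n_trees=20)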

property on_disk_path: str | None#

Path of the on-disk backing file configured via on_disk_build, or None if no backing file is configured.

property pickle_mode: Literal['auto', 'byte', 'disk']#


property prefault: bool#


save(fn, prefault=False)[source]#

Persist the index to a binary file on disk.

Parameters:
fn : str

Path to the output file. Existing files will be overwritten.

prefault : bool, optional (default=False)

If True, aggressively fault pages into memory during save. Primarily useful on some platforms for large indexes.

Returns:
self : Annoy

The index itself, allowing chained calls.

save_to_file(path)[source]#

Save the index to the given path (companion to load_from_file).
Parameters:

path (str)

Return type:

None

save_vectors_npy(path, ids=None, *, start=0, stop=None, dtype='float32', overwrite=True)[source]#

Save vectors into a .npy file using NumPy open_memmap.

This is the recommended path for very large indexes.

Returns:
path : str

Path of the written .npy file.

Return type:

str
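Examples

Export to .npy, then read it back lazily via a NumPy memmap (file name is illustrative):

>>> path = index.save_vectors_npy('vectors.npy', dtype='float32')
>>> import numpy as np
>>> arr = np.load(path, mmap_mode='r')  # memory-mapped, loads pages on demand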

serialize() bytes#

Serialize the full in-memory index into a byte string.

Returns:
byte : bytes

Opaque binary blob containing the entire Annoy index.

See also

deserialize

Restore an index from bytes.

Notes

The serialized form is a snapshot of the internal C++ data structures. It can be stored, transmitted or used with joblib without rebuilding trees.
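Examples

A serialize/deserialize round trip into a fresh object (dimension and metric are illustrative and must match):

>>> blob = index.serialize()
>>> clone = Index(f=3, metric='angular').deserialize(blob)
>>> clone.get_n_items() == index.get_n_items()
True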

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:
seed : int, optional

Non-negative integer seed. If omitted, a library-specific default is used. For strict reproducibility, always call this method explicitly before build.

Returns:
None

This method returns None.

Notes

Using the same seed, data and n_trees usually produces bitwise-identical forests (subject to CPU / threading details).
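Examples

Seed explicitly before building (seed value illustrative):

>>> idx = Index(f=3)
>>> idx.set_seed(42)  # must be called before build
>>> # ... add_item calls ...
>>> idx.build(n_trees=10)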

to_csv(path, ids=None, *, start=0, stop=None, include_id=True, header=True, delimiter=',', float_format=None, columns=None, dtype='float32')[source]#

Stream vectors to CSV without building a full DataFrame.

This is safer than df.to_csv for large exports.

Notes

CSV for 1B rows will be extremely large and slow. Consider Parquet in the future.

Return type:

str

to_dataframe(ids=None, *, start=0, stop=None, include_id=True, columns=None, dtype='float32')[source]#

Materialize vectors into a Pandas DataFrame.

WARNING: Not suitable for huge indexes.

to_json(path=None)[source]#

Serialize the index manifest to JSON, writing it to path when given and returning the JSON string.
Parameters:

path (str | None)

Return type:

str

to_manifest()[source]#

Return a manifest dictionary describing the index (see from_manifest).
Return type:

dict[str, Any]

to_numpy(ids=None, *, start=0, stop=None, dtype='float32')[source]#

Materialize vectors into an in-memory NumPy array.

WARNING: Not suitable for huge indexes.

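Examples

Materialize only a slice when memory is a concern; the shape shown follows the info example above (1000 items, dimension 100) and is illustrative:

>>> X = idx.to_numpy(start=0, stop=1000, dtype='float32')
>>> X.shape
(1000, 100)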
to_yaml(path)[source]#

Write a YAML representation of the index manifest to path (see from_yaml).
Parameters:

path (str)

Return type:

None

unbuild()#

Discard the current forest, allowing new items to be added.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

After calling unbuild, you must call build again before running nearest-neighbour queries.
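Examples

The unbuild → add → rebuild cycle described above, using the chaining that each method's return value allows:

>>> index.unbuild().add_item(2, [0.0, 0.0, 1.0]).build(n_trees=10)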

unload()#

Unmap any memory-mapped file backing this index.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

This releases OS-level resources associated with the mmap, but keeps the Python object alive.

verbose(level=1)#

Control verbosity of the underlying C++ index.

Parameters:
level : int, optional (default=1)

Logging level inspired by gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

Returns:
self : Annoy

The index itself, allowing chained calls.
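Examples

Since verbose returns the index, it can be chained with other calls:

>>> index.verbose(0).build(n_trees=10)  # quiet build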