PickleMixin#

class scikitplot.annoy.PickleMixin(f=0, metric='angular')[source]#

Adds strict persistence support.

  • ‘byte’ mode: stores serialized bytes, optionally compressed.

  • ‘disk’ mode: stores only the path (requires on_disk_build/load first).

  • ‘auto’ mode: deterministic policy: use disk if an on-disk path is known, otherwise byte.
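The mode selection described above can be sketched as a small stand-alone function. This is only an illustration of the stated policy, not the library's implementation; the name `resolve_pickle_mode` and its arguments are hypothetical, and the error raised for a missing path is an assumption.

```python
from typing import Optional


def resolve_pickle_mode(pickle_mode: str, on_disk_path: Optional[str]) -> str:
    """Return the effective mode ('byte' or 'disk') for a requested mode.

    Hypothetical helper that mirrors only the policy described in the text.
    """
    if pickle_mode not in ("auto", "byte", "disk"):
        raise ValueError(f"unknown pickle_mode: {pickle_mode!r}")
    if pickle_mode == "auto":
        # Deterministic policy: choose disk only when a path is known.
        return "disk" if on_disk_path else "byte"
    if pickle_mode == "disk" and not on_disk_path:
        # 'disk' mode stores only the path, so a path must already exist.
        raise ValueError("'disk' mode requires on_disk_build or load first")
    return pickle_mode
```

Because the result depends only on the requested mode and on whether a path is known, the same inputs always give the same choice.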

add_item(i, vector)#

Add a single embedding vector to the index.

Parameters:
i : int

Integer identifier (index) for this row. Must be non-negative. Annoy internally allocates space for max(i) + 1 items.

vector : sequence of float

1D embedding of length f. If f == 0 on the first call, the dimension is inferred from vector and fixed for the lifetime of the index.

Returns:
self : Annoy

The index itself, allowing chained calls, e.g.:

index.add_item(0, v0).add_item(1, v1)

Notes

Items must be added before calling build. After building the forest, further calls to add_item are not supported.

Examples

>>> index.add_item(0, [1.0, 0.0, 0.0])
>>> index.add_item(1, [0.0, 1.0, 0.0])

build(n_trees, n_jobs=-1)#

Build a forest of random projection trees.

Parameters:
n_trees : int

Number of trees in the forest. Larger values typically improve recall at the cost of slower build and query time.

n_jobs : int, optional (default=-1)

Number of threads to use while building. -1 means “use all available CPU cores”.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

After build completes, the index is read-only and can be queried. To add more items, call unbuild, add the items, and then call build again.
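The add, build, unbuild cycle can be modelled with a toy class. `ToyIndex` is a hypothetical stand-in that tracks state only: it builds no trees, and raising `RuntimeError` when an item is added to a built index is an assumption about how "not supported" might surface, not documented behaviour.

```python
class ToyIndex:
    """Hypothetical stand-in that models only the add/build/unbuild rules."""

    def __init__(self, f: int = 0):
        self.f = f          # 0 means "infer from the first vector"
        self.items = {}     # item id -> vector
        self.built = False

    def add_item(self, i, vector):
        if self.built:
            # Assumption: the real index may signal this differently.
            raise RuntimeError("index is built; call unbuild() first")
        if i < 0:
            raise ValueError("item id must be non-negative")
        if self.f == 0:
            self.f = len(vector)  # dimension is fixed from now on
        if len(vector) != self.f:
            raise ValueError(f"expected length {self.f}, got {len(vector)}")
        self.items[i] = list(vector)
        return self  # returning self enables chained calls

    def build(self, n_trees, n_jobs=-1):
        self.built = True   # the toy only flips state
        return self

    def unbuild(self):
        self.built = False
        return self


index = ToyIndex().add_item(0, [1.0, 0.0, 0.0]).add_item(1, [0.0, 1.0, 0.0])
index.build(n_trees=10)
# Adding a third item requires discarding the forest first.
index.unbuild().add_item(2, [0.0, 0.0, 1.0]).build(n_trees=10)
```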

References

Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.

property compress_mode: Literal['zlib', 'gzip'] | None#

deserialize(byte, prefault=False)#

Restore the index from a serialized byte string.

Parameters:
byte : bytes

Byte string produced by serialize.

prefault : bool, optional (default=False)

If True, fault pages into memory while restoring.

Returns:
self : Annoy

The index itself, allowing chained calls.

f#

Number of features (vector dimension).

get_distance(i, j) → float#

Return the distance between two stored items.

Parameters:
i, j : int

Row identifiers (indices) of two stored samples.

Returns:
d : float

Distance between items i and j under the current metric (angular, euclidean, manhattan, dot or hamming).
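For reference, three of the listed metrics can be written out in plain Python. Upstream Annoy documents its angular distance as the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos(u, v))); the sketch assumes this wrapper follows the same definition. The function names are illustrative, and dot and hamming are left out.

```python
import math


def angular(u, v):
    """sqrt(2 * (1 - cos(u, v))), as documented by upstream Annoy."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)  # undefined for zero vectors
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def euclidean(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Under this definition the angular distance ranges from 0 for vectors pointing the same way to 2 for opposite vectors, regardless of their lengths.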

get_item_vector(i) → list[float]#

Return the stored embedding vector for a given index.

Parameters:
i : int

Row identifier (index) previously passed to add_item.

Returns:
vector : list[float]

Stored embedding of length f.

get_n_items() → int#

Return the number of stored samples (rows) in the index.

Returns:
n_items : int

Number of items that have been added and are currently addressable by nearest-neighbour queries.

get_n_trees() → int#

Return the number of trees in the current forest.

Returns:
n_trees : int

Number of trees that have been built.

get_nns_by_item(i, n, search_k=-1, include_distances=True)#

Return the n nearest neighbours for a stored sample.

Parameters:
i : int

Row identifier (index) previously passed to add_item(i, embedding).

n : int

Number of nearest neighbours to return.

search_k : int, optional (default=-1)

Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional (default=True)

If True, return an (indices, distances) tuple; otherwise return only the list of indices.

Returns:
indices : list[int]

Nearest neighbour indices (item ids).

distances : list[float]

Corresponding distances (only if include_distances=True).

See also

get_nns_by_item

Alias for this method for backward compatibility.

get_nns_by_vector

Query with an explicit query embedding.

get_nns_by_vector(vector, n, search_k=-1, include_distances=True)#

Return the n nearest neighbours for a query embedding.

Parameters:
vector : sequence of float

Query embedding of length f.

n : int

Number of nearest neighbours to return.

search_k : int, optional (default=-1)

Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional (default=True)

If True, return an (indices, distances) tuple; otherwise return only the list of indices.

Returns:
indices : list[int]

Nearest neighbour indices (item ids).

distances : list[float]

Corresponding distances (only if include_distances=True).

See also

get_nns_by_item

Query by stored sample id.
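Since the queries are approximate, an exact brute-force search over a small sample is a convenient baseline for judging how search_k and n_trees affect recall. The sketch below is a reference implementation using Euclidean distance only. `brute_force_nns` and `recall` are hypothetical helpers, not part of the library; the return shape mirrors the include_distances behaviour described above.

```python
import math


def brute_force_nns(vectors, query, n, include_distances=True):
    """Exact n nearest neighbours by Euclidean distance (reference only).

    `vectors` maps item id -> vector. Hypothetical helper, not a library API.
    """
    scored = sorted(
        (math.dist(vec, query), item_id) for item_id, vec in vectors.items()
    )[:n]
    indices = [item_id for _, item_id in scored]
    if include_distances:
        return indices, [d for d, _ in scored]
    return indices


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0]}
ids, dists = brute_force_nns(vectors, [0.9, 0.0], n=2)
ids_only = brute_force_nns(vectors, [0.9, 0.0], n=2, include_distances=False)
```

Comparing the ids returned by an approximate query against `ids` with `recall` gives a quick estimate of how much accuracy a given search_k buys.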

info() → dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Returns:
info : dict

Dictionary describing the current index state.

Examples

>>> info = idx.info()
>>> info['dimension']
100
>>> info['n_items']
1000

load(fn, prefault=False)[source]#

Load (mmap) an index from disk into the current object.

Parameters:
fn : str

Path to a file previously created by save or on_disk_build.

prefault : bool, optional (default=False)

If True, fault pages into memory when the file is mapped.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.

memory_usage() → int#

Approximate memory usage of the index in bytes.

Returns:
n_byte : int

Approximate number of bytes used by the index. When native support is unavailable, this is estimated via serialize.
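One plausible reading of the serialize-based estimate is simply the length of the serialized blob. The sketch below assumes that reading; `estimate_memory_usage` and the `_StubIndex` class are hypothetical and exist only to make the example runnable without a real index.

```python
def estimate_memory_usage(index) -> int:
    """Estimate the footprint as the size of the serialized blob.

    Assumption: the fallback measures len(index.serialize()).
    """
    return len(index.serialize())


class _StubIndex:
    """Hypothetical stand-in exposing only serialize()."""

    def serialize(self) -> bytes:
        return b"\x00" * 1024  # placeholder payload, not a real index


n_byte = estimate_memory_usage(_StubIndex())
```

Such an estimate costs a full serialization pass, so it is best avoided in hot paths.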

metric#

Distance metric (angular, euclidean, manhattan, dot, hamming).

on_disk_build(fn)[source]#

Configure the index to build using an on-disk backing file.

Parameters:
fn : str

Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

property on_disk_path: str | None#

property pickle_mode: Literal['auto', 'byte', 'disk']#

property prefault: bool#

save(fn, prefault=False)[source]#

Persist the index to a binary file on disk.

Parameters:
fn : str

Path to the output file. Existing files will be overwritten.

prefault : bool, optional (default=False)

If True, aggressively fault pages into memory during save. Primarily useful on some platforms for large indexes.

Returns:
self : Annoy

The index itself, allowing chained calls.

serialize() → bytes#

Serialize the full in-memory index into a byte string.

Returns:
byte : bytes

Opaque binary blob containing the entire Annoy index.

See also

deserialize

Restore an index from bytes.

Notes

The serialized form is a snapshot of the internal C++ data structures. It can be stored, transmitted or used with joblib without rebuilding trees.
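The compress_mode property accepts 'zlib', 'gzip', or None. As a rough sketch of what optional compression of a serialized blob could look like, the helpers below wrap the standard-library codecs. `pack` and `unpack` are hypothetical names, the placeholder bytes stand in for the output of serialize, and the library's actual on-the-wire layout is not specified here.

```python
import gzip
import zlib
from typing import Optional


def pack(blob: bytes, compress_mode: Optional[str] = None) -> bytes:
    """Compress a serialized blob; None leaves it unchanged."""
    if compress_mode is None:
        return blob
    if compress_mode == "zlib":
        return zlib.compress(blob)
    if compress_mode == "gzip":
        return gzip.compress(blob)
    raise ValueError(f"unsupported compress_mode: {compress_mode!r}")


def unpack(data: bytes, compress_mode: Optional[str] = None) -> bytes:
    """Invert pack() for the same compress_mode."""
    if compress_mode is None:
        return data
    if compress_mode == "zlib":
        return zlib.decompress(data)
    if compress_mode == "gzip":
        return gzip.decompress(data)
    raise ValueError(f"unsupported compress_mode: {compress_mode!r}")


blob = b"\x00" * 4096  # placeholder standing in for index.serialize()
for mode in (None, "zlib", "gzip"):
    # Each mode must round-trip to the original bytes.
    assert unpack(pack(blob, mode), mode) == blob
```

The same mode must be used on both sides, so it has to be stored alongside the payload.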

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:
seed : int, optional

Non-negative integer seed. If omitted, a library-specific default is used. For strict reproducibility, always call this method explicitly before build.

Returns:
None

This method returns None.

Notes

Using the same seed, data and n_trees usually produces bitwise-identical forests (subject to CPU / threading details).

unbuild()#

Discard the current forest, allowing new items to be added.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

After calling unbuild, you must call build again before running nearest-neighbour queries.

unload()#

Unmap any memory-mapped file backing this index.

Returns:
self : Annoy

The index itself, allowing chained calls.

Notes

This releases OS-level resources associated with the mmap, but keeps the Python object alive.

verbose(level=1)#

Control verbosity of the underlying C++ index.

Parameters:
level : int, optional (default=1)

Logging level inspired by gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

Returns:
self : Annoy

The index itself, allowing chained calls.