Index #

Notes

Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.

If the data was produced by to_bytes(format='native'), the parameters f and metric are required.
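
As a rough illustration of the header idea described in these notes, the sketch below prepends a fixed-size header to an opaque payload and rejects mismatches when reading it back. The field layout, the magic bytes, and the helper names (pack_portable, unpack_portable) are assumptions made for this example; they are not the library's actual blob layout.

```python
import struct
import sys

# Hypothetical header layout (NOT the library's real format):
# magic, version, little-endian flag, sizeof(int), sizeof(float), metric length, f
_HEADER = struct.Struct("<4sHBBBBI")
_MAGIC = b"ANNP"
_VERSION = 1


def pack_portable(payload: bytes, *, f: int, metric: str) -> bytes:
    """Prepend a small compatibility header to an opaque native payload."""
    metric_b = metric.encode("ascii")
    header = _HEADER.pack(
        _MAGIC,
        _VERSION,
        1 if sys.byteorder == "little" else 0,
        struct.calcsize("i"),
        struct.calcsize("f"),
        len(metric_b),
        f,
    )
    return header + metric_b + payload


def unpack_portable(blob: bytes, *, f: int, metric: str) -> bytes:
    """Validate the header and return the payload, failing loudly on mismatch."""
    if len(blob) < _HEADER.size:
        raise ValueError("blob is too short to contain a header")
    magic, version, little, size_i, size_f, mlen, stored_f = _HEADER.unpack_from(blob)
    if magic != _MAGIC or version != _VERSION:
        raise ValueError("not a supported portable blob")
    if little != (1 if sys.byteorder == "little" else 0):
        raise ValueError("endianness mismatch")
    if (size_i, size_f) != (struct.calcsize("i"), struct.calcsize("f")):
        raise ValueError("ABI size mismatch")
    start = _HEADER.size
    stored_metric = blob[start:start + mlen].decode("ascii")
    if stored_f != f or stored_metric != metric:
        raise ValueError("f/metric mismatch")
    # The payload itself is returned untouched; it stays a native snapshot.
    return blob[start + mlen:]
```

The header only guards against loading an incompatible snapshot; it does not translate the payload between architectures.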

classmethod from_json(path, *, load=True)[source]#

Load metadata from JSON and construct an index.

Parameters:

path (str | PathLike[str])
load (bool)

Return type:

classmethod from_low_level(obj, *, prefault=None)[source]#

Create a new Index from a low-level instance.

The new object is rebuilt by round-tripping through Annoy’s native serialize / deserialize to avoid sharing low-level state between two Python objects.

Parameters:

objscikitplot.cexternals._annoy.Annoy: Low-level Annoy instance.
prefaultbool or None, default=None: Prefault override passed to deserialize. If None, the value is taken from obj.get_params(deep=False) when available, otherwise it falls back to obj.prefault / destination defaults.

Returns:

indexIndex: Newly constructed high-level index.

Raises:

TypeError: If obj is not an Annoy instance.
RuntimeError: If serialization or deserialization fails, or required configuration (e.g., f) cannot be determined.

Parameters:

obj (Annoy)
prefault (bool | None)

Return type:

See also

Annoy.serialize
Annoy.deserialize
Annoy.get_params
Annoy.set_params

Notes

The implementation uses Annoy’s native serialization. It does not attempt to copy internal pointers or C++ state directly.

This method is deterministic. It always constructs a new index from the serialized payload; it does not share low-level state between objects.

classmethod from_metadata(metadata, *, load=True)[source]#

Construct an index from a metadata payload.

Parameters:

metadataMapping[str, Any]: Payload as produced by to_metadata.
loadbool, default=True: If True and params['on_disk_path'] is present, attempt to load the index into the returned object via backend load.

Returns:

indexSelf: Newly constructed index.

Raises:

TypeError: If input types are invalid.
ValueError: If required fields are missing or invalid.
RuntimeError: If schema version is missing on the class.
AttributeError: If backend set_params/load are missing when required.

Parameters:

metadata (Mapping[str, Any])
load (bool)

Return type:

See also

to_metadata
from_json
from_yaml

classmethod from_yaml(path, *, load=True)[source]#

Load metadata from YAML and construct an index (requires PyYAML).

Parameters:

path (str | PathLike[str])
load (bool)

Return type:

get_distance(i, j) → float#

Return the distance between two stored items.

Parameters:

i, jint: Item ids (index) of two stored samples.

Returns:

dfloat: Distance between items i and j under the current metric.

Raises:

RuntimeError: If the index is not initialized.
IndexError: If either index is out of range.
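
For intuition about what "distance under the current metric" means, here is a plain-Python sketch of three of the metrics. The angular formula follows upstream Annoy's documented definition, sqrt(2 * (1 - cos(u, v))); whether this wrapper returns bit-identical values is an assumption, and the function names are illustrative only.

```python
import math


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def angular(u, v):
    """Euclidean distance between the L2-normalized vectors.

    Equal to sqrt(2 * (1 - cos(u, v))). Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to absorb tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))
```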

get_feature_names_out(input_features=None)#

Get output feature names for the transformer-style API.

Parameters:

input_featuressequence of str or None, optional, default=None: If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.

Returns:

tuple of str: Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.

Raises:

AttributeError: If called before fit/build.
ValueError: If input_features is provided but does not match feature_names_in_.
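
The naming scheme in the Returns entry above can be sketched as follows. The helper name and the check for non-positive values are assumptions for illustration.

```python
def neighbor_feature_names(n_neighbors: int) -> tuple:
    """Build ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    if n_neighbors <= 0:
        # Assumed guard; the documented method raises on unfitted or mismatched input.
        raise ValueError("n_neighbors must be positive")
    return tuple(f"neighbor_{i}" for i in range(n_neighbors))
```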

get_item_vector(i) → list[float]#

Return the stored embedding vector for a given item id.

Parameters:

iint: Item id (index) previously passed to add_item.

Returns:

vectorlist[float]: Stored embedding of length f.

Raises:

RuntimeError: If the index is not initialized.
IndexError: If i is out of range.

get_item_vectors(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, return_ids=False, validate_vector_len=True)[source]#

Fetch many vectors as a dense NumPy array.

Parameters:

idssequence of int or iterable of int, optional: Ids to fetch. If None, selects range(start, stop or n_items).
dtypenumpy dtype, default=numpy.float32: Output dtype.
start, stopint, optional: Range selection used when ids is None.
n_rowsint, optional: Required when ids is a non-sized iterable (e.g., generator).
return_idsbool, default=False: If True, also return the realized ids (int64) in row order.
validate_vector_lenbool, default=True: If True, verify every fetched vector has length f.

Returns:

Xnumpy.ndarray of shape (n_rows, f): Dense matrix of vectors.
ids_outnumpy.ndarray of shape (n_rows,), optional: Returned when return_ids=True.

Raises:

ValueError: If the id selection is inconsistent or vectors have unexpected length.
TypeError: If ids is a non-sized iterable and n_rows is not provided.

Parameters:

ids (Sequence[int] | Iterable[int] | None)
dtype (Any)
start (int)
stop (int | None)
n_rows (int | None)
return_ids (bool)
validate_vector_len (bool)

Return type:

See also

to_numpy: Dense NumPy export alias.
iter_item_vectors: Streaming export without allocating a dense matrix.
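
The id-selection rules above (range selection when ids is None, and the n_rows requirement for non-sized iterables) can be sketched in plain Python. resolve_ids is a hypothetical helper written for this example, not part of the documented API.

```python
from collections.abc import Sized


def resolve_ids(ids=None, *, start=0, stop=None, n_items=0, n_rows=None):
    """Return the concrete list of ids to fetch (hypothetical helper)."""
    if ids is None:
        # Mirrors the documented range(start, stop or n_items).
        return list(range(start, stop or n_items))
    if isinstance(ids, Sized):
        return [int(i) for i in ids]
    # Non-sized iterable (e.g. a generator): the row count must be known up front
    # so that a dense (n_rows, f) array could be allocated.
    if n_rows is None:
        raise TypeError("n_rows is required when ids is a non-sized iterable")
    out = [int(i) for i in ids]
    if len(out) != n_rows:
        raise ValueError("ids yielded a different number of rows than n_rows")
    return out
```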

get_n_items() → int#

Return the number of stored items in the index.

Returns:

n_itemsint: Number of items that have been added and are currently addressable.

Raises:

RuntimeError: If the index is not initialized.

get_n_trees() → int#

Return the number of trees in the current forest.

Returns:

n_treesint: Number of trees that have been built.

Raises:

RuntimeError: If the index is not initialized.

get_nns_by_item(i, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a stored item id.

Parameters:

iint: Item id (index) previously passed to add_item(i, embedding).
nint: Number of nearest neighbours to return.
search_kint, optional, default=-1: Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
include_distancesbool, optional, default=False: If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:

indiceslist[int] | tuple[list[int], list[float]]: If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:

RuntimeError: If the index is not initialized or has not been built.
IndexError: If i is out of range.

See also

get_nns_by_vector: Query with an explicit query embedding.
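
To make the search_k default concrete, the following sketch computes the approximate node budget implied by search_k=-1. The helper name is hypothetical, and the product is only the approximate default described above.

```python
def effective_search_k(search_k: int, n: int, n_trees: int) -> int:
    """Approximate node budget: n_trees * n when search_k is left at -1."""
    return n_trees * n if search_k == -1 else search_k
```

For example, asking for 10 neighbours from a 50-tree forest inspects roughly 500 nodes by default; passing a larger explicit search_k trades query time for recall.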

get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a query embedding.

Parameters:

vectorsequence of float: Query embedding of length f.
nint: Number of nearest neighbours to return.
search_kint, optional, default=-1: Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
include_distancesbool, optional, default=False: If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:

indiceslist[int] | tuple[list[int], list[float]]: If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:

RuntimeError: If the index is not initialized or has not been built.
ValueError: If len(vector) != f.

See also

get_nns_by_item: Query by stored item id.

get_params(deep=True) → dict#

Return estimator-style parameters (scikit-learn compatibility).

Parameters:

deepbool, optional, default=True: Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.

Returns:

paramsdict: Dictionary of stable, user-facing parameters.

See also

set_params: Set estimator-style parameters.
schema_version: Controls pickle / snapshot strategy.

Notes

This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.

info(include_n_items=True, include_n_trees=True, include_memory=None) → dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Parameters:

include_n_itemsbool, optional, default=True

If True, include n_items.

include_n_treesbool, optional, default=True

If True, include n_trees.

include_memorybool or None, optional, default=None

Controls whether memory usage fields are included.

None: include memory usage only if the index is built.
True: include memory usage if available (built).
False: omit memory usage fields.

Memory usage is computed after build and may be expensive for very large indexes.

Returns:

infodict: Dictionary describing the current index state.

See also

serialize: Create a binary snapshot of the index.
deserialize: Restore from a binary snapshot.
save: Persist the index to disk.
load: Load the index from disk.

Notes

Some keys are optional depending on include_* flags.

Keys:

fint, default=0
Dimensionality of the index.
metricstr, default=‘angular’
Distance metric name.
on_disk_pathstr, default=‘’
Path used for on-disk build, if configured.
prefaultbool, default=False
If True, aggressively fault pages into memory during save. Primarily useful on some platforms for very large indexes.
schema_versionint, default=0
Stored schema/version marker on this object (reserved for future use).
seedint or None, optional, default=None
Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created.
verboseint or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:
- <= 0 : quiet (warnings only)
- 1 : info (Annoy’s verbose=True)
- >= 2 : debug (currently same as info, reserved for future use)

Optional Keys:

n_itemsint
Number of items currently stored.
n_treesint
Number of built trees in the forest.
memory_usage_byteint
Approximate memory usage in bytes. Present only when requested and available.
memory_usage_mibfloat
Approximate memory usage in MiB. Present only when requested and available.

Examples

>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000
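
The verbose clamping rule listed under Keys can be sketched as below. The helper name and the handling of None are assumptions for illustration.

```python
def normalize_verbose(level):
    """Clamp a verbosity level to [-2, 2] and derive the logging switch.

    Returns (clamped_level, annoy_verbose), where annoy_verbose is True
    only for levels >= 1.
    """
    if level is None:
        # Assumed: an unset level leaves logging disabled.
        return None, False
    clamped = max(-2, min(2, int(level)))
    return clamped, clamped >= 1
```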

iter_item_vectors(ids=None, *, start=0, stop=None, with_ids=True, dtype=None)[source]#

Iterate vectors without allocating a dense matrix.

Parameters:

ids, start, stop: Selection controls. See get_item_vectors.
with_idsbool, default=True: If True, yield (id, vector). If False, yield vectors only.
dtypenumpy dtype, optional: If provided, cast output vectors to this dtype.

Yields:

(id, vector) or vector: Each vector is returned as a 1D NumPy array.

Parameters:

ids (Sequence[int] | Iterable[int] | None)
start (int)
stop (int | None)
with_ids (bool)
dtype (Any | None)

Return type:

Iterator[ndarray | tuple[int, ndarray]]

See also

get_item_vectors: Dense export.

kneighbors(X, n_neighbors=5, *, search_k=-1, include_distances=True, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='vector')[source]#

Find k nearest neighbors for one or more query vectors.

This is a sklearn-like convenience wrapper that returns rectangular arrays.

Parameters:

Xarray-like of shape (f,) or (n_queries, f): Query vector(s).
n_neighborsint, default=5: Number of neighbors to return per query.
search_kint, default=-1: Search parameter forwarded to the backend.
include_distancesbool, default=True: If True, return (neighbors, distances). Otherwise return neighbors.
exclude_selfbool, default=False: If True, apply the same deterministic self-exclusion rule as query_by_vector for each query row.
exclude_item_idsiterable of int, optional: Exclude these ids for every query.
ensure_all_finitebool or ‘allow-nan’, default=True: Input validation option forwarded to scikit-learn.
copybool, default=False: Input validation option forwarded to scikit-learn.
output_type{‘item’, ‘vector’}, default=‘vector’: If ‘item’, return neighbor ids. If ‘vector’, return neighbor vectors.

Returns:

neighborsnumpy.ndarray: If output_type='item', shape is (n_queries, n_neighbors). If output_type='vector', shape is (n_queries, n_neighbors, f).
distancesnumpy.ndarray of shape (n_queries, n_neighbors): Neighbor distances. Returned when include_distances=True.

Raises:

sklearn.exceptions.NotFittedError: If the backend reports that the index is unbuilt.
ValueError: If n_neighbors <= 0 or any query yields too few neighbors after exclusions.

Parameters:

X (Any)
n_neighbors (int)
search_k (int)
include_distances (bool)
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)
output_type (Literal['item', 'vector'])

Return type:

See also

query_by_vector: Per-query 1D interface.
kneighbors_graph: CSR kNN graph.

kneighbors_graph(X, n_neighbors=5, *, search_k=-1, mode='connectivity', exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='item')[source]#

Compute the k-neighbors graph (CSR) for query vectors.

Parameters:

Xarray-like of shape (f,) or (n_queries, f): Query vector(s).
n_neighborsint, default=5: Number of neighbors per query.
search_kint, default=-1: Search parameter forwarded to the backend.
mode{‘connectivity’, ‘distance’}, default=‘connectivity’: If ‘connectivity’, graph entries are 1. If ‘distance’, entries are backend distances.
exclude_selfbool, default=False: If True, apply the same deterministic self-exclusion rule as kneighbors for each query row.
exclude_item_idsiterable of int, optional: Exclude these ids for every query.
ensure_all_finitebool or ‘allow-nan’, default=True: Input validation option forwarded to scikit-learn.
copybool, default=False: Input validation option forwarded to scikit-learn.
output_type{‘item’}, default=‘item’: Must be ‘item’ for CSR construction.

Returns:

graphscipy.sparse.csr_matrix: CSR matrix of shape (n_queries, n_items).

Raises:

ImportError: If SciPy is not installed.
ValueError: If mode is invalid or output_type != 'item'.
RuntimeError: If the backend returns an out-of-range neighbor id.

Parameters:

X (Any)
n_neighbors (int)
search_k (int)
mode (Literal['connectivity', 'distance'])
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)
output_type (Literal['item', 'vector'])

Return type:

See also

kneighbors: Dense kNN results.
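
To show how per-query neighbor ids map onto a CSR layout, here is a minimal sketch that builds the three CSR arrays by hand. build_csr is a hypothetical helper; the documented method presumably does the equivalent internally, but that is an assumption.

```python
def build_csr(neighbors, distances=None, *, mode="connectivity"):
    """Build CSR arrays (data, indices, indptr) from per-query neighbor ids."""
    if mode not in ("connectivity", "distance"):
        raise ValueError(f"invalid mode: {mode!r}")
    if mode == "distance" and distances is None:
        raise ValueError("distances are required when mode='distance'")
    data, indices, indptr = [], [], [0]
    for row, ids in enumerate(neighbors):
        for col, item_id in enumerate(ids):
            indices.append(item_id)
            # Connectivity graphs store 1; distance graphs store the distance.
            data.append(1.0 if mode == "connectivity" else distances[row][col])
        # Each row ends where the running count of stored entries stands.
        indptr.append(len(indices))
    return data, indices, indptr
```

The resulting arrays can be passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_queries, n_items)) when SciPy is installed.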

load(fn, prefault=None)#

Load (mmap) an index from disk into the current object.

Parameters:

fnstr: Path to a file previously created by save or on_disk_build.
prefaultbool or None, optional, default=None: If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

IOError: If the file cannot be opened or mapped.
RuntimeError: If the index is not initialized or the file is incompatible.

See also

save: Save the current index to disk.
on_disk_build: Build directly using an on-disk backing file.
unload: Release mmap resources.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.

classmethod load_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#

Load a directory bundle created by save_bundle.

Parameters:

manifest_filename: Filename for the metadata manifest inside the directory.
index_filename: Filename for the Annoy index inside the directory.
prefault: Forwarded to load_index.

Returns:

index: Newly constructed index.

Raises:

AttributeError: If from_json is not available (compose with MetaMixin).
TypeError: If from_json returns an unexpected type.
OSError: On filesystem failures.

Parameters:

manifest_filename (str)
index_filename (str)
prefault (bool | None)

Return type:

classmethod load_index(f, metric, path, *, prefault=None)[source]#

Load (mmap) an Annoy index file into this object.

Parameters:

f: Vector dimension for construction.
metric: Metric name for construction.
pathstr or os.PathLike: Path to a file previously created by save_index or the backend save.
prefault: Forwarded to the backend. If None, the backend default is used.

Raises:

AttributeError: If the backend does not provide load(path, prefault=...).
OSError: If loading fails (backend or filesystem).

Parameters:

f (int)
metric (str)
path (str | PathLike[str])
prefault (bool | None)

Return type:

memory_usage() → int#

Approximate memory usage of the index in bytes.

Returns:

n_bytesint or None: Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.

Raises:

RuntimeError: If memory usage cannot be computed.

metric#

Distance metric for the index. Valid values:

‘angular’ -> Cosine-like distance on normalized vectors.
‘euclidean’ -> L2 distance.
‘manhattan’ -> L1 distance.
‘dot’ -> Negative dot-product distance (inner product).
‘hamming’ -> Hamming distance for binary vectors.

Aliases (case-insensitive):

angular : cosine
euclidean : l2, lstsq
manhattan : l1, cityblock, taxicab
dot : @, ., dotproduct, inner, innerproduct
hamming : hamming

Returns:

str or None: Canonical metric name, or None if not configured yet.
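
The alias table above amounts to a case-insensitive lookup. A minimal sketch, assuming whitespace is stripped and unknown names raise ValueError (both are assumptions; the helper name is illustrative):

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Invert the table once: alias -> canonical name.
_LOOKUP = {alias: name for name, aliases in _ALIASES.items() for alias in aliases}


def canonical_metric(name: str) -> str:
    """Map a metric name or alias to its canonical form, ignoring case."""
    key = name.strip().lower()
    try:
        return _LOOKUP[key]
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None
```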

Notes

Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.

For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index, dropping all items, trees, and label metadata (y_map and y). You must call fit (or add_item + build) again before querying.

n_features#: Alias of f (dimension), provided for scikit-learn naming parity.

n_features_#: Read-only alias of n_features_in_.

n_features_in_#: Number of features seen during fit (scikit-learn compatible). Alias of f when available.

n_features_out_#: Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.

n_neighbors#: Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_build(fn)#

Configure the index to build using an on-disk backing file.

Parameters:

fnstr: Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:

Annoy: This instance (self), enabling method chaining.

See also

build: Build trees after adding items (on-disk backed).
rebuild: Return a new Annoy index rebuilt from the current index contents.
fit: Build the index from X (preferred if you already have X available).
load: Memory-map the built index.
save: Persist the built index to disk.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

on_disk_path#

Path used for on-disk build/load/save operations.

Returns:

str or None: Filesystem path used for on-disk operations, or None if not configured.

See also

on_disk_build
load
unload

Notes

Assigning a string/PathLike to on_disk_path configures on-disk build mode and is strictly equivalent to calling on_disk_build with the same filename.
Note: the Annoy core truncates the target file when on-disk build is enabled, so any existing contents at that path are lost.
Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.
Clearing/changing this while an on-disk index is active is disallowed. Call unload first.

property pickle_mode: Literal['auto', 'disk', 'byte']#: Persistence strategy used by PickleMixin.

plot_index(labels=None, *, ids=None, projection='pca', dims=(0, 1), center=True, maxabs=False, l2_normalize=False, dtype=<class 'numpy.float32'>, ax=None, title=None, plot_kwargs=None)[source]#

Plot this index as a 2D scatter plot.

This is a thin wrapper around plot_annoy_index that uses _plotting_backend.

Parameters:

labels, ids, projection, dims, center, maxabs, l2_normalize, dtype, ax, title, plot_kwargs: See plot_annoy_index.

Returns:

y2, ids_out, ax: See plot_annoy_index.

Parameters:

labels (Sequence[Any] | None)
ids (Sequence[int] | None)
projection (str)
dims (tuple[int, int])
center (bool)
maxabs (bool)
l2_normalize (bool)
dtype (Any)
ax (Any)
title (str | None)
plot_kwargs (Mapping[str, Any] | None)

Return type:

tuple[ndarray, ndarray, Any]

See also

plot_annoy_index: Low-level plotting helper this method delegates to.
plot_knn_edges: Overlay kNN edges on the returned 2D coordinates.

Notes

This method does not mutate the index.
Plotting backends (e.g. Matplotlib) are imported lazily and are only required when this method is called.
The returned ids_out corresponds to the item id for each row in y2.

Examples

>>> import numpy as np
>>> import scikitplot.annoy as skann
>>> idx = skann.Index(f=10, metric="angular")
>>> # ... add items & build ...
>>> labels = np.zeros(idx.get_n_items(), dtype=int)
>>> y2, ids, ax = idx.plot_index(labels=labels, projection="pca")

plot_knn_edges(y2, *, ids=None, k=10, search_k=-1, ax=None, line_kwargs=None, undirected=True)[source]#

Overlay kNN edges onto an existing 2D index plot.

This is a thin wrapper around plot_annoy_knn_edges that uses _plotting_backend.

Parameters:

y2, ids, k, search_k, ax, line_kwargs, undirected: See plot_annoy_knn_edges.

Returns:

ax: The axes that were drawn on.

Parameters:

y2 (ndarray)
ids (Sequence[int] | None)
k (int)
search_k (int)
ax (Any)
line_kwargs (Mapping[str, Any] | None)
undirected (bool)

Return type:

See also

plot_annoy_knn_edges: Low-level edge overlay helper this method delegates to.
plot_index: Computes the 2D coordinates used as input to this method.

Notes

y2 must represent 2D coordinates with shape (n_samples, 2).
If ids is provided, it must have length n_samples.
This method does not mutate the index; it only performs neighbor queries to draw edges.

Examples

>>> y2, ids, ax = idx.plot_index(labels=np.zeros(idx.get_n_items(), dtype=int))
>>> idx.plot_knn_edges(y2, ids=ids, k=5, line_kwargs={"alpha": 0.15})

prefault#

Default prefault flag stored on the object.

This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.

Returns:

bool: Current prefault flag.

Notes

This flag does not retroactively change already-loaded mappings.
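
The fallback between a per-call prefault argument and the stored flag can be sketched in one line. The helper name is hypothetical.

```python
def effective_prefault(call_prefault, stored_prefault: bool) -> bool:
    """A per-call value wins; None falls back to the flag stored on the object."""
    return stored_prefault if call_prefault is None else bool(call_prefault)
```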

query_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#

Query neighbors by stored item id.

Parameters:

itemint: Stored item id.
n_neighborsint: Number of neighbors to return after applying exclusions.
search_kint, default=-1: Search parameter forwarded to the backend.
include_distancesbool, default=False: If True, also return distances.
exclude_selfbool, default=False: If True, exclude item from the returned neighbors.
exclude_item_idsiterable of int, optional: Additional item ids to exclude.
ensure_all_finitebool or ‘allow-nan’, default=True: Input validation option forwarded to scikit-learn.
copybool, default=False: Input validation option forwarded to scikit-learn.

Returns:

indicesnumpy.ndarray of shape (n_neighbors,): Neighbor ids.
(indices, distances)tuple of numpy.ndarray: Returned when include_distances=True.

Raises:

sklearn.exceptions.NotFittedError: If the backend reports that the index is unbuilt.
ValueError: If n_neighbors <= 0 or not enough neighbors remain after exclusions.

Parameters:

item (int)
n_neighbors (int)
search_k (int)
include_distances (bool)
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)

Return type:

See also

query_by_vector: Query neighbors by an explicit vector.
kneighbors: Batch neighbor queries (sklearn-like).

Notes

Exclusions are applied deterministically in the order returned by the backend.

query_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#

Query neighbors by an explicit vector.

Parameters:

vectorarray-like of shape (f,): Query vector.
n_neighborsint: Number of neighbors to return after exclusions.
search_kint, default=-1: Search parameter forwarded to the backend.
include_distancesbool, default=False: If True, also return distances.
exclude_selfbool, default=False: If True, exclude the first returned candidate whose distance is exactly 0.0. This is intended for queries where vector comes from the index itself.
exclude_item_idsiterable of int, optional: Additional item ids to exclude.
ensure_all_finitebool or ‘allow-nan’, default=True: Input validation option forwarded to scikit-learn.
copybool, default=False: Input validation option forwarded to scikit-learn.

Returns:

indicesnumpy.ndarray of shape (n_neighbors,): Neighbor ids.
(indices, distances)tuple of numpy.ndarray: Returned when include_distances=True.

Raises:

sklearn.exceptions.NotFittedError: If the backend reports that the index is unbuilt.
ValueError: If n_neighbors <= 0, vector dimension mismatches f, or not enough neighbors remain after exclusions.

Parameters:

vector (Any)
n_neighbors (int)
search_k (int)
include_distances (bool)
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)

Return type:

See also

query_by_item: Query neighbors by stored item id.
kneighbors: Batch neighbor queries (sklearn-like).

Notes

Exclusions are applied deterministically in the order returned by the backend. If exclude_self=True and no exact 0.0 distance candidate is returned in the first position, no additional self-exclusion is applied.
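
The exclusion rules in this note can be sketched as a filter over the backend's ordered candidates. apply_exclusions is a hypothetical helper, and applying self-exclusion before the id filter is an assumption about ordering.

```python
def apply_exclusions(ids, distances, n_neighbors, *,
                     exclude_self=False, exclude_item_ids=None):
    """Filter backend candidates in order and keep the first n_neighbors."""
    if n_neighbors <= 0:
        raise ValueError("n_neighbors must be positive")
    pairs = list(zip(ids, distances))
    # Self-exclusion only fires when the FIRST candidate is at distance exactly 0.0.
    if exclude_self and pairs and pairs[0][1] == 0.0:
        pairs = pairs[1:]
    banned = set(exclude_item_ids or ())
    kept = [(i, d) for i, d in pairs if i not in banned][:n_neighbors]
    if len(kept) < n_neighbors:
        raise ValueError("not enough neighbors remain after exclusions")
    return [i for i, _ in kept], [d for _, d in kept]
```

Because exclusions shrink the candidate list, a caller typically needs to request more than n_neighbors candidates from the backend to avoid the ValueError.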

query_vectors_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#

Query neighbor vectors by stored item id.

This is a convenience wrapper over query_by_item that materializes vectors using the backend’s get_item_vector.

Parameters:

item, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids: See query_by_item.
ensure_all_finite, copy: See query_by_vector.
dtypenumpy dtype, default=numpy.float32: Output dtype for the returned vectors.
output_type{‘item’, ‘vector’}, default=‘vector’: If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.

Returns:

vectorsnumpy.ndarray of shape (n_neighbors, f): Neighbor vectors.
(vectors, distances)tuple: Returned when include_distances=True.

Parameters:

item (int)
n_neighbors (int)
search_k (int)
include_distances (bool)
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)
dtype (Any)
output_type (Literal['item', 'vector'])

Return type:

See also

query_vectors_by_vector: Vector query returning vectors (or ids).

query_vectors_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#

Query neighbor vectors by an explicit vector.

Convenience wrapper over query_by_vector. By default it returns vectors; set output_type='item' to return neighbor ids instead.

Parameters:

vector, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids: See query_by_vector.
ensure_all_finite, copy: See query_by_vector.
dtypenumpy dtype, default=numpy.float32: Output dtype for the returned vectors.
output_type{‘item’, ‘vector’}, default=‘vector’: If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.

Returns:

neighborsnumpy.ndarray: If output_type='vector', an array of shape (n_neighbors, f). If output_type='item', an array of shape (n_neighbors,).
(neighbors, distances)tuple: Returned when include_distances=True.

Parameters:

vector (Any)
n_neighbors (int)
search_k (int)
include_distances (bool)
exclude_self (bool)
exclude_item_ids (Iterable[int] | None)
ensure_all_finite (bool | Literal['allow-nan'])
copy (bool)
dtype (Any)
output_type (Literal['item', 'vector'])

Return type:

See also

query_vectors_by_item: Item id query returning vectors.
query_by_vector: Per-query id interface.

random_state#: Alias of seed (scikit-learn convention).

rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) → Annoy#

Return a new Annoy index rebuilt from the current index contents.

This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.

Parameters:

metric{‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’} or None, optional

Metric for the new index. If None, reuse the current metric.

on_disk_pathpath-like or None, optional

Optional on-disk build path for the new index.

Safety: the source object’s on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source’s configured path, it is ignored to avoid accidental overwrite/truncation hazards.

n_treesint or None, optional

If provided, build the new index with this number of trees (or -1 for Annoy’s internal auto mode). If None, reuse the source’s tree count only when the source index is already built; otherwise do not build.

n_jobsint, optional, default=-1

Number of threads to use while building (-1 means “auto”).

Returns:

Annoy: A new Annoy instance containing the same items (and label metadata if present).

See also

build: Build trees after adding items (on-disk backed).
on_disk_build: Configure on-disk build mode.
fit: Build the index from X (preferred if you already have X available).
get_params: Read constructor parameters.
set_params: Update estimator parameters (use with fit(X) when refitting from data).
serialize, deserialize: Persist / restore indexes; canonical restores rebuild deterministically.
__sklearn_clone__: Unfitted clone hook (no fitted state).

Notes

rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.

Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).

repr_info(include_n_items=True, include_n_trees=True, include_memory=None) → str#

Return a dict-like string representation with optional extra fields.

Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory is calculated after build.

save(fn, prefault=None)#

Persist the index to a binary file on disk.

Parameters:

fnstr: Path to the output file. Existing files will be overwritten.
prefaultbool or None, optional, default=None: If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

IOError: If the file cannot be written.
RuntimeError: If the index is not initialized or save fails.

See also

load: Load an index from disk.
on_disk_build: Configure on-disk build mode.
serialize: Snapshot to bytes for in-memory persistence.
deserialize: Restore an index from a serialized byte string.

Notes

The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.

save_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#

Save a directory bundle containing metadata + the index file.

The bundle contains:

manifest.json: metadata payload produced by to_json.
index.ann: Annoy index produced by save_index.

Parameters:

manifest_filename: Filename for the metadata manifest inside the directory.
index_filename: Filename for the Annoy index inside the directory.
prefault: Forwarded to save_index.

Raises:

AttributeError: If to_json is not available (compose with MetaMixin).
OSError: On filesystem failures.

Parameters:

manifest_filename (str)
index_filename (str)
prefault (bool | None)

Return type:

list[str]
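
A minimal sketch of the bundle layout described above, using only the standard library. The helper name write_bundle, its explicit directory argument, and the assumption that the returned list holds the written paths are all hypothetical; the real method obtains the manifest from to_json and the index file from save_index.

```python
import json
import os
import tempfile


def write_bundle(directory, metadata, index_bytes,
                 manifest_filename="manifest.json", index_filename="index.ann"):
    """Hypothetical helper: write a manifest and an index file side by side."""
    os.makedirs(directory, exist_ok=True)
    manifest_path = os.path.join(directory, manifest_filename)
    index_path = os.path.join(directory, index_filename)
    # Stand-in for the to_json output.
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2, sort_keys=True)
    # Stand-in for the file produced by save_index.
    with open(index_path, "wb") as fh:
        fh.write(index_bytes)
    # Assumption: the list[str] return value names the files written.
    return [manifest_path, index_path]


with tempfile.TemporaryDirectory() as tmp:
    written = write_bundle(tmp, {"f": 8, "metric": "angular"}, b"\x00\x01")
    names = sorted(os.path.basename(p) for p in written)
```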

save_index(path, *, prefault=None)[source]#

Persist the Annoy index to disk.

Parameters:

pathstr or os.PathLike: Destination path for the Annoy index file.
prefault: Forwarded to the backend. If None, the backend default is used.

Raises:

AttributeError: If the backend does not provide save(path, prefault=...).
OSError: For filesystem-level failures.

Parameters:

path (str | PathLike[str])
prefault (bool | None)

Return type:

schema_version#

Sentinel value that marks the serialization/compatibility strategy.

This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.

Returns:

int: Current schema version marker.

Notes

0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical; canonical is used as a fallback.
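
The mapping in these notes can be written down as a small lookup. This is only an illustration of the table; the function name snapshots_for_schema is hypothetical, and rejecting negative values is an assumption the notes do not state.

```python
from __future__ import annotations


def snapshots_for_schema(schema_version: int) -> tuple[str, ...]:
    """Hypothetical helper mirroring the schema_version notes."""
    if schema_version < 0:
        # Assumption: negative markers are treated as invalid.
        raise ValueError("schema_version must be non-negative")
    if schema_version in (0, 1):
        return ("portable-v1",)
    if schema_version == 2:
        return ("canonical-v1",)
    # >= 3: both snapshots are stored; canonical acts as the fallback.
    return ("portable-v1", "canonical-v1")
```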

seed#: Random seed override (scikit-learn compatible). None means use Annoy default seed.

serialize(format=None) → bytes#

Serialize the built in-memory index into a byte string.

Parameters:

format{“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format.

“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:

databytes: Opaque binary blob containing the Annoy index.

Raises:

RuntimeError: If the index is not initialized or serialization fails.
OverflowError: If the serialized payload is too large to fit in a Python bytes object.

See also

deserialize: Restore an index from a serialized byte string.
on_disk_build: Configure on-disk build mode.

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.

set_params(**params) → Annoy#

Set estimator-style parameters (scikit-learn compatibility).

Parameters:

**params: Keyword parameters to set. Unknown keys raise ValueError.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

ValueError: If an unknown parameter name is provided.
TypeError: If parameter names are not strings or types are invalid.

See also

get_params: Return estimator-style parameters.

Notes

Changing structural parameters (notably metric) on an already initialized index resets the index deterministically: all items, trees, and label metadata (y_map and y) are dropped. Refit/rebuild is required before querying.

This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.
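
The reset-on-structural-change behavior can be illustrated with a toy class. ToyIndex is a hypothetical stand-in, not the library's Annoy class, and it treats only metric as structural because that is the one parameter the notes name.

```python
class ToyIndex:
    """Hypothetical stand-in that mimics the documented set_params contract."""

    # Assumption: only "metric" is treated as structural in this sketch.
    _STRUCTURAL = frozenset({"metric"})

    def __init__(self, f, metric="angular", verbose=0):
        self.f = f
        self.metric = metric
        self.verbose = verbose
        self.items = {}      # item id -> vector
        self.n_trees = 0

    def get_params(self):
        return {"f": self.f, "metric": self.metric, "verbose": self.verbose}

    def set_params(self, **params):
        current = self.get_params()
        for name in params:
            if name not in current:
                raise ValueError(f"unknown parameter: {name!r}")
        changed = {k for k, v in params.items() if current[k] != v}
        for name, value in params.items():
            setattr(self, name, value)
        if changed & self._STRUCTURAL:
            # Learned state is invalidated: drop items and trees.
            self.items.clear()
            self.n_trees = 0
        return self
```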

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:

seedint or None, optional, default=None

Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

If omitted (or None, NULL), the seed is set to Annoy’s default seed.
If 0, any pending override is cleared and the seed is reset to Annoy’s default seed (a UserWarning is emitted).

Returns:

Annoy: This instance (self), enabling method chaining.

See also

seed: Parameter attribute (int | None).

Notes

Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.

set_verbose(verbosity=1)#

Set the verbosity level (callable setter).

This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.

Parameters:

verbosityint, optional, default=1

Verbosity level. Values are clamped to the range [-2, 2]. A level >= 1 enables Annoy’s verbose logging; a level <= 0 disables it. The levels are inspired by gradient-boosting libraries:

<= 0 : quiet (warnings only)
1 : info (Annoy’s verbose=True)
>= 2 : debug (currently same as info, reserved for future use)

Returns:

Annoy: This instance (self), enabling method chaining.

See also

verbose: Parameter attribute (int | None).
set_verbosity: Alias of set_verbose.
get_params, set_params: Estimator parameter API.
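
The clamping and level mapping described for set_verbose amount to the following. The function name normalize_verbosity is hypothetical; it returns the clamped level together with the boolean that would be handed to Annoy's verbose flag.

```python
from __future__ import annotations


def normalize_verbosity(verbosity: int = 1) -> tuple[int, bool]:
    """Hypothetical helper: clamp to [-2, 2] and derive the on/off flag."""
    level = max(-2, min(2, int(verbosity)))
    # level >= 1 enables verbose logging; level <= 0 disables it.
    return level, level >= 1
```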

set_verbosity(level=1)#

Alias of set_verbose.

See also

verbose: Parameter attribute (int | None).
set_verbose: Set the verbosity level (callable setter).

to_bytes(format=None)[source]#

Serialize the built index to bytes (backend serialize).

Parameters:

format{“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format. If None, "canonical" is used.

“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:

data: Serialized index bytes.

Raises:

AttributeError: If the backend does not provide serialize.
RuntimeError: If serialization fails.
TypeError: If the backend returns non-bytes-like data.

Return type:

bytes

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.
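
A rebuildable "canonical" payload stores vectors and build parameters rather than a memory snapshot. The sketch below shows one possible encoding with fixed little-endian float32 values; the layout and the names pack_canonical and unpack_canonical are assumptions for illustration, not the library's wire format.

```python
import json
import struct


def pack_canonical(vectors, *, f, metric, n_trees, seed):
    """Encode build parameters (JSON) followed by float32 item vectors."""
    params = json.dumps(
        {"f": f, "metric": metric, "n_trees": n_trees, "seed": seed},
        sort_keys=True,
    ).encode("utf-8")
    parts = [struct.pack("<II", len(params), len(vectors)), params]
    for vec in vectors:
        if len(vec) != f:
            raise ValueError("every vector must have length f")
        parts.append(struct.pack(f"<{f}f", *vec))
    return b"".join(parts)


def unpack_canonical(blob):
    """Decode parameters and vectors; item ids are the list positions."""
    n_params, n_items = struct.unpack_from("<II", blob, 0)
    offset = 8
    params = json.loads(blob[offset:offset + n_params].decode("utf-8"))
    offset += n_params
    f = params["f"]
    vectors = []
    for _ in range(n_items):
        vectors.append(list(struct.unpack_from(f"<{f}f", blob, offset)))
        offset += 4 * f
    return params, vectors
```

A restore step would then re-add each vector under its original id and rebuild the trees with the stored parameters, which is where the extra load time comes from.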

to_json(path=None, *, indent=2, sort_keys=True, ensure_ascii=False, include_info=True, strict=True)[source]#

Serialize to_metadata to JSON.

Parameters:

path: If provided, write the JSON to this path atomically.
indent: Indentation level passed to json.dumps.
sort_keys: If True, sort keys for stable output.
ensure_ascii: If True, escape non-ASCII characters.
include_info, strict: Forwarded to to_metadata.

Returns:

json_str: JSON representation of the metadata.

Raises:

TypeError: If the exported metadata contains non-JSON-serializable values.

Parameters:

path (str | PathLike[str] | None)
indent (int)
sort_keys (bool)
ensure_ascii (bool)
include_info (bool)
strict (bool)

Return type:

str

See also

from_json
to_metadata
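
Writing the JSON "atomically" is commonly done by writing to a temporary file in the target directory and renaming it over the destination. The sketch below shows that pattern with the standard library; dump_json_atomic is a hypothetical name, and whether the library uses this exact approach is an assumption.

```python
import json
import os
import tempfile


def dump_json_atomic(payload, path, *, indent=2, sort_keys=True,
                     ensure_ascii=False):
    """Serialize payload to JSON and replace path in a single rename."""
    # json.dumps raises TypeError for non-JSON-serializable values.
    text = json.dumps(payload, indent=indent, sort_keys=sort_keys,
                      ensure_ascii=ensure_ascii)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return text
```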

to_metadata(*, include_info=True, strict=True)[source]#

Export a serializable metadata payload.

Parameters:

include_info: If True, include an info() mapping when available.
strict: If True, failures while collecting the optional info() mapping propagate as exceptions.

Returns:

metadata: A JSON/YAML-serializable mapping containing configuration parameters and optional info.

Raises:

RuntimeError: If _META_SCHEMA_VERSION is missing on the concrete class.
TypeError: If get_params does not return a mapping.
AttributeError: If neither the instance nor the backend implements get_params.
TypeError: If a persistence knob (e.g., pickle_mode) is not JSON/YAML-serializable.

Parameters:

include_info (bool)
strict (bool)

Return type:

IndexMetadata

See also

to_json
to_yaml

to_numpy(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#

Export vectors to a dense NumPy array.

See also

get_item_vectors: Dense export with optional id output.
iter_item_vectors: Streaming export.
to_scipy_csr: Export as SciPy CSR.
to_pandas: Export as pandas DataFrame.

Notes

This is an alias of get_item_vectors with return_ids=False.

Parameters:

ids (Sequence[int] | Iterable[int] | None)
dtype (Any)
start (int)
stop (int | None)
n_rows (int | None)
validate_vector_len (bool)

Return type:

ndarray

to_pandas(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, id_location='index', id_name='id', columns=None, validate_vector_len=True)[source]#

Export vectors to a pandas DataFrame.

Parameters:

ids, start, stop, n_rows: Selection controls. See get_item_vectors.
dtypenumpy dtype, default=numpy.float32: Output dtype.
id_location{‘index’, ‘column’, ‘both’, ‘none’}, default=’index’: Where to place ids in the output.
id_namestr, default=’id’: Name used for the id column / index.
columnssequence of str, optional: Column names for vector dimensions. If None, uses feature_names_in_ when present and length matches f; otherwise uses feature_0..feature_{f-1}.
validate_vector_lenbool, default=True: If True, verify every fetched vector has length f.

Returns:

dfpandas.DataFrame: DataFrame with shape (n_rows, f) plus optional id metadata.

Raises:

ImportError: If pandas is not installed.
ValueError: If id_location is invalid or columns length mismatches f.

Parameters:

ids (Sequence[int] | Iterable[int] | None)
dtype (Any)
start (int)
stop (int | None)
n_rows (int | None)
id_location (Literal['index', 'column', 'both', 'none'])
id_name (str)
columns (Sequence[str] | None)
validate_vector_len (bool)

Return type:

See also

to_numpy: Dense NumPy export.
to_scipy_csr: Export as SciPy CSR.
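
The column-naming rule for the columns parameter can be sketched as a small resolver. The function name resolve_columns is hypothetical; it follows the documented order: explicit columns, then feature_names_in_ when its length matches f, then generated names.

```python
def resolve_columns(f, columns=None, feature_names_in=None):
    """Hypothetical helper that picks DataFrame column names for f dimensions."""
    if columns is not None:
        columns = list(columns)
        if len(columns) != f:
            raise ValueError(f"expected {f} column names, got {len(columns)}")
        return columns
    if feature_names_in is not None and len(feature_names_in) == f:
        return [str(name) for name in feature_names_in]
    # Fallback: feature_0 .. feature_{f-1}
    return [f"feature_{i}" for i in range(f)]
```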

to_scipy_csr(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#

Export vectors as a SciPy CSR matrix.

Returns:

Xscipy.sparse.csr_matrix: CSR matrix with shape (n_rows, f).

Raises:

ImportError: If SciPy is not installed.

Parameters:

ids (Sequence[int] | Iterable[int] | None)
dtype (Any)
start (int)
stop (int | None)
n_rows (int | None)
validate_vector_len (bool)

Return type: