Annoy#

class scikitplot.annoy.Annoy[source]#

Compiled with GCC/Clang. Using 512-bit AVX instructions.

Approximate Nearest Neighbors index (Annoy) with a small, lazy C-extension wrapper.

>>> Annoy(
>>>     f=None,
>>>     metric=None,
>>>     *,
>>>     n_neighbors=5,
>>>     on_disk_path=None,
>>>     prefault=None,
>>>     seed=None,
>>>     verbose=None,
>>>     schema_version=None,
>>> )

Parameters:

fint or None, optional, default=None

Vector dimension. If 0 or None, the dimension is inferred from the first vector passed to add_item (lazy mode). None is treated as 0 (reset to default).

metric{“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”} or None, optional, default=None

Distance metric. The canonical names are 'angular', 'euclidean', 'manhattan', 'dot', and 'hamming'; the other accepted values are aliases of these. If None, behavior depends on f:

If f > 0: defaults to 'angular' (cosine-like; legacy behavior that may emit a FutureWarning).
If f == 0: leaves the metric unset (lazy). You may set metric later before construction, or it will default to 'angular' on first add_item.

n_neighborsint, default=5

Non-negative integer. Number of neighbors to retrieve for each query.

on_disk_pathstr or None, optional, default=None

If provided, configures the path for on-disk building. When the underlying index exists, this enables on-disk build mode (equivalent to calling on_disk_build with the same filename).

Note: Annoy core truncates the target file when enabling on-disk build. This wrapper treats on_disk_path as strictly equivalent to calling on_disk_build with the same filename (truncate allowed).

In lazy mode (f==0 and/or metric is None), activation occurs once the underlying C++ index is created.

prefaultbool or None, optional, default=None

If True, request page-faulting index pages into memory when loading (when supported by the underlying platform/backing). If None, treated as False (reset to default).

seedint or None, optional, default=None

Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created. Seed value 0 is treated as "use Annoy’s deterministic default seed" (a UserWarning is emitted when 0 is explicitly provided).

verboseint or None, optional, default=None

Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:

<= 0 : quiet (warnings only)
1 : info (Annoy’s verbose=True)
>= 2 : debug (currently same as info, reserved for future use)

schema_versionint, optional, default=None

Serialization/compatibility strategy marker.

This does not change the Annoy on-disk format, but it does control how the index is snapshotted in pickles.

0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable across ABIs; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical (canonical is used as a fallback if the ABI check fails).

If None, treated as 0 (reset to default).
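The mapping from schema_version to what a pickle stores can be summarised in a few lines. This is only a sketch of the rules listed above: the helper name pickle_strategy is hypothetical, and rejecting negative values is an assumption rather than documented behavior.

```python
def pickle_strategy(schema_version):
    """Return the snapshot kinds a pickle would carry (hypothetical helper)."""
    version = 0 if schema_version is None else int(schema_version)  # None -> 0
    if version < 0:
        # Assumption: negative markers are rejected; the docs do not say.
        raise ValueError("schema_version must be non-negative")
    if version <= 1:
        return ("portable",)              # fast restore, ABI-checked
    if version == 2:
        return ("canonical",)             # rebuilds deterministically on restore
    return ("portable", "canonical")      # canonical is the fallback


print(pickle_strategy(None))  # ('portable',)
print(pickle_strategy(2))     # ('canonical',)
print(pickle_strategy(3))     # ('portable', 'canonical')
```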

Attributes:

fint, default=0: Vector dimension.
metric{‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’}, default=”angular”: Distance metric for the index.
n_neighborsint, default=5: Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).
on_disk_pathstr or None, optional, default=None: Path used for on-disk build/load/save operations.
prefaultbool, default=False: Default prefault flag stored on the object.
seedint or None, optional, default=None: Random seed override (scikit-learn compatible).
verboseint or None, optional, default=None: Verbosity level in [-2, 2] or None (unset).
schema_versionint, default=0: Serialization/compatibility strategy marker sentinel value.
n_featuresint: Alias of f (dimension), provided for scikit-learn naming parity.
n_features_out_int: Number of output features produced by transform (SLEP013).
feature_names_in_list-like: Input feature names seen during fit (SLEP007).
ylist-like | None, optional, default=None: Stored label metadata (dense cache), as list[object] | None.
y_mapdict | None, optional, default=None: Stored label metadata (canonical mapping), as dict[int, object] | None.

See also

add_item: Add a vector to the index.
build: Build the forest after adding items.
unbuild: Remove trees to allow adding more items.
get_nns_by_item, get_nns_by_vector: Query nearest neighbours.
save, load: Persist the index to/from disk.
serialize, deserialize: Persist the index to/from bytes.
set_seed: Set the random seed deterministically.
set_verbose: Set verbosity level.
info: Return a structured summary of the current index.

Notes

Once the underlying C++ index is created, f and metric are immutable. This keeps the object consistent and avoids undefined behavior.
The C++ index is created lazily when sufficient information is available: when both f > 0 and metric are known, or when an operation that requires the index is first executed.
If f == 0, the dimensionality is inferred from the first non-empty vector passed to add_item and is then fixed for the lifetime of the index.
Assigning None to f is not supported. Use 0 for lazy inference (this matches Annoy(f=None, ...) at construction time).
If metric is omitted while f > 0, the current behavior defaults to 'angular' and may emit a FutureWarning. To avoid warnings and future behavior changes, always pass metric=... explicitly.
Items must be added before calling build. After build, the index becomes read-only; to add more items, call unbuild, add items again with add_item, then call build again.
Very large indexes can be built directly on disk with on_disk_build and then memory-mapped with load.
info returns a structured summary (dimension, metric, counts, and optional memory usage) suitable for programmatic inspection.
This wrapper stores user configuration (e.g., seed/verbosity) even before the C++ index exists and applies it deterministically upon construction.
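The lifecycle rules in these notes (lazy dimension inference, read-only after build, unbuild before adding more) can be pictured with a toy pure-Python model. LazyIndexModel is a hypothetical stand-in written for illustration; it is not the C++-backed index, and the exception types it raises are assumptions.

```python
class LazyIndexModel:
    """Toy model of the documented lifecycle; not the real index."""

    def __init__(self, f=None):
        self.f = 0 if f is None else int(f)  # None is treated as 0 (lazy)
        self.items = {}
        self.built = False

    def add_item(self, i, vector):
        if self.built:
            # Assumption: the real wrapper may raise a different exception.
            raise RuntimeError("index is read-only after build; call unbuild first")
        vector = [float(x) for x in vector]
        if self.f == 0:
            if not vector:
                raise ValueError("cannot infer the dimension from an empty vector")
            self.f = len(vector)  # inferred once, then fixed
        if len(vector) != self.f:
            raise ValueError(f"expected length {self.f}, got {len(vector)}")
        self.items[i] = vector
        return self  # enables chaining, like the documented methods

    def build(self):
        self.built = True
        return self

    def unbuild(self):
        self.built = False
        return self


model = LazyIndexModel(f=None)
model.add_item(0, [1, 0, 0]).add_item(1, [0, 1, 0]).build()
print(model.f)  # 3
model.unbuild().add_item(2, [0, 0, 1]).build()
print(len(model.items))  # 3
```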

Developer Notes:

Source of truth:
- f (int) and metric_id (enum) describe configuration.
- ptr is NULL when index is not constructed.
Invariant:
- ptr != NULL implies f > 0 and metric_id != METRIC_UNKNOWN.

Examples

>>> from annoy import Annoy, AnnoyIndex

High-level API:

>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index

The lifecycle follows the examples in test.ipynb:

Construct the index

>>> import random; random.seed(0)
>>> # from annoy import AnnoyIndex
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index

>>> idx = Annoy(f=3, metric="angular")
>>> idx.f, idx.metric
(3, 'angular')

If you pass f=0 the dimension can be inferred on the first call to add_item.

Add items

>>> idx.add_item(0, [1.0, 0.0, 0.0])
>>> idx.add_item(1, [0.0, 1.0, 0.0])
>>> idx.add_item(2, [0.0, 0.0, 1.0])
>>> idx.get_n_items()
3

Build the forest

>>> idx.build(n_trees=-1)
>>> idx.get_n_trees()
10
>>> idx.memory_usage()  # byte
543076

After build the index becomes read-only. You can still query, save, load and serialize it.

Query neighbours

By stored item id:

>>> idx.get_nns_by_item(0, 5)
[0, ...]

With distances:

>>> idx.get_nns_by_item(0, 5, include_distances=True)
([0, ...], [0.0, ...])

Or by an explicit query vector:

>>> idx.get_nns_by_vector([0.1, 0.2, 0.3], 5, include_distances=True)
([...], [...])

Persistence

To work with memory-mapped indices on disk:

>>> idx.save("annoy_test.annoy")
>>> idx2 = Annoy(f=3, metric="angular")
>>> idx2.load("annoy_test.annoy")
>>> idx2.get_n_items()
3

Or via raw bytes:

>>> buf = idx.serialize()
>>> new_idx = Annoy(f=3, metric="angular")
>>> new_idx.deserialize(buf)
>>> new_idx.get_n_items()
3

You can release OS resources with unload and drop the current forest with unbuild.

add_item(i, vector)#

Add a single embedding vector to the index.

Parameters:

iint: Item id (index) must be non-negative. Ids may be non-contiguous; the index allocates up to max(i) + 1.
vectorsequence of float: 1D embedding of length f. Values are converted to float. If f == 0 and this is the first item, f is inferred from vector and then fixed for the lifetime of this index.

Returns:

Annoy: This instance (self), enabling method chaining.

See also

build: Build the forest after adding items.
unbuild: Remove trees to allow adding more items.
get_nns_by_item, get_nns_by_vector: Query nearest neighbours.

Notes

Items must be added before calling build. After building the forest, further calls to add_item are not supported.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f=100
>>> n=1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...    v = [random.gauss(0, 1) for _ in range(f)]
...    idx.add_item(i, v)

build(n_trees, n_jobs=-1)#

Build a forest of random projection trees.

Parameters:

n_treesint

Number of trees in the forest. Larger values typically improve recall at the cost of slower build time and higher memory usage.

If n_trees=-1, trees are built dynamically until the total number of nodes reaches roughly twice the number of items (_n_nodes >= 2 * n_items).

Guidelines:

Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.

n_jobsint, optional, default=-1

Number of threads to use while building. -1 means “auto” (use the implementation’s default, typically all available CPU cores).
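The n_trees guidelines above can be encoded as a small lookup when choosing a starting point programmatically. The helper suggest_n_trees is hypothetical and simply restates the documented ranges; the boundary handling at exactly 10k and 1M samples is an assumption.

```python
def suggest_n_trees(n_samples):
    """Return a (low, high) starting range for n_trees (hypothetical helper)."""
    if n_samples < 10_000:
        return (10, 20)    # small datasets
    if n_samples <= 1_000_000:
        return (20, 50)    # medium datasets
    return (50, 100)       # large datasets; the guideline reads "50-100+"


print(suggest_n_trees(5_000))      # (10, 20)
print(suggest_n_trees(250_000))    # (20, 50)
print(suggest_n_trees(5_000_000))  # (50, 100)
```

Treat the result as a range to tune from, since the right value depends on the recall and memory budget of the application.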

Returns:

Annoy: This instance (self), enabling method chaining.

See also

fit: Build the index from X (preferred if you already have X available).
add_item: Add vectors before building.
unbuild: Drop trees to add more items.
rebuild: Return a new Annoy index rebuilt from the current index contents.
on_disk_build: Configure on-disk build mode.
get_nns_by_item, get_nns_by_vector: Query nearest neighbours.
save, load: Persist the index to/from disk.

Notes

After build completes, the index is read-only and can only be queried. To add more items, call unbuild, add the items with add_item, and then call build again.

References

[1] Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f=100
>>> n=1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...    v = [random.gauss(0, 1) for _ in range(f)]
...    idx.add_item(i, v)
>>> idx.build(10)

deserialize(byte, prefault=None)#

Restore the index from a serialized byte string.

Parameters:

bytebytes: Byte string produced by serialize. Both native (legacy) blobs and portable blobs (created with serialize(format='portable')) are accepted; portable and canonical blobs are auto-detected. Canonical blobs restore by rebuilding the index deterministically.
prefaultbool or None, optional, default=None: Accepted for API symmetry with load. If None, the stored prefault value is used. Ignored for canonical blobs.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

IOError: If deserialization fails due to invalid or incompatible data.
RuntimeError: If the index is not initialized.

See also

serialize: Create a binary snapshot of the index.
on_disk_build: Configure on-disk build mode.

Notes

Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
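The idea of a guard header that makes incompatible blobs fail loudly can be illustrated with the standard struct module. The layout, the magic tag, and the names wrap_portable and unwrap_portable below are all hypothetical; the actual header used by serialize(format='portable') may be laid out differently.

```python
import struct
import sys

# Hypothetical layout for this sketch only: magic tag, header version,
# sizeof(int), sizeof(float), little-endian flag, metric id, dimension f.
_HEADER = struct.Struct("<4sBBBBBI")
_MAGIC = b"SKCH"  # made-up tag, not the library's


def _expected_fields(metric_id, f):
    little = 1 if sys.byteorder == "little" else 0
    return (_MAGIC, 1, struct.calcsize("i"), struct.calcsize("f"),
            little, metric_id, f)


def wrap_portable(payload, metric_id, f):
    """Prepend the guard header to a native payload."""
    return _HEADER.pack(*_expected_fields(metric_id, f)) + payload


def unwrap_portable(blob, metric_id, f):
    """Return the payload, or raise IOError if the header does not match."""
    if len(blob) < _HEADER.size:
        raise IOError("blob is shorter than the header")
    found = _HEADER.unpack_from(blob)
    if found != _expected_fields(metric_id, f):
        raise IOError(f"incompatible blob: header {found!r}")
    return blob[_HEADER.size:]


blob = wrap_portable(b"native-bytes", metric_id=1, f=100)
print(unwrap_portable(blob, metric_id=1, f=100))  # b'native-bytes'
```

Note that the payload is passed through untouched, which mirrors the point above: the header guards against mismatches but does not make the snapshot portable across architectures.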

f#

Vector dimension.

Returns:

int: Dimension of each item vector. 0 means unknown / lazy.

Notes

Annoy(f=None, ...) is supported at construction time and is treated as f=0.
0 (or None) means “unknown / lazy”: the first call to add_item will infer f from the input vector length and then fix it.

Changing f after the index has been initialized (items added and/or trees built) is a structural change: the stored items and all tree splits depend on the vector dimension.

For scikit-learn compatibility, assigning a different f (or None) on an already initialized index will deterministically reset the index (drop all items, trees, and label metadata (y_map and y)). You must call fit (or add_item + build) again before querying.

feature_names_in_#: Input feature names seen during fit (SLEP007). Set only when explicitly provided via fit(…, feature_names=…).

fit(X=None, y=None, *, y_map=None, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None)#

Fit the Annoy index (scikit-learn compatible).

This method supports two deterministic workflows:

Manual add/build: If X is None and y is None, fit() builds the forest using items previously added via add_item().
Array-like X: If X is provided (2D array-like), fit() optionally resets or appends, adds all rows as items, then builds the forest.

Parameters:

Xarray-like of shape (n_samples, n_features), default=None

Vectors to add to the index. If None (and y is None), fit() only builds.

yarray-like of shape (n_samples,), default=None

Optional dense labels/targets associated with X. This must be a 1D sequence (dicts are not accepted; use y_map).

y_mapdict[int, object] or None, default=None

Optional sparse mapping {item_id -> label/target}. This is the canonical label metadata storage. Provide only one of y or y_map.

n_treesint, default=-1

Number of trees to build. Use -1 for Annoy’s internal default.

n_jobsint, default=-1

Number of threads to use during build (-1 means “auto”).

resetbool, default=True

If True, clear existing items before adding X. If False, append.

start_indexint or None, default=None

Item id for the first row of X. If None, uses 0 when reset=True, otherwise uses current n_items when reset=False.

missing_valuefloat or None, default=None

If not None, imputes missing entries in X.

Dense rows: replaces None elements with missing_value.
Dict rows: fills missing keys (and None values) with missing_value.

If None, missing entries raise an error (strict mode).
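The start_index default and the missing_value rules can be sketched in plain Python. Both helpers are hypothetical illustrations of the documented behavior; in particular, the assumption that dict rows are keyed by integer column index, and the choice of ValueError in strict mode, are not confirmed by the text above.

```python
def resolve_start_index(start_index, reset, n_items):
    """Item id for the first row of X, following the documented default."""
    if start_index is not None:
        return start_index
    return 0 if reset else n_items


def impute_row(row, n_features, missing_value=None):
    """Return a dense list of floats, filling gaps with missing_value.

    Assumption: dict rows map an integer column index to a value.
    """
    if isinstance(row, dict):
        values = [row.get(j) for j in range(n_features)]  # missing keys -> None
    else:
        values = list(row)
    dense = []
    for value in values:
        if value is None:
            if missing_value is None:
                # Strict mode; the real exception type is an assumption.
                raise ValueError("missing entry and no missing_value was given")
            value = missing_value
        dense.append(float(value))
    return dense


print(resolve_start_index(None, reset=False, n_items=42))     # 42
print(impute_row([1.0, None, 3.0], 3, missing_value=0.0))     # [1.0, 0.0, 3.0]
print(impute_row({0: 1.0, 2: None}, 3, missing_value=-1.0))   # [1.0, -1.0, -1.0]
```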

Returns:

Annoy: This instance (self), enabling method chaining.

See also

fit_transform: Estimator-style APIs.
transform: Query the built index.
add_item: Add one item at a time.
build: Build the forest after manual calls to add_item.
on_disk_build: Configure on-disk build mode.
unbuild: Remove trees so items can be appended.
y_map, y: Stored label metadata (canonical mapping and dense cache).
get_params, set_params: Estimator parameter API.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx = AnnoyIndex().set_params(metric=m).fit(X)
...     print(m, idx.transform(q))
...
>>> base = AnnoyIndex().fit(X)
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx_m = base.rebuild(metric=m)  # rebuild-from-index
...     print(m, idx_m.transform(q))  # no .fit(X) here

fit_transform(X, y=None, *, y_map=None, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None, n_neighbors=None, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None)#

Fit the index and transform X in a single deterministic call.

This is equivalent to calling fit followed by transform.

Parameters:

Xarray-like: Training data / queries. See fit and transform.
yarray-like of shape (n_samples,), default=None: Optional dense labels/targets aligned to rows of X.
y_mapdict[int, object] or None, default=None: Optional sparse mapping {item_id -> label/target}. Provide only one of y or y_map.

Returns:

neighborsobject: The output of transform for X under the provided query options.

Raises:

TypeError: If both y and y_map are provided, or if input types are invalid.
ValueError: If the index cannot be built deterministically or query options are inconsistent.

See also

fit: Build the index from X (preferred if you already have X available).
transform: Query the built index.
on_disk_build: Configure on-disk build mode.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     print(m, AnnoyIndex().set_params(metric=m).fit_transform(q))

get_distance(i, j) → float#

Return the distance between two stored items.

Parameters:

i, jint: Item ids (index) of two stored samples.

Returns:

dfloat: Distance between items i and j under the current metric.

Raises:

RuntimeError: If the index is not initialized.
IndexError: If either index is out of range.
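For reference, the distances behind four of the metrics can be written out in pure Python. This sketch assumes the wrapper keeps upstream Annoy's conventions, where 'angular' is reported as sqrt(2 - 2*cos(u, v)); the function names are illustrative only, and 'dot' is omitted because its sign convention is not spelled out here.

```python
import math


def angular(u, v):
    """sqrt(2 - 2*cos), the upstream Annoy convention (assumed here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    if norms == 0.0:
        return math.sqrt(2.0)  # upstream treatment of a zero vector (assumption)
    return math.sqrt(max(0.0, 2.0 - 2.0 * dot / norms))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


print(round(angular([1, 0, 0], [0, 1, 0]), 4))  # 1.4142
print(euclidean([0, 0], [3, 4]))                # 5.0
print(manhattan([0, 0], [3, 4]))                # 7
print(hamming([1, 0, 1], [1, 1, 0]))            # 2
```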

get_feature_names_out(input_features=None)#

Get output feature names for the transformer-style API.

Parameters:

input_featuressequence of str or None, optional, default=None: If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.

Returns:

tuple of str: Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.

Raises:

AttributeError: If called before fit/build.
ValueError: If input_features is provided but does not match feature_names_in_.
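The naming scheme for the output features is simple enough to reproduce directly. The helper below is a hypothetical stand-in that only mirrors the documented pattern; it performs none of the validation described above.

```python
def neighbor_feature_names(n_neighbors):
    """Build ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    return tuple(f"neighbor_{k}" for k in range(n_neighbors))


print(neighbor_feature_names(3))  # ('neighbor_0', 'neighbor_1', 'neighbor_2')
```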

get_item_vector(i) → list[float]#

Return the stored embedding vector for a given item id.

Parameters:

iint: Item id (index) previously passed to add_item.

Returns:

vectorlist[float]: Stored embedding of length f.

Raises:

RuntimeError: If the index is not initialized.
IndexError: If i is out of range.

get_n_items() → int#

Return the number of stored items in the index.

Returns:

n_itemsint: Number of items that have been added and are currently addressable.

Raises:

RuntimeError: If the index is not initialized.

get_n_trees() → int#

Return the number of trees in the current forest.

Returns:

n_treesint: Number of trees that have been built.

Raises:

RuntimeError: If the index is not initialized.

get_nns_by_item(i, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a stored item id.

Parameters:

iint: Item id (index) previously passed to add_item(i, embedding).
nint: Number of nearest neighbours to return.
search_kint, optional, default=-1: Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
include_distancesbool, optional, default=False: If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:

indiceslist[int] | tuple[list[int], list[float]]: If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:

RuntimeError: If the index is not initialized or has not been built.
IndexError: If i is out of range.

See also

get_nns_by_vector: Query with an explicit query embedding.
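When tuning search_k, it helps to know the default budget and to measure recall against an exact result. Both helpers below are hypothetical: effective_search_k restates the documented approximation n_trees * n, and recall_at_k is a generic metric, not part of this API.

```python
def effective_search_k(search_k, n, n_trees):
    """Approximate node budget when search_k is left at -1."""
    return n * n_trees if search_k == -1 else search_k


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate query found."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact.intersection(approx_ids)) / len(exact)


print(effective_search_k(-1, n=10, n_trees=20))  # 200
print(recall_at_k([3, 7, 9, 1], [3, 7, 8, 1]))   # 0.75
```

The exact ids would typically come from a brute-force scan over a small sample of queries.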

get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a query embedding.

Parameters:

vectorsequence of float: Query embedding of length f.
nint: Number of nearest neighbours to return.
search_kint, optional, default=-1: Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
include_distancesbool, optional, default=False: If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:

indiceslist[int] | tuple[list[int], list[float]]: If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:

RuntimeError: If the index is not initialized or has not been built.
ValueError: If len(vector) != f.

See also

get_nns_by_item: Query by stored item id.

get_params(deep=True) → dict#

Return estimator-style parameters (scikit-learn compatibility).

Parameters:

deepbool, optional, default=True: Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.

Returns:

paramsdict: Dictionary of stable, user-facing parameters.

See also

set_params: Set estimator-style parameters.
schema_version: Controls pickle / snapshot strategy.

Notes

This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.

info(include_n_items=True, include_n_trees=True, include_memory=None) → dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Parameters:

include_n_itemsbool, optional, default=True

If True, include n_items.

include_n_treesbool, optional, default=True

If True, include n_trees.

include_memorybool or None, optional, default=None

Controls whether memory usage fields are included.

None: include memory usage only if the index is built.
True: include memory usage if available (built).
False: omit memory usage fields.

Memory usage is computed after build and may be expensive for very large indexes.

Returns:

infodict: Dictionary describing the current index state.

See also

serialize: Create a binary snapshot of the index.
deserialize: Restore from a binary snapshot.
save: Persist the index to disk.
load: Load the index from disk.

Notes

Some keys are optional depending on include_* flags.

Keys:

fint, default=0
Dimensionality of the index.
metricstr, default=’angular’
Distance metric name.
on_disk_pathstr, default=’’
Path used for on-disk build, if configured.
prefaultbool, default=False
Default prefault flag stored on the object. If True, pages are faulted into memory when an index is loaded. Primarily useful on some platforms for very large indexes.
schema_versionint, default=0
Stored schema/version marker on this object (reserved for future use).
seedint or None, optional, default=None
Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.
verboseint or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:
- <= 0 : quiet (warnings only)
- 1 : info (Annoy’s verbose=True)
- >= 2 : debug (currently same as info, reserved for future use)

Optional Keys:

n_itemsint
Number of items currently stored.
n_treesint
Number of built trees in the forest.
memory_usage_byteint
Approximate memory usage in bytes. Present only when requested and available.
memory_usage_mibfloat
Approximate memory usage in MiB. Present only when requested and available.
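How the include_* flags shape the returned dictionary, and how the MiB figure relates to the byte count, can be sketched as follows. info_model is a hypothetical function that covers only a subset of the keys, and it treats "built" as n_trees > 0, which is an assumption.

```python
def info_model(f, metric, n_items, n_trees, memory_bytes=None,
               include_n_items=True, include_n_trees=True, include_memory=None):
    """Assemble an info-like dict from plain values (illustrative only)."""
    info = {"f": f, "metric": metric}
    if include_n_items:
        info["n_items"] = n_items
    if include_n_trees:
        info["n_trees"] = n_trees
    built = n_trees > 0  # assumption: "built" means at least one tree
    if include_memory is not False and built and memory_bytes is not None:
        info["memory_usage_byte"] = memory_bytes
        info["memory_usage_mib"] = memory_bytes / (1024 * 1024)
    return info


summary = info_model(100, "angular", 1000, 10, memory_bytes=1_048_576)
print(summary["memory_usage_mib"])  # 1.0
print(info_model(100, "angular", 1000, 10, include_n_trees=False))
# {'f': 100, 'metric': 'angular', 'n_items': 1000}
```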

Examples

>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000

load(fn, prefault=None)#

Load (mmap) an index from disk into the current object.

Parameters:

fnstr: Path to a file previously created by save or on_disk_build.
prefaultbool or None, optional, default=None: If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

IOError: If the file cannot be opened or mapped.
RuntimeError: If the index is not initialized or the file is incompatible.

See also

save: Save the current index to disk.
on_disk_build: Build directly using an on-disk backing file.
unload: Release mmap resources.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.

memory_usage() → int#

Approximate memory usage of the index in bytes.

Returns:

n_bytesint or None: Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.

Raises:

RuntimeError: If memory usage cannot be computed.

metric#

Distance metric for the index. Valid values:

‘angular’ -> Cosine-like distance on normalized vectors.
‘euclidean’ -> L2 distance.
‘manhattan’ -> L1 distance.
‘dot’ -> Negative dot-product distance (inner product).
‘hamming’ -> Hamming distance for binary vectors.

Aliases (case-insensitive):

angular : cosine
euclidean : l2, lstsq
manhattan : l1, cityblock, taxicab
dot : @, ., dotproduct, inner, innerproduct
hamming : hamming
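The alias table above amounts to a case-insensitive lookup onto five canonical names. The sketch below rebuilds that lookup in plain Python; canonical_metric is a hypothetical helper, and raising ValueError for unknown names is an assumption.

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Flatten to alias -> canonical name.
_LOOKUP = {alias: canon for canon, names in _ALIASES.items() for alias in names}


def canonical_metric(name):
    """Map an alias to its canonical metric name; None stays unset."""
    if name is None:
        return None
    try:
        return _LOOKUP[name.lower()]
    except KeyError:
        # Assumption: the real wrapper may report unknown metrics differently.
        raise ValueError(f"unknown metric: {name!r}") from None


print(canonical_metric("L2"))        # euclidean
print(canonical_metric("cityblock")) # manhattan
print(canonical_metric("@"))         # dot
```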

Returns:

str or None: Canonical metric name, or None if not configured yet.

Notes

Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.

For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index (drop all items, trees, and label metadata (y_map and y)). You must call fit (or add_item + build) again before querying.

n_features#: Alias of f (dimension), provided for scikit-learn naming parity.

n_features_#: Read-only alias of n_features_in_.

n_features_in_#: Number of features seen during fit (scikit-learn compatible). Alias of f when available.

n_features_out_#: Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.

n_neighbors#: Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_build(fn)#

Configure the index to build using an on-disk backing file.

Parameters:

fnstr: Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:

Annoy: This instance (self), enabling method chaining.

See also

build: Build trees after adding items (on-disk backed).
rebuild: Return a new Annoy index rebuilt from the current index contents.
fit: Build the index from X (preferred if you already have X available).
load: Memory-map the built index.
save: Persist the built index to disk.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

on_disk_path#

Path used for on-disk build/load/save operations.

Returns:

str or None: Filesystem path used for on-disk operations, or None if not configured.

See also

on_disk_build
load
unload

Notes

Assigning a string/PathLike to on_disk_path configures on-disk build mode (equivalent to calling on_disk_build with the same filename).
Note: Annoy core truncates the target file when enabling on-disk build. on_disk_path is strictly equivalent to calling on_disk_build with the same filename (truncate allowed).
Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.
Clearing/changing this while an on-disk index is active is disallowed. Call unload first.

prefault#

Default prefault flag stored on the object.

This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.

Returns:

bool: Current prefault flag.

Notes

This flag does not retroactively change already-loaded mappings.
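The fallback from a per-call prefault argument to the stored flag is a one-line rule. The helper name resolve_prefault is hypothetical and only restates that rule.

```python
def resolve_prefault(call_value, stored_value):
    """Per-call value wins; None falls back to the stored flag."""
    stored = False if stored_value is None else bool(stored_value)
    return stored if call_value is None else bool(call_value)


print(resolve_prefault(None, True))    # True  (stored flag used)
print(resolve_prefault(False, True))   # False (explicit argument wins)
print(resolve_prefault(None, None))    # False (unset is treated as False)
```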

random_state#: Alias of seed (scikit-learn convention).

rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) → Annoy#

Return a new Annoy index rebuilt from the current index contents.

This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.

Parameters:

metric{‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’} or None, optional

Metric for the new index. If None, reuse the current metric.

on_disk_pathpath-like or None, optional

Optional on-disk build path for the new index.

Safety: the source object’s on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source’s configured path, it is ignored to avoid accidental overwrite/truncation hazards.

n_treesint or None, optional

If provided, build the new index with this number of trees (or -1 for Annoy’s internal auto mode). If None, reuse the source’s tree count only when the source index is already built; otherwise do not build.

n_jobsint, optional, default=-1

Number of threads to use while building (-1 means “auto”).
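The two option-resolution rules described for rebuild (the on_disk_path safety check and the n_trees default) can be modelled as below. Both helpers are hypothetical, and treating a source tree count of 0 as "not built" is an assumption.

```python
def resolve_rebuild_path(requested, source_path):
    """Return the path the new index should use, or None for in-memory."""
    if requested is None:
        return None  # the source path is never carried over implicitly
    if source_path is not None and str(requested) == str(source_path):
        return None  # same file as the source: ignored to avoid truncation
    return str(requested)


def resolve_rebuild_n_trees(requested, source_n_trees):
    """Return the tree count to build with, or None meaning 'do not build'."""
    if requested is not None:
        return requested  # includes -1 for the automatic mode
    return source_n_trees if source_n_trees > 0 else None


print(resolve_rebuild_path("a.annoy", "a.annoy"))  # None
print(resolve_rebuild_path("b.annoy", "a.annoy"))  # b.annoy
print(resolve_rebuild_n_trees(None, 10))           # 10
print(resolve_rebuild_n_trees(None, 0))            # None
```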

Returns:

Annoy: A new Annoy instance containing the same items (and label metadata if present).

See also

build: Build trees after adding items (on-disk backed).
on_disk_build: Configure on-disk build mode.
fit: Build the index from X (preferred if you already have X available).
get_params: Read constructor parameters.
set_params: Update estimator parameters (use with fit(X) when refitting from data).
serialize, deserialize: Persist / restore indexes; canonical restores rebuild deterministically.
__sklearn_clone__: Unfitted clone hook (no fitted state).

Notes

rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.

Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).

repr_info(include_n_items=True, include_n_trees=True, include_memory=None) → str#

Return a dict-like string representation with optional extra fields.

Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory is calculated after build.

save(fn, prefault=None)#

Persist the index to a binary file on disk.

Parameters:

fnstr: Path to the output file. Existing files will be overwritten.
prefaultbool or None, optional, default=None: If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

IOError: If the file cannot be written.
RuntimeError: If the index is not initialized or save fails.

See also

load: Load an index from disk.
on_disk_build: Configure on-disk build mode.
serialize: Snapshot to bytes for in-memory persistence.
deserialize: Restore an index from a serialized byte string.

Notes

The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.

schema_version#

Serialization/compatibility strategy marker sentinel value.

This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.

Returns:

int: Current schema version marker.

Notes

0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical; canonical is used as a fallback.

seed#: Random seed override (scikit-learn compatible). None means use Annoy default seed.

serialize(format=None) → bytes#

Serialize the built in-memory index into a byte string.

Parameters:

format{“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format.

“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:

databytes: Opaque binary blob containing the Annoy index.

Raises:

RuntimeError: If the index is not initialized or serialization fails.
OverflowError: If the serialized payload is too large to fit in a Python bytes object.

See also

deserialize: Restore an index from a serialized byte string.
on_disk_build: Configure on-disk build mode.

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.
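To illustrate the idea behind a "portable" blob (a native payload guarded by a compatibility header), here is a minimal sketch using the standard `struct` module. The header layout, the `MAGIC` tag, and the function names `wrap_portable` / `unwrap_portable` are all hypothetical; the real header written by `serialize` is not documented here and will differ.

```python
import struct
import sys

MAGIC = b"ANNX"  # hypothetical tag, not the library's real magic bytes
# Hypothetical layout: magic, version, little-endian flag, sizeof(float), f, metric
HEADER = struct.Struct("<4sHBBI16s")


def _host_is_little() -> int:
    return 1 if sys.byteorder == "little" else 0


def wrap_portable(native: bytes, *, f: int, metric: str, version: int = 1) -> bytes:
    """Prepend a compatibility header to an opaque native payload."""
    header = HEADER.pack(
        MAGIC, version, _host_is_little(), struct.calcsize("f"), f,
        metric.encode("ascii"),  # padded with NUL bytes to 16 bytes by struct
    )
    return header + native


def unwrap_portable(blob: bytes, *, f: int, metric: str, version: int = 1) -> bytes:
    """Validate the header and return the payload; fail loudly on mismatch."""
    if len(blob) < HEADER.size:
        raise ValueError("blob is too short to contain a header")
    magic, ver, little, float_size, blob_f, raw_metric = HEADER.unpack_from(blob)
    found = (magic, ver, little, float_size, blob_f,
             raw_metric.rstrip(b"\0").decode("ascii"))
    expected = (MAGIC, version, _host_is_little(), struct.calcsize("f"), f, metric)
    if found != expected:
        raise ValueError(f"incompatible blob: expected {expected}, found {found}")
    return blob[HEADER.size:]
```

The header only detects mismatches; it does not convert the payload, which is why such a blob is not a cross-architecture wire format.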

set_params(**params) → Annoy#

Set estimator-style parameters (scikit-learn compatibility).

Parameters:

**params: Keyword parameters to set. Unknown keys raise ValueError.

Returns:

Annoy: This instance (self), enabling method chaining.

Raises:

ValueError: If an unknown parameter name is provided.
TypeError: If parameter names are not strings or types are invalid.

See also

get_params: Return estimator-style parameters.

Notes

Changing structural parameters (notably metric) on an already initialized index resets the index deterministically: all items, trees, and label metadata (y_map and y) are dropped. Refit/rebuild is required before querying.

This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.
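The reset-on-change behavior described in these notes can be modeled with a toy class. `ToyIndex` is a hypothetical stand-in written for illustration, not the real Annoy wrapper; treating `f` as structural alongside `metric` is an assumption.

```python
class ToyIndex:
    """Toy stand-in for illustration only; not the real Annoy class."""

    # Assumption: f is treated as structural alongside metric.
    _STRUCTURAL = ("f", "metric")

    def __init__(self, f=0, metric=None, n_neighbors=5):
        self.f = f
        self.metric = metric
        self.n_neighbors = n_neighbors
        self._reset_state()

    def _reset_state(self):
        # Learned state: items, trees, and label metadata.
        self.items = {}
        self.n_trees = 0
        self.y_map = None

    def get_params(self):
        return {"f": self.f, "metric": self.metric, "n_neighbors": self.n_neighbors}

    def set_params(self, **params):
        current = self.get_params()
        unknown = sorted(set(params) - set(current))
        if unknown:
            raise ValueError(f"unknown parameter(s): {unknown}")
        # A reset is needed only if a structural parameter actually changes.
        must_reset = any(
            name in self._STRUCTURAL and params[name] != current[name]
            for name in params
        )
        for name, value in params.items():
            setattr(self, name, value)
        if must_reset:
            self._reset_state()
        return self  # enables method chaining
```

In this model, changing `n_neighbors` leaves the stored items alone, while changing `metric` empties the index so that a refit is required.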

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:

seedint or None, optional, default=None

Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

If omitted (or None), the seed is set to Annoy’s default seed.
If 0, any pending override is cleared and the seed is reset to Annoy’s default seed (a UserWarning is emitted).

Returns:

Annoy: This instance (self), enabling method chaining.

See also

seed: Parameter attribute (int | None).

Notes

Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.
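The seed rules for set_seed (None means default, 0 resets with a warning) can be sketched as a normalization step. `normalize_seed` is a hypothetical helper, not part of the library; the TypeError and ValueError branches are assumptions, since this entry lists no Raises section.

```python
import warnings


def normalize_seed(seed):
    """Return the override to store, or None to use the core default seed."""
    if seed is None:
        return None
    # Assumption: booleans and non-integers are rejected.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int or None")
    # Assumption: negative seeds are rejected ("non-negative integer seed").
    if seed < 0:
        raise ValueError("seed must be non-negative")
    if seed == 0:
        # 0 clears any pending override and falls back to the default seed.
        warnings.warn(
            "seed=0 resets to the default seed", UserWarning, stacklevel=2
        )
        return None
    return seed
```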

set_verbose(verbosity=1)#

Set the verbosity level (callable setter).

This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.

Parameters:

verbosityint, optional, default=1

Verbosity level. Values are clamped to the range [-2, 2]. verbosity >= 1 enables Annoy’s verbose logging; verbosity <= 0 disables it. The levels follow the convention used by gradient-boosting libraries:

<= 0 : quiet (warnings only)
1 : info (Annoy’s verbose=True)
>= 2 : debug (currently same as info, reserved for future use)

Returns:

Annoy: This instance (self), enabling method chaining.

See also

verbose: Parameter attribute (int | None).
set_verbosity: Alias of set_verbose.
get_params, set_params: Estimator parameter API.
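The clamping rule for set_verbose can be expressed in a few lines. `resolve_verbosity` is a hypothetical name used only for this sketch; it returns the clamped level together with the on/off flag that would be handed to Annoy's core logging.

```python
from __future__ import annotations


def resolve_verbosity(verbosity: int) -> tuple[int, bool]:
    """Clamp to [-2, 2] and derive whether verbose logging is enabled."""
    level = max(-2, min(2, int(verbosity)))
    # Levels >= 1 enable logging; levels <= 0 keep the index quiet.
    return level, level >= 1
```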

set_verbosity(level=1)#

Alias of set_verbose.

See also

verbose: Parameter attribute (int | None).
set_verbose: Set the verbosity level (callable setter).

transform(X, *, n_neighbors=5, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None, input_type='vector', output_type='vector', exclude_self=False, exclude_items=None, missing_value=None)#

Transform queries into nearest-neighbor results (ids or vectors; optional distances / labels).

Parameters:

Xarray-like

Query inputs. The expected shape/type depends on input_type:

input_type=’item’ : X must be a 1D sequence of item ids.
input_type=’vector’: X must be a 2D array-like of shape (n_queries, f).

n_neighborsint or None, default=5

Number of neighbors to retrieve for each query. For backwards compatibility this keyword is accepted, but it must match the estimator parameter n_neighbors (STRICT schema).

search_kint, default=-1

Search parameter passed to Annoy (-1 uses Annoy’s default).

include_distancesbool, default=False

If True, also return per-neighbor distances.

return_labelsbool, default=False

If True, also return per-neighbor labels resolved from y_map (or y cache).

y_fill_valueobject, default=None

Value used when label metadata is unset or missing an entry for a neighbor id.

input_type{‘vector’, ‘item’}, default=’vector’

Controls how X is interpreted.

output_type{‘vector’, ‘item’}, default=’vector’

Controls what neighbors are returned.

output_type=’item’: return neighbor ids.
output_type=’vector’: return neighbor vectors.

exclude_selfbool, default=False

If True, exclude the query item id from results. Only valid when input_type=’item’.

exclude_itemssequence of int or None, default=None

Explicit neighbor ids to exclude from results.

missing_valuefloat or None, default=None

If not None, imputes missing entries in X (None values in dense rows; missing keys / None values in dict rows). If None, missing entries raise.
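As a rough model of how missing_value could apply to a single query row, consider the sketch below. `impute_row` is a hypothetical helper; in particular, the assumption that dict rows are keyed by dimension index 0..f-1 is not confirmed by the text above.

```python
def impute_row(row, f, missing_value=None):
    """Return a dense list of f floats, filling gaps with missing_value."""
    if isinstance(row, dict):
        # Assumption: dict rows map dimension index -> value.
        values = [row.get(i) for i in range(f)]
    else:
        values = list(row)
        if len(values) != f:
            raise ValueError(f"expected {f} entries, got {len(values)}")
    dense = []
    for position, value in enumerate(values):
        if value is None:
            if missing_value is None:
                # No imputation requested: missing entries are an error.
                raise ValueError(f"missing entry at position {position}")
            value = missing_value
        dense.append(float(value))
    return dense
```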

Returns:

neighborslist: Neighbor results for each query.
output_type=’item’: list of list of int
output_type=’vector’: list of list of list of float
(neighbors, distances)tuple: Returned when include_distances=True.
(neighbors, labels)tuple: Returned when return_labels=True.
(neighbors, distances, labels)tuple: Returned when include_distances=True and return_labels=True.

See also

get_nns_by_item: Neighbor search by item id.
get_nns_by_vector: Neighbor search by query vector.
fit: Build the index from X (preferred if you already have X available).
fit_transform: Estimator-style APIs.

Notes

Excluding self is performed by matching neighbor ids to the query id (not by checking distance values).
For input_type=’vector’, exclude_self=True is an error; use exclude_items for explicit, deterministic filtering.
If exclusions prevent returning exactly n_neighbors results, this method raises ValueError.
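The exclusion rules in these notes amount to id-based filtering followed by a strict count check. The sketch below models that on a plain candidate list; `filter_neighbors` is a hypothetical helper, and how the real method obtains its candidates (for example by over-fetching) is not specified here.

```python
def filter_neighbors(candidates, n_neighbors, *, query_id=None, exclude_items=None):
    """Drop excluded ids, then require exactly n_neighbors results.

    candidates: neighbor ids ordered from nearest to farthest.
    query_id: id to drop when excluding self (item queries only).
    """
    excluded = set(exclude_items or ())
    if query_id is not None:
        # Self-exclusion matches on id, never on a zero distance.
        excluded.add(query_id)
    kept = [item for item in candidates if item not in excluded]
    if len(kept) < n_neighbors:
        raise ValueError(
            f"only {len(kept)} neighbors left after exclusions; "
            f"{n_neighbors} required"
        )
    return kept[:n_neighbors]
```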

Examples

Item queries (exclude the query id itself):

>>> idx.transform([10, 20], input_type='item', output_type='item', n_neighbors=5, exclude_self=True)

Vector queries (exclude explicit ids):

>>> idx.transform(X_query, input_type='vector', output_type='item', n_neighbors=5, exclude_items=[10, 20])

Return neighbor vectors:

>>> idx.transform([10], input_type='item', output_type='vector', n_neighbors=5, exclude_self=True)

unbuild()#

Discard the current forest, allowing new items to be added.

Returns:

Annoy: This instance (self), enabling method chaining.

See also

build: Rebuild the forest after adding new items.
rebuild: Return a new Annoy index rebuilt from the current index contents.
fit: Build the index from X (preferred if you already have X available).
add_item: Add items (only valid when no trees are built).

Notes

After calling unbuild, you must call build again before running nearest-neighbor queries.

unload()#

Unmap any memory-mapped file backing this index.

Returns:

Annoy: This instance (self), enabling method chaining.

See also

load: Memory-map an on-disk index into this object.
on_disk_build: Configure on-disk build mode.

Notes

This releases OS-level resources associated with the mmap, but keeps the Python object alive.

verbose#

int | None: Verbosity level in [-2, 2] or None (unset). Callable setter: set_verbose().

y#

list[object] | None: Dense labels/targets aligned to item ids 0..n_items-1.

Returns:

ylist[object] | None: A Python list of length n_items (missing labels are None), or None if no label metadata is available.

Raises:

TypeError: If assigned a dict. Use y_map for dict mappings.
ValueError: If assigned a sequence whose length does not match n_items when the index already contains items.

See also

y_map: Canonical sparse mapping of labels by item id.
fit, fit_transform: Set labels while fitting.

Notes

y_map is the canonical storage. y is a convenience cache that may be cleared and materialized from y_map on demand.
Setting y replaces y_map deterministically.
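The relationship between the sparse y_map and the dense y cache can be sketched as follows. `materialize_y` is a hypothetical helper that models the "materialized from y_map on demand" step; it is not the wrapper's code.

```python
def materialize_y(y_map, n_items):
    """Build the dense label list aligned to item ids 0..n_items-1."""
    if y_map is None:
        return None  # no label metadata available
    # Ids without an entry in the sparse mapping become None.
    return [y_map.get(item_id) for item_id in range(n_items)]
```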


y_map#

dict[int, object] | None: Sparse mapping {item_id -> label/target}.

Returns:

y_mapdict[int, object] | None: Mapping of labels keyed by item id, or None if unset.

Raises:

TypeError: If assigned a non-dict (other than None).
ValueError: If any key is negative, not an integer, or (when the index already contains items) out of range.

See also

y: Dense cache aligned to item ids.
fit, fit_transform: Set labels while fitting.
transform: Use return_labels=True to return labels.

Notes

This is the canonical label metadata storage.
Missing keys imply “no label”; when y is materialized, missing ids become None.
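The validation rules listed under Raises for y_map can be modeled with a small checker. `validate_y_map` is a hypothetical name, and rejecting bool keys is an assumption made here because bool is a subclass of int in Python.

```python
def validate_y_map(y_map, n_items=0):
    """Return a copy of y_map after checking the rules described above."""
    if y_map is None:
        return None
    if not isinstance(y_map, dict):
        raise TypeError("y_map must be a dict or None")
    for key in y_map:
        # Assumption: bool keys are rejected even though bool subclasses int.
        if isinstance(key, bool) or not isinstance(key, int):
            raise ValueError(f"key {key!r} is not an integer")
        if key < 0:
            raise ValueError(f"key {key} is negative")
        # The range check applies only once the index contains items.
        if n_items > 0 and key >= n_items:
            raise ValueError(f"key {key} is out of range for {n_items} items")
    return dict(y_map)
```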


Gallery examples#

annoy.Annoy legacy c-api with examples

annoy.Index python-api with examples
