Annoy#

class scikitplot.annoy.Annoy#

Compiled with GCC/Clang. Using 512-bit AVX instructions.

Approximate Nearest Neighbors index (Annoy) with a small, lazy C-extension wrapper.

>>> Annoy(
>>>     f=None,
>>>     metric=None,
>>>     *,
>>>     n_neighbors=5,
>>>     on_disk_path=None,
>>>     prefault=None,
>>>     seed=None,
>>>     verbose=None,
>>>     schema_version=None,
>>> )
Parameters:
f : int or None, optional, default=None

Vector dimension. If 0 or None, the dimension is inferred from the first vector passed to add_item (lazy mode); None is treated as 0.

metric : {“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”} or None, optional, default=None

Distance metric (one of ‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’, or one of the aliases listed above). If None, behavior depends on f:

  • If f > 0: defaults to 'angular' (legacy behavior; may emit a FutureWarning).

  • If f == 0: leaves the metric unset (lazy). You may set metric later before construction, or it will default to 'angular' on first add_item.

n_neighbors : int, default=5

Non-negative integer. Number of neighbors to retrieve for each query.

on_disk_path : str or None, optional, default=None

If provided, configures the path for on-disk building. When the underlying index exists, this enables on-disk build mode (equivalent to calling on_disk_build with the same filename).

Note: Annoy core truncates the target file when enabling on-disk build. This wrapper treats on_disk_path as strictly equivalent to calling on_disk_build with the same filename (truncate allowed).

In lazy mode (f==0 and/or metric is None), activation occurs once the underlying C++ index is created.

prefault : bool or None, optional, default=None

If True, request page-faulting index pages into memory when loading (when supported by the underlying platform/backing). If None, treated as False (reset to default).

seed : int or None, optional, default=None

Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created. Seed value 0 is treated as "use Annoy’s deterministic default seed" (a UserWarning is emitted when 0 is explicitly provided).

verbose : int or None, optional, default=None

Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

schema_version : int or None, optional, default=None

Serialization/compatibility strategy marker.

This does not change the Annoy on-disk format, but it does control how the index is snapshotted in pickles.

  • 0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).

  • 2: pickle stores canonical-v1 (portable across ABIs; restores by rebuilding deterministically).

  • >=3: pickle stores both portable and canonical (canonical is used as a fallback if the ABI check fails).

If None, treated as 0 (reset to default).
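The None-handling and clamping rules described for the constructor parameters can be sketched in plain Python. This is only an illustration of the documented behavior, not the wrapper's code: the function name `normalize_config`, the returned dictionary layout, and the `ValueError` for a negative `f` are assumptions.

```python
def normalize_config(f=None, prefault=None, verbose=None, schema_version=None):
    """Toy normalization of constructor arguments (hypothetical helper)."""
    f = 0 if f is None else int(f)  # None is treated as 0 (lazy mode)
    if f < 0:
        raise ValueError("f must be non-negative")  # assumption: error type
    prefault = False if prefault is None else bool(prefault)  # None -> False
    if verbose is not None:
        verbose = max(-2, min(2, int(verbose)))  # clamp to [-2, 2]
    schema_version = 0 if schema_version is None else int(schema_version)
    return {
        "f": f,
        "prefault": prefault,
        "verbose": verbose,
        # level >= 1 enables Annoy's verbose logging; level <= 0 disables it
        "annoy_verbose": verbose is not None and verbose >= 1,
        "schema_version": schema_version,
    }


cfg = normalize_config(f=None, verbose=7)
print(cfg["f"], cfg["verbose"], cfg["annoy_verbose"])
```

The sketch keeps `verbose=None` as "unset" rather than mapping it to a level, matching the attribute description below.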

Attributes:
f : int, default=0

Vector dimension.

metric : {‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’}, default=‘angular’

Distance metric for the index.

n_neighbors : int, default=5

Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_path : str or None, optional, default=None

Path used for on-disk build/load/save operations.

prefault : bool, default=False

Default prefault flag stored on the object.

seed : int or None, optional, default=None

Random seed override (scikit-learn compatible).

verbose : int or None, optional, default=None

Verbosity level in [-2, 2] or None (unset).

schema_version : int, default=0

Serialization/compatibility strategy marker sentinel value.

n_features : int

Alias of f (dimension), provided for scikit-learn naming parity.

n_features_out_ : int

Number of output features produced by transform (SLEP013).

feature_names_in_ : list-like

Input feature names seen during fit (SLEP007).

y : dict | None, optional, default=None

Labels / targets associated with the index items.

See also

add_item

Add a vector to the index.

build

Build the forest after adding items.

unbuild

Remove trees to allow adding more items.

get_nns_by_item, get_nns_by_vector

Query nearest neighbours.

save, load

Persist the index to/from disk.

serialize, deserialize

Persist the index to/from bytes.

set_seed

Set the random seed deterministically.

verbose

Set verbosity level.

info

Return a structured summary of the current index.

Notes

  • Once the underlying C++ index is created, f and metric are immutable. This keeps the object consistent and avoids undefined behavior.

  • The C++ index is created lazily when sufficient information is available: when both f > 0 and metric are known, or when an operation that requires the index is first executed.

  • If f == 0, the dimensionality is inferred from the first non-empty vector passed to add_item and is then fixed for the lifetime of the index.

  • Assigning None to f is not supported. Use 0 for lazy inference (this matches Annoy(f=None, ...) at construction time).

  • If metric is omitted while f > 0, the current behavior defaults to 'angular' and may emit a FutureWarning. To avoid warnings and future behavior changes, always pass metric=... explicitly.

  • Items must be added before calling build. After build, the index becomes read-only; to add more items, call unbuild, add items again with add_item, then call build again.

  • Very large indexes can be built directly on disk with on_disk_build and then memory-mapped with load.

  • info returns a structured summary (dimension, metric, counts, and optional memory usage) suitable for programmatic inspection.

  • This wrapper stores user configuration (e.g., seed/verbosity) even before the C++ index exists and applies it deterministically upon construction.

Developer Notes:

  • Source of truth:

    • f (int) and metric_id (enum) describe configuration.

    • ptr is NULL when index is not constructed.

  • Invariant:

    • ptr != NULL implies f > 0 and metric_id != METRIC_UNKNOWN.
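The lazy-construction rules and the invariant above can be modeled with a small pure-Python class. This is a toy model under stated assumptions, not the wrapper: `ToyLazyIndex`, its `_ptr` dictionary standing in for the C++ pointer, and the exception types are all hypothetical.

```python
class ToyLazyIndex:
    """Pure-Python model of the documented lifecycle (hypothetical class)."""

    def __init__(self, f=0, metric=None):
        self.f = f or 0
        self.metric = metric
        self._ptr = None  # stands in for the C++ index pointer (NULL)
        self._built = False
        self._maybe_construct()

    def _maybe_construct(self):
        # The index is created once both f > 0 and metric are known.
        if self._ptr is None and self.f > 0 and self.metric is not None:
            self._ptr = {}  # item id -> vector

    def add_item(self, i, vector):
        if self._built:
            raise RuntimeError("index is built; call unbuild() first")
        if len(vector) == 0:
            raise ValueError("cannot infer f from an empty vector")
        if self.f == 0:
            self.f = len(vector)  # infer the dimension from the first vector
        if self.metric is None:
            self.metric = "angular"  # documented default on first add_item
        self._maybe_construct()
        if len(vector) != self.f:
            raise ValueError("vector length must equal f")
        self._ptr[i] = [float(x) for x in vector]
        return self  # enables method chaining

    def build(self):
        self._built = True  # read-only from here on
        return self

    def unbuild(self):
        self._built = False  # items may be added again
        return self

    def invariant_holds(self):
        # ptr != NULL implies f > 0 and a known metric.
        return self._ptr is None or (self.f > 0 and self.metric is not None)


toy = ToyLazyIndex()
toy.add_item(0, [1.0, 0.0, 0.0]).build()
print(toy.f, toy.metric, toy.invariant_holds())
```

The model omits trees and queries entirely; it only shows when the underlying index comes into existence and when `add_item` is allowed.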

Examples

>>> from annoy import Annoy, AnnoyIndex

High-level API:

>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index

The lifecycle follows the examples in test.ipynb:

  1. Construct the index

>>> import random; random.seed(0)
>>> # from annoy import AnnoyIndex
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index
>>> idx = Annoy(f=3, metric="angular")
>>> idx.f, idx.metric
(3, 'angular')

If you pass f=0, the dimension is inferred on the first call to add_item.

  2. Add items

>>> idx.add_item(0, [1.0, 0.0, 0.0])
>>> idx.add_item(1, [0.0, 1.0, 0.0])
>>> idx.add_item(2, [0.0, 0.0, 1.0])
>>> idx.get_n_items()
3

  3. Build the forest

>>> idx.build(n_trees=-1)
>>> idx.get_n_trees()
10
>>> idx.memory_usage()  # bytes
543076

After build the index becomes read-only. You can still query, save, load and serialize it.

  4. Query neighbours

By stored item id:

>>> idx.get_nns_by_item(0, 5)
[0, 1, 2, ...]

With distances:

>>> idx.get_nns_by_item(0, 5, include_distances=True)
([0, 1, 2, ...], [0.0, ...])

Or by an explicit query vector:

>>> idx.get_nns_by_vector([0.1, 0.2, 0.3], 5, include_distances=True)
([...], [...])

  5. Persistence

To work with memory-mapped indices on disk (the new index must use the same f and metric as the saved one):

>>> idx.save("annoy_test.annoy")
>>> idx2 = Annoy(f=3, metric="angular")
>>> idx2.load("annoy_test.annoy")
>>> idx2.get_n_items()
3

Or via raw bytes:

>>> buf = idx.serialize()
>>> new_idx = Annoy(f=3, metric="angular")
>>> new_idx.deserialize(buf)
>>> new_idx.get_n_items()
3

You can release OS resources with unload and drop the current forest with unbuild.

add_item(i, vector)#

Add a single embedding vector to the index.

Parameters:
i : int

Item id (index) must be non-negative. Ids may be non-contiguous; the index allocates up to max(i) + 1.

vector : sequence of float

1D embedding of length f. Values are converted to float. If f == 0 and this is the first item, f is inferred from vector and then fixed for the lifetime of this index.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Build the forest after adding items.

unbuild

Remove trees to allow adding more items.

get_nns_by_item, get_nns_by_vector

Query nearest neighbours.

Notes

Items must be added before calling build. After building the forest, further calls to add_item are not supported.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)

build(n_trees, n_jobs=-1)#

Build a forest of random projection trees.

Parameters:
n_trees : int

Number of trees in the forest. Larger values typically improve recall at the cost of slower build time and higher memory usage.

If n_trees=-1, trees are built dynamically until the number of nodes reaches approximately twice the number of items (_n_nodes >= 2 * n_items).

Guidelines:

  • Small datasets (<10k samples): 10-20 trees.

  • Medium datasets (10k-1M samples): 20-50 trees.

  • Large datasets (>1M samples): 50-100+ trees.

n_jobs : int, optional, default=-1

Number of threads to use while building. -1 means “auto” (use the implementation’s default, typically all available CPU cores).

Returns:
Annoy

This instance (self), enabling method chaining.

See also

fit

Build the index from X (preferred if you already have X available).

add_item

Add vectors before building.

unbuild

Drop trees to add more items.

rebuild

Return a new Annoy index rebuilt from the current index contents.

on_disk_build

Configure on-disk build mode.

get_nns_by_item, get_nns_by_vector

Query nearest neighbours.

save, load

Persist the index to/from disk.

Notes

After build completes, the index is read-only and can only be queried. To add more items, call unbuild, add the items, and then call build again.
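The tree-count guidelines listed under n_trees can be turned into a small lookup. The helper below is a hypothetical convenience, not part of the library; the handling of the exact boundaries (10k and 1M) and the choice of 100 as the upper end of "50-100+" are assumptions.

```python
def suggested_n_trees(n_samples):
    """Return a (low, high) tree-count range for a dataset size (hypothetical helper)."""
    if n_samples < 10_000:
        return (10, 20)  # small datasets
    if n_samples <= 1_000_000:
        return (20, 50)  # medium datasets
    return (50, 100)  # large datasets; "100+" means more trees may still help


for size in (5_000, 250_000, 5_000_000):
    print(size, suggested_n_trees(size))
```

These ranges are starting points only; recall should still be measured on the actual data.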

References

[1] Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)
>>> idx.build(10)

deserialize(byte, prefault=None)#

Restore the index from a serialized byte string.

Parameters:
byte : bytes

Byte string produced by serialize. Both native (legacy) blobs and portable blobs (created with serialize(format='portable')) are accepted; portable and canonical blobs are auto-detected. Canonical blobs restore by rebuilding the index deterministically.

prefault : bool or None, optional, default=None

Accepted for API symmetry with load. If None, the stored prefault value is used. Ignored for canonical blobs.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If deserialization fails due to invalid or incompatible data.

RuntimeError

If the index is not initialized.

See also

serialize

Create a binary snapshot of the index.

on_disk_build

Configure on-disk build mode.

Notes

Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
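To make the idea of a compatibility header concrete, here is a minimal sketch using the standard `struct` module. The layout, the magic bytes `TOYP`, the field order, and the helper names are all invented for illustration; the wrapper's real header format is not documented here and will differ.

```python
import struct
import sys

MAGIC = b"TOYP"  # hypothetical magic bytes, not the wrapper's real header
# magic, version, little-endian flag, sizeof(int), sizeof(float), metric id, f
HEADER = struct.Struct("<4sHBBBBI")


def _expected(metric_id, f, version=1):
    little = 1 if sys.byteorder == "little" else 0
    return (MAGIC, version, little, struct.calcsize("i"), struct.calcsize("f"), metric_id, f)


def pack_header(metric_id, f):
    """Build the toy header describing this process and index configuration."""
    return HEADER.pack(*_expected(metric_id, f))


def check_header(blob, metric_id, f):
    """Fail loudly on any mismatch, then return the payload after the header."""
    if len(blob) < HEADER.size:
        raise IOError("blob too short for header")
    if HEADER.unpack_from(blob) != _expected(metric_id, f):
        raise IOError("incompatible blob: header does not match this build/index")
    return blob[HEADER.size:]


blob = pack_header(metric_id=1, f=100) + b"native-payload"
print(check_header(blob, metric_id=1, f=100))
```

As the note above says, such a header only guards against mismatches; the payload after it is still a native snapshot.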

f#

Vector dimension.

Returns:
int

Dimension of each item vector. 0 means unknown / lazy.

Notes

  • Annoy(f=None, ...) is supported at construction time and is treated as f=0.

  • 0 (or None) means “unknown / lazy”: the first call to add_item will infer f from the input vector length and then fix it.

Changing f after the index has been initialized (items added and/or trees built) is a structural change: the stored items and all tree splits depend on the vector dimension.

For scikit-learn compatibility, assigning a different f (or None) on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.

feature_names_in_#

Input feature names seen during fit (SLEP007). Set only when explicitly provided via fit(…, feature_names=…).

fit(X=None, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None)#

Fit the Annoy index (scikit-learn compatible).

This method supports two deterministic workflows:

  1. Manual add/build: If X is None and y is None, fit() builds the forest using items previously added via add_item().

  2. Array-like X: If X is provided (2D array-like), fit() optionally resets or appends, adds all rows as items, then builds the forest.

Parameters:
X : array-like of shape (n_samples, n_features), default=None

Vectors to add to the index. If None (and y is None), fit() only builds.

y : array-like of shape (n_samples,), default=None

Optional labels associated with X. Stored as y after successful build.

n_trees : int, default=-1

Number of trees to build. Use -1 for Annoy’s internal default.

n_jobs : int, default=-1

Number of threads to use during build (-1 means “auto”).

reset : bool, default=True

If True, clear existing items before adding X. If False, append.

start_index : int or None, default=None

Item id for the first row of X. If None, uses 0 when reset=True, otherwise uses current n_items when reset=False.

missing_value : float or None, default=None

If not None, imputes missing entries in X.

  • Dense rows: replaces None elements with missing_value.

  • Dict rows: fills missing keys (and None values) with missing_value.

If None, missing entries raise an error (strict mode).
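The start_index and missing_value rules above can be sketched as two small helpers. These are illustrative only: the function names are hypothetical, dict rows are assumed to be keyed by integer column index, and `ValueError` in strict mode is an assumed error type.

```python
def resolve_start_index(start_index, reset, n_items):
    """Item id for the first row of X, following the documented defaults."""
    if start_index is not None:
        return start_index
    return 0 if reset else n_items  # append after the current items


def impute_row(row, f, missing_value=None):
    """Return a dense list of length f (dict keys assumed to be column indices)."""
    if isinstance(row, dict):
        values = [row.get(j) for j in range(f)]  # missing keys become None
    else:
        values = list(row)
    if missing_value is None:
        # Strict mode: any missing entry is an error.
        if any(v is None for v in values):
            raise ValueError("missing entry and no missing_value given")
        return [float(v) for v in values]
    return [float(missing_value) if v is None else float(v) for v in values]


print(impute_row([1.0, None, 3.0], 3, missing_value=0.0))
print(impute_row({0: 1.0, 2: None}, 3, missing_value=-1.0))
print(resolve_start_index(None, reset=False, n_items=7))
```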

Returns:
Annoy

This instance (self), enabling method chaining.

See also

fit_transform

Estimator-style APIs.

transform

Query the built index.

add_item

Add one item at a time.

build

Build the forest after manual calls to add_item.

on_disk_build

Configure on-disk build mode.

unbuild

Remove trees so items can be appended.

y

Stored labels y (if provided).

get_params, set_params

Estimator parameter API.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx = AnnoyIndex().set_params(metric=m).fit(X)
...     print(m, idx.transform(q))
...
>>> base = AnnoyIndex().fit(X)
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx_m = base.rebuild(metric=m)  # rebuild-from-index
...     print(m, idx_m.transform(q))  # no .fit(X) here

fit_transform(X, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None, n_neighbors=None, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None)#

Fit the index and transform X in a single deterministic call.

This is equivalent to:

self.fit(X, y=y, n_trees=…, n_jobs=…, reset=…, start_index=…, missing_value=…)
self.transform(X, n_neighbors=…, search_k=…, include_distances=…, return_labels=…, y_fill_value=…, missing_value=…)

See also

fit

Build the index from X (preferred if you already have X available).

transform

Query the built index.

on_disk_build

Configure on-disk build mode.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     print(m, AnnoyIndex().set_params(metric=m).fit_transform(q))

get_distance(i, j) → float#

Return the distance between two stored items.

Parameters:
i, j : int

Item ids (index) of two stored samples.

Returns:
d : float

Distance between items i and j under the current metric.

Raises:
RuntimeError

If the index is not initialized.

IndexError

If either index is out of range.
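For sanity-checking values returned by get_distance, reference implementations of three metrics can be written in plain Python. This sketch assumes the wrapper reports the same values as upstream Annoy, where angular distance is the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos(u, v))). The function names are hypothetical, and the dot and hamming metrics are left out on purpose.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), as described in the upstream Annoy README."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vectors are not handled in this sketch")
    cosine = dot / (norm_u * norm_v)
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine))  # clamp rounding noise


def euclidean_distance(u, v):
    return math.dist(u, v)  # L2 distance


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))  # L1 distance


# Two orthogonal unit vectors: cos = 0, so the angular distance is sqrt(2).
print(angular_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))
```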

get_feature_names_out(input_features=None)#

Get output feature names for the transformer-style API.

Parameters:
input_features : sequence of str or None, optional, default=None

If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.

Returns:
tuple of str

Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.

Raises:
AttributeError

If called before fit/build.

ValueError

If input_features is provided but does not match feature_names_in_.
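The naming scheme and the validation rule described above can be sketched as follows. This is a stand-alone illustration with a hypothetical function name and arguments; it does not reproduce the wrapper's dimensionality check or its AttributeError for unfitted indexes.

```python
def feature_names_out(n_neighbors, input_features=None, feature_names_in=None):
    """Return ('neighbor_0', ..., 'neighbor_{k-1}') with k == n_neighbors."""
    if input_features is not None and feature_names_in is not None:
        # Validate against the names seen during fit, if any were stored.
        if list(input_features) != list(feature_names_in):
            raise ValueError("input_features does not match feature_names_in_")
    return tuple(f"neighbor_{j}" for j in range(n_neighbors))


print(feature_names_out(3))
```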

get_item_vector(i) → list[float]#

Return the stored embedding vector for a given item id.

Parameters:
i : int

Item id (index) previously passed to add_item.

Returns:
vector : list[float]

Stored embedding of length f.

Raises:
RuntimeError

If the index is not initialized.

IndexError

If i is out of range.

get_n_items() → int#

Return the number of stored items in the index.

Returns:
n_items : int

Number of items that have been added and are currently addressable.

Raises:
RuntimeError

If the index is not initialized.

get_n_trees() → int#

Return the number of trees in the current forest.

Returns:
n_trees : int

Number of trees that have been built.

Raises:
RuntimeError

If the index is not initialized.

get_nns_by_item(i, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a stored item id.

Parameters:
i : int

Item id (index) previously passed to add_item(i, embedding).

n : int

Number of nearest neighbours to return.

search_k : int, optional, default=-1

Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional, default=False

If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:
indices : list[int] | tuple[list[int], list[float]]

If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:
RuntimeError

If the index is not initialized or has not been built.

IndexError

If i is out of range.

See also

get_nns_by_vector

Query with an explicit query embedding.
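The default for search_k described above amounts to a one-line rule. The helper below is a hypothetical illustration of that rule (the text says "approximately", so the exact internal value may differ).

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Node budget for a query; -1 falls back to roughly n_trees * n."""
    return n_trees * n if search_k == -1 else search_k


print(effective_search_k(n=10, n_trees=20))
print(effective_search_k(n=10, n_trees=20, search_k=5000))
```

Passing an explicit search_k larger than this default is the usual way to trade query time for recall without rebuilding the forest.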

get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a query embedding.

Parameters:
vector : sequence of float

Query embedding of length f.

n : int

Number of nearest neighbours to return.

search_k : int, optional, default=-1

Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distances : bool, optional, default=False

If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:
indices : list[int] | tuple[list[int], list[float]]

If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:
RuntimeError

If the index is not initialized or has not been built.

ValueError

If len(vector) != f.

See also

get_nns_by_item

Query by stored item id.
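Because results are approximate, it can help to measure recall against an exact brute-force search when tuning n_trees and search_k. The sketch below is self-contained and does not call the library; `exact_neighbors` and `recall_at_n` are hypothetical helper names, and Euclidean distance is used for the ground truth.

```python
import math
import random


def exact_neighbors(vectors, query, n):
    """Brute-force ids of the n nearest vectors by Euclidean distance."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:n]


def recall_at_n(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact.intersection(approx_ids)) / len(exact)


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(8)]
truth = exact_neighbors(vectors, query, 10)

# In practice approx_ids would come from get_nns_by_vector(query, 10) on an
# index built from the same vectors; here two misses are simulated by hand.
approx_ids = truth[:8] + [-1, -2]
print(recall_at_n(approx_ids, truth))
```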

get_params(deep=True) → dict#

Return estimator-style parameters (scikit-learn compatibility).

Parameters:
deep : bool, optional, default=True

Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.

Returns:
params : dict

Dictionary of stable, user-facing parameters.

See also

set_params

Set estimator-style parameters.

schema_version

Controls pickle / snapshot strategy.

Notes

This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.

info(include_n_items=True, include_n_trees=True, include_memory=None) → dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Parameters:
include_n_items : bool, optional, default=True

If True, include n_items.

include_n_trees : bool, optional, default=True

If True, include n_trees.

include_memory : bool or None, optional, default=None

Controls whether memory usage fields are included.

  • None: include memory usage only if the index is built.

  • True: include memory usage if available (built).

  • False: omit memory usage fields.

Memory usage is computed after build and may be expensive for very large indexes.

Returns:
info : dict

Dictionary describing the current index state.

See also

serialize

Create a binary snapshot of the index.

deserialize

Restore from a binary snapshot.

save

Persist the index to disk.

load

Load the index from disk.

Notes

  • Some keys are optional depending on include_* flags.

Keys:

  • f : int, default=0

    Dimensionality of the index.

  • metric : str, default=‘angular’

    Distance metric name.

  • on_disk_path : str, default=''

    Path used for on-disk build, if configured.

  • prefault : bool, default=False

    If True, aggressively fault pages into memory during save. Primarily useful on some platforms for very large indexes.

  • schema_version : int, default=0

    Stored schema/version marker on this object (reserved for future use).

  • seed : int or None, optional, default=None

    Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

  • verbose : int or None, optional, default=None

    Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:

    • <= 0 : quiet (warnings only)

    • 1 : info (Annoy’s verbose=True)

    • >= 2 : debug (currently same as info, reserved for future use)

Optional Keys:

  • n_items : int

    Number of items currently stored.

  • n_trees : int

    Number of built trees in the forest.

  • memory_usage_byte : int

    Approximate memory usage in bytes. Present only when requested and available.

  • memory_usage_mib : float

    Approximate memory usage in MiB. Present only when requested and available.

Examples

>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000

load(fn, prefault=None)#

Load (mmap) an index from disk into the current object.

Parameters:
fn : str

Path to a file previously created by save or on_disk_build.

prefault : bool or None, optional, default=None

If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If the file cannot be opened or mapped.

RuntimeError

If the index is not initialized or the file is incompatible.

See also

save

Save the current index to disk.

on_disk_build

Build directly using an on-disk backing file.

unload

Release mmap resources.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.

memory_usage() → int#

Approximate memory usage of the index in bytes.

Returns:
n_bytes : int or None

Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.

Raises:
RuntimeError

If memory usage cannot be computed.

metric#

Distance metric for the index. Valid values:

  • ‘angular’ -> Cosine-like distance on normalized vectors.

  • ‘euclidean’ -> L2 distance.

  • ‘manhattan’ -> L1 distance.

  • ‘dot’ -> Negative dot-product distance (inner product).

  • ‘hamming’ -> Hamming distance for binary vectors.

Aliases (case-insensitive):

  • angular : cosine

  • euclidean : l2, lstsq

  • manhattan : l1, cityblock, taxicab

  • dot : @, ., dotproduct, inner, innerproduct

  • hamming : hamming

Returns:
str or None

Canonical metric name, or None if not configured yet.

Notes

Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.

For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
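The alias table above maps many spellings onto five canonical names, case-insensitively. A minimal sketch of such a lookup is shown below; the names `_ALIASES`, `_LOOKUP`, and `canonical_metric` are hypothetical, and raising `ValueError` for an unknown name is an assumption about error handling.

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Invert the table once: alias -> canonical name.
_LOOKUP = {alias: canon for canon, names in _ALIASES.items() for alias in names}


def canonical_metric(name):
    """Return the canonical metric name, or None if no metric is configured."""
    if name is None:
        return None
    try:
        return _LOOKUP[name.lower()]  # aliases are case-insensitive
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None


for spelling in ("L2", "Cosine", ".", "taxicab"):
    print(spelling, "->", canonical_metric(spelling))
```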

n_features#

Alias of f (dimension), provided for scikit-learn naming parity.

n_features_#

Read-only alias of n_features_in_.

n_features_in_#

Number of features seen during fit (scikit-learn compatible). Alias of f when available.

n_features_out_#

Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.

n_neighbors#

Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_build(fn)#

Configure the index to build using an on-disk backing file.

Parameters:
fn : str

Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Build trees after adding items (on-disk backed).

rebuild

Return a new Annoy index rebuilt from the current index contents.

fit

Build the index from X (preferred if you already have X available).

load

Memory-map the built index.

save

Persist the built index to disk.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

on_disk_path#

Path used for on-disk build/load/save operations.

Returns:
str or None

Filesystem path used for on-disk operations, or None if not configured.

Notes

  • Assigning a string/PathLike to on_disk_path configures on-disk build mode (equivalent to calling on_disk_build with the same filename).

  • Note: Annoy core truncates the target file when enabling on-disk build. on_disk_path is strictly equivalent to calling on_disk_build with the same filename (truncate allowed).

  • Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.

  • Clearing/changing this while an on-disk index is active is disallowed. Call unload first.

prefault#

Default prefault flag stored on the object.

This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.

Returns:
bool

Current prefault flag.

Notes

  • This flag does not retroactively change already-loaded mappings.

random_state#

Alias of seed (scikit-learn convention).

rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) → Annoy#

Return a new Annoy index rebuilt from the current index contents.

This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.

Parameters:
metric : {‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’} or None, optional

Metric for the new index. If None, reuse the current metric.

on_disk_path : path-like or None, optional

Optional on-disk build path for the new index.

Safety: the source object’s on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source’s configured path, it is ignored to avoid accidental overwrite/truncation hazards.

n_trees : int or None, optional

If provided, build the new index with this number of trees (or -1 for Annoy’s internal auto mode). If None, reuse the source’s tree count only when the source index is already built; otherwise do not build.

n_jobs : int, optional, default=-1

Number of threads to use while building (-1 means “auto”).

Returns:
Annoy

A new Annoy instance containing the same items (and y metadata if present).

See also

build

Build trees after adding items (on-disk backed).

on_disk_build

Configure on-disk build mode.

fit

Build the index from X (preferred if you already have X available).

get_params

Read constructor parameters.

set_params

Update estimator parameters (use with fit(X) when refitting from data).

serialize, deserialize

Persist / restore indexes; canonical restores rebuild deterministically.

__sklearn_clone__

Unfitted clone hook (no fitted state).

Notes

rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.

Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).
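The on_disk_path safety rule and the n_trees fallback described for rebuild can be summarized in a small decision function. This is a sketch of the documented rules only; `plan_rebuild` and its arguments are hypothetical and nothing here touches the file system.

```python
import os


def plan_rebuild(source_path, source_built, source_n_trees,
                 on_disk_path=None, n_trees=None):
    """Decide the new index's on-disk path and tree count (illustrative only)."""
    path = None if on_disk_path is None else os.fspath(on_disk_path)
    if path is not None and path == source_path:
        path = None  # string-equal to the source path: ignored to avoid truncation
    if n_trees is None:
        # Reuse the source's tree count only if the source is built;
        # otherwise None means "do not build".
        n_trees = source_n_trees if source_built else None
    return path, n_trees


print(plan_rebuild("a.annoy", True, 10, on_disk_path="a.annoy"))
print(plan_rebuild("a.annoy", True, 10, on_disk_path="b.annoy", n_trees=-1))
print(plan_rebuild("a.annoy", False, 0))
```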

repr_info(include_n_items=True, include_n_trees=True, include_memory=None) → str#

Return a dict-like string representation with optional extra fields.

Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory is calculated after build.

save(fn, prefault=None)#

Persist the index to a binary file on disk.

Parameters:
fn : str

Path to the output file. Existing files will be overwritten.

prefault : bool or None, optional, default=None

If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If the file cannot be written.

RuntimeError

If the index is not initialized or save fails.

See also

load

Load an index from disk.

on_disk_build

Configure on-disk build mode.

serialize

Snapshot to bytes for in-memory persistence.

deserialize

Restore an index from a serialized byte string.

Notes

The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.

schema_version#

Serialization/compatibility strategy marker sentinel value.

This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.

Returns:
int

Current schema version marker.

Notes

  • 0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).

  • 2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).

  • >=3: pickle stores both portable and canonical; canonical is used as a fallback.
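The mapping from schema_version to the snapshot kinds stored in a pickle can be written out as a lookup. This is a reading aid rather than the wrapper's code: the function name is hypothetical, and rejecting negative values with `ValueError` is an assumption.

```python
def pickle_snapshot_kinds(schema_version=None):
    """Return which snapshot kinds a pickle would store for a schema_version."""
    version = 0 if schema_version is None else int(schema_version)  # None -> 0
    if version < 0:
        raise ValueError("schema_version must be non-negative")  # assumption
    if version <= 1:
        return ("portable",)  # fast restore, ABI-checked
    if version == 2:
        return ("canonical",)  # restores by rebuilding deterministically
    return ("portable", "canonical")  # canonical is the fallback


for v in (None, 1, 2, 3):
    print(v, pickle_snapshot_kinds(v))
```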

seed#

Random seed override (scikit-learn compatible). None means use Annoy default seed.

serialize(format=None) → bytes#

Serialize the built in-memory index into a byte string.

Parameters:
format : {“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format.

  • “native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.

  • “portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.

  • “canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:
data : bytes

Opaque binary blob containing the Annoy index.

Raises:
RuntimeError

If the index is not initialized or serialization fails.

OverflowError

If the serialized payload is too large to fit in a Python bytes object.

See also

deserialize

Restore an index from a serialized byte string.

on_disk_build

Configure on-disk build mode.

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.

set_params(**params) → Annoy#

Set estimator-style parameters (scikit-learn compatibility).

Parameters:
**params

Keyword parameters to set. Unknown keys raise ValueError.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
ValueError

If an unknown parameter name is provided.

TypeError

If parameter names are not strings or types are invalid.

See also

get_params

Return estimator-style parameters.

Notes

Changing structural parameters (notably metric) on an already initialized index resets the index deterministically (drops all items, trees, and y). Refit/rebuild is required before querying.

This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.
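The reset-on-structural-change behavior can be pictured with a toy estimator. This is a sketch only: `ToyIndex` is a hypothetical stand-in, not the real class, and treating `f` as structural alongside `metric` is an assumption (the notes name only `metric` explicitly).

```python
class ToyIndex:
    """Toy estimator showing reset-on-structural-change (not the real class)."""

    _structural = {"f", "metric"}  # assumption: which keys count as structural

    def __init__(self, f=0, metric="angular", n_neighbors=5):
        self.f = f
        self.metric = metric
        self.n_neighbors = n_neighbors
        self.items = []  # stand-in for fitted state (items, trees, y)

    def get_params(self):
        return {"f": self.f, "metric": self.metric,
                "n_neighbors": self.n_neighbors}

    def set_params(self, **params):
        current = self.get_params()
        # Validate every name first so a bad key changes nothing.
        for name in params:
            if name not in current:
                raise ValueError(f"unknown parameter: {name!r}")
        for name, value in params.items():
            if name in self._structural and value != current[name]:
                self.items = []  # learned state is invalidated
            setattr(self, name, value)
        return self  # enables method chaining
```

In this sketch, changing `n_neighbors` leaves the stored items untouched, while changing `metric` empties them, so a refit is needed before querying.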

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:
seedint or None, optional, default=None

Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

  • If omitted or None, the seed is set to Annoy’s default seed.

  • If 0, any pending override is cleared and the seed is reset to Annoy’s default seed (a UserWarning is emitted).

Returns:
Annoy

This instance (self), enabling method chaining.

See also

seed

Parameter attribute (int | None).

Notes

Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.
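The seed rules described for this method can be sketched as a small normalization step. `normalize_seed` is a hypothetical helper, and the exception types for invalid input are assumptions; only the handling of None and 0 is taken from the parameter description.

```python
import warnings


def normalize_seed(seed=None):
    """Return the override to apply, or None for Annoy's default seed."""
    if seed is None:
        return None  # no override: use the default seed
    if isinstance(seed, bool) or not isinstance(seed, int):
        # Assumption: the exception type is not stated in the docs.
        raise TypeError("seed must be an int or None")
    if seed < 0:
        # Assumption: the exception type is not stated in the docs.
        raise ValueError("seed must be non-negative")
    if seed == 0:
        warnings.warn("seed=0 resets to the default seed", UserWarning)
        return None  # clear any pending override
    return seed
```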

set_verbose(level=1)#

Set the verbosity level (callable setter).

This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.

Parameters:
levelint, optional, default=1

Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. The logging levels are inspired by gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

Returns:
Annoy

This instance (self), enabling method chaining.

See also

verbose

Parameter attribute (int | None).

set_verbosity

Alias of set_verbose.

get_params, set_params

Estimator parameter API.
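The clamping and level mapping described for set_verbose can be sketched as follows. `clamp_verbosity` is a hypothetical helper used only to illustrate the documented ranges; it is not part of the class.

```python
def clamp_verbosity(level=1):
    """Return (clamped level, Annoy verbose flag, level name)."""
    clamped = max(-2, min(2, int(level)))  # clamp to [-2, 2]
    annoy_verbose = clamped >= 1           # level >= 1 enables verbose logging
    if clamped <= 0:
        name = "quiet"
    elif clamped == 1:
        name = "info"
    else:
        name = "debug"  # currently the same as info
    return clamped, annoy_verbose, name
```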

set_verbosity(level=1)#

Alias of set_verbose.

See also

verbose

Parameter attribute (int | None).

set_verbose

Set the verbosity level (callable setter).

transform(X, *, n_neighbors=5, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None, input_type='vector', output_type='vector', exclude_self=False, exclude_items=None, missing_value=None)#

Transform queries into nearest-neighbor results (ids or vectors; optional distances / labels).

Parameters:
Xarray-like

Query inputs. The expected shape/type depends on input_type:

  • input_type='item': X must be a 1D sequence of item ids.

  • input_type='vector': X must be a 2D array-like of shape (n_queries, f).

n_neighborsint or None, default=5

Number of neighbors to retrieve for each query. For backwards compatibility this keyword is accepted, but it must match the estimator parameter n_neighbors (STRICT schema).

search_kint, default=-1

Search parameter passed to Annoy (-1 uses Annoy’s default).

include_distancesbool, default=False

If True, also return per-neighbor distances.

return_labelsbool, default=False

If True, also return per-neighbor labels resolved from y (as set via fit).

y_fill_valueobject, default=None

Value used when y is unset or missing an entry for a neighbor id.

input_type{'vector', 'item'}, default='vector'

Controls how X is interpreted.

output_type{'vector', 'item'}, default='vector'

Controls what neighbors are returned.

  • output_type='item': return neighbor ids.

  • output_type='vector': return neighbor vectors.

exclude_selfbool, default=False

If True, exclude the query item id from results. Only valid when input_type='item'.

exclude_itemssequence of int or None, default=None

Explicit neighbor ids to exclude from results.

missing_valuefloat or None, default=None

If not None, imputes missing entries in X (None values in dense rows; missing keys / None values in dict rows). If None, missing entries raise.

Returns:
neighborslist

Neighbor results for each query.

  • output_type='item': list of list of int

  • output_type='vector': list of list of list of float

(neighbors, distances)tuple

Returned when include_distances=True.

(neighbors, labels)tuple

Returned when return_labels=True.

(neighbors, distances, labels)tuple

Returned when include_distances=True and return_labels=True.

See also

get_nns_by_item

Neighbor search by item id.

get_nns_by_vector

Neighbor search by query vector.

fit

Build the index from X (preferred if you already have X available).

fit_transform

Estimator-style APIs.

Notes

  • Excluding self is performed by matching neighbor ids to the query id (not by checking distance values).

  • For input_type='vector', exclude_self=True is an error; use exclude_items for explicit, deterministic filtering.

  • If exclusions prevent returning exactly n_neighbors results, this method raises ValueError.
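The exclusion rules in these notes can be sketched as a filter over a nearest-first candidate list. `filter_neighbors` is a hypothetical helper, and raising ValueError for exclude_self without an item query is an assumption (the notes call it an error without naming the type).

```python
def filter_neighbors(candidate_ids, n_neighbors, query_id=None,
                     exclude_self=False, exclude_items=None):
    """Filter nearest-first candidate ids by id, then keep n_neighbors."""
    if exclude_self and query_id is None:
        # Assumption: the exact exception type is not documented.
        raise ValueError("exclude_self requires an item-id query")
    excluded = set(exclude_items or ())
    if exclude_self:
        excluded.add(query_id)  # match on id, not on a zero distance
    kept = [i for i in candidate_ids if i not in excluded]
    if len(kept) < n_neighbors:
        raise ValueError("exclusions left fewer than n_neighbors results")
    return kept[:n_neighbors]
```

Matching on ids keeps a distinct item with an identical vector (distance 0) in the results, which a distance-based check would wrongly drop.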

Examples

Item queries (exclude the query id itself):

>>> idx.transform([10, 20], input_type='item', output_type='item', n_neighbors=5, exclude_self=True)

Vector queries (exclude explicit ids):

>>> idx.transform(X_query, input_type='vector', output_type='item', n_neighbors=5, exclude_items=[10, 20])

Return neighbor vectors:

>>> idx.transform([10], input_type='item', output_type='vector', n_neighbors=5, exclude_self=True)
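The missing_value parameter can be pictured as a per-row imputation step applied before querying. The sketch below is illustrative only: `impute_row` is a hypothetical helper, it assumes dict rows are keyed by dimension index, and raising ValueError for a missing entry is an assumption (the parameter description says only that missing entries raise).

```python
def impute_row(row, f, missing_value=None):
    """Return a dense list of f floats, filling gaps with missing_value."""
    if isinstance(row, dict):
        # Assumption: dict rows map dimension index -> value.
        values = [row.get(j) for j in range(f)]
    else:
        values = list(row)
        if len(values) != f:
            raise ValueError(f"expected {f} values, got {len(values)}")
    dense = []
    for j, value in enumerate(values):
        if value is None:
            if missing_value is None:
                raise ValueError(f"missing entry at dimension {j}")
            value = missing_value
        dense.append(float(value))
    return dense
```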

unbuild()#

Discard the current forest, allowing new items to be added.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Rebuild the forest after adding new items.

rebuild

Return a new Annoy index rebuilt from the current index contents.

fit

Build the index from X (preferred if you already have X available).

add_item

Add items (only valid when no trees are built).

Notes

After calling unbuild, you must call build again before running nearest-neighbor queries.
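The build/unbuild lifecycle can be pictured as a two-state machine. `ToyForest` below is a hypothetical stand-in that tracks only the state transitions; it performs no indexing, and the RuntimeError type is an assumption.

```python
class ToyForest:
    """Toy model of the add -> build -> unbuild -> add -> build cycle."""

    def __init__(self):
        self.items = {}
        self.built = False

    def add_item(self, i, vector):
        if self.built:
            # Assumption: the real exception type may differ.
            raise RuntimeError("trees are built; call unbuild() first")
        self.items[i] = list(vector)
        return self

    def build(self):
        self.built = True  # queries are allowed from here on
        return self

    def unbuild(self):
        self.built = False  # items are kept, the forest is discarded
        return self


forest = ToyForest().add_item(0, [0.0, 1.0]).build()
forest.unbuild().add_item(1, [1.0, 0.0]).build()
```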

unload()#

Unmap any memory-mapped file backing this index.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

load

Memory-map an on-disk index into this object.

on_disk_build

Configure on-disk build mode.

Notes

This releases OS-level resources associated with the mmap, but keeps the Python object alive.

verbose#

Verbosity level in [-2, 2] or None (unset). Callable setter: set_verbose().

y#

Labels / targets associated with the index items.

Notes

If provided to fit(X, y), labels are stored here after a successful build. You may also set this property manually. When possible, the setter enforces that len(y) matches the current number of items (n_items).
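How stored labels combine with y_fill_value during label lookup can be sketched as follows. `resolve_labels` is a hypothetical helper, and it assumes y is a sequence indexed by item id; other label containers are not covered.

```python
def resolve_labels(neighbor_ids, y=None, y_fill_value=None):
    """Map each row of neighbor ids to labels, filling gaps with a default."""
    labels = []
    for row in neighbor_ids:
        resolved = []
        for item_id in row:
            if y is not None and 0 <= item_id < len(y):
                resolved.append(y[item_id])
            else:
                # y is unset or has no entry for this id
                resolved.append(y_fill_value)
        labels.append(resolved)
    return labels
```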