Index#

class scikitplot.annoy.Index[source]#

High-level ANNoy index composed from mixins.

Parameters:
fint or None, optional, default=None

Vector dimension. If 0 or None, the dimension is inferred from the first vector passed to add_item (lazy mode). None is treated as 0 (reset to default).

metric{“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”} or None, optional, default=None

Distance metric (one of ‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’). If None (omitted), the behavior depends on f:

  • If f > 0: defaults to 'angular' (cosine-like; legacy behavior; may emit a FutureWarning).

  • If f == 0: leaves the metric unset (lazy). You may set metric later before construction, or it will default to 'angular' on first add_item.

n_neighborsint, default=5

Number of neighbors to retrieve for each query. Must be a non-negative integer.

on_disk_pathstr or None, optional, default=None

If provided, configures the path for on-disk building. When the underlying index exists, this enables on-disk build mode (equivalent to calling on_disk_build with the same filename).

Note: Annoy core truncates the target file when enabling on-disk build. This wrapper treats on_disk_path as strictly equivalent to calling on_disk_build with the same filename (truncate allowed).

In lazy mode (f==0 and/or metric is None), activation occurs once the underlying C++ index is created.

prefaultbool or None, optional, default=None

If True, request that index pages be prefaulted into memory when loading (when supported by the underlying platform/backing). If None, treated as False (reset to default).

seedint or None, optional, default=None

Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created. Seed value 0 is treated as “use Annoy’s deterministic default seed” (a UserWarning is emitted when 0 is explicitly provided).

verboseint or None, optional, default=None

Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

schema_versionint, optional, default=None

Serialization/compatibility strategy marker.

This does not change the Annoy on-disk format, but it does control how the index is snapshotted in pickles.

  • 0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).

  • 2: pickle stores canonical-v1 (portable across ABIs; restores by rebuilding deterministically).

  • >=3: pickle stores both portable and canonical (canonical is used as a fallback if the ABI check fails).

If None, treated as 0 (reset to default).

Attributes:
fint, default=0

Vector dimension.

metric{‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’}, default=”angular”

Distance metric for the index.

n_neighborsint, default=5

Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_pathstr or None, optional, default=None

Path used for on-disk build/load/save operations.

seed, random_stateint or None, optional, default=None

Non-negative integer seed.

verboseint or None, optional, default=None

Verbosity level in [-2, 2] or None (unset).

prefaultbool, default=False

Default prefault flag stored on the object.

schema_versionint, default=0

Serialization/compatibility strategy marker.

n_features, n_features_, n_features_in_int

Alias of f (dimension), provided for scikit-learn naming parity.

n_features_out_int

Number of output features produced by transform (SLEP013).

feature_names_in_list-like

Input feature names seen during fit (SLEP007).

ydict | None, optional, default=None

Labels / targets associated with the index items.

pickle_modePickleMode

Pickle strategy used by PickleMixin.

compress_modeCompressMode or None

Optional compression used by PickleMixin when serializing to bytes.

Notes

This class is a direct subclass of the C-extension backend. It does not override __new__ and does not rely on cooperative initialization across mixins. Mixins must be written so that their methods work even if they define no __init__ at all.
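
The defaulting rules described for f, metric, and verbose can be summarized in a small sketch. The helper name normalize_init is hypothetical and this is not the library's code; it only mirrors the documented behavior (None treated as 0 for f, 'angular' chosen when f > 0 and no metric is given, verbosity clamped to [-2, 2]).

```python
import warnings


def normalize_init(f=None, metric=None, verbose=None):
    """Hypothetical helper mirroring the documented defaults (not library code)."""
    # None is documented as equivalent to 0 (lazy dimension).
    f = 0 if f is None else int(f)

    if metric is None and f > 0:
        # Documented legacy default; the docs say a FutureWarning may be emitted.
        warnings.warn("metric defaults to 'angular'", FutureWarning, stacklevel=2)
        metric = "angular"
    # With f == 0 the metric stays unset (None) until it is set or add_item runs.

    if verbose is not None:
        # Clamp to the documented range [-2, 2].
        verbose = max(-2, min(2, int(verbose)))
    # Level >= 1 corresponds to Annoy's verbose=True; anything else is quiet.
    annoy_verbose = verbose is not None and verbose >= 1
    return f, metric, verbose, annoy_verbose
```

For instance, normalize_init(4, "euclidean", verbose=7) clamps the verbosity to 2 and reports that Annoy's verbose logging would be enabled.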

add_item(i, vector)#

Add a single embedding vector to the index.

Parameters:
iint

Item id (index); must be non-negative. Ids may be non-contiguous; the index allocates space for max(i) + 1 items.

vectorsequence of float

1D embedding of length f. Values are converted to float. If f == 0 and this is the first item, f is inferred from vector and then fixed for the lifetime of this index.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Build the forest after adding items.

unbuild

Remove trees to allow adding more items.

get_nns_by_item, get_nns_by_vector

Query nearest neighbours.

Notes

Items must be added before calling build. After building the forest, further calls to add_item are not supported.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f=100
>>> n=1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...    v = [random.gauss(0, 1) for _ in range(f)]
...    idx.add_item(i, v)
add_items(X, ids=None, *, start_id=None, accept_sparse='error', ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, order='C', check_unique_ids=True)[source]#

Add many vectors to the index.

Parameters:
Xarray-like of shape (n_samples, n_features)

Vectors to add.

idsarray-like of shape (n_samples,), optional

Explicit integer ids. If omitted, ids are allocated as a contiguous range starting at start_id (or get_n_items() at call time).

start_idint, optional

Starting id used when ids is None. If None, defaults to backend.get_n_items() at call time.

accept_sparse{‘error’, ‘toarray’}, default=’error’

Sparse input handling. 'toarray' densifies SciPy sparse inputs explicitly. Any other sparse behavior raises.

ensure_all_finitebool or ‘allow-nan’, default=True

Finiteness validation policy.

copybool, default=False

If True, copy the validated dense array before adding.

dtypenumpy dtype, default=numpy.float32

Dtype passed to the backend.

order{‘C’, ‘F’, ‘A’, ‘K’}, default=’C’

Memory order used when coercing X.

check_unique_idsbool, default=True

If True, require ids to be unique.

Returns:
ids_outnumpy.ndarray of shape (n_samples,)

The ids that were added, as int64.

Raises:
RuntimeError

If the backend indicates the index is built.

TypeError

If sparse input is given while accept_sparse='error'.

ValueError

If X is not 2D, feature dimensions mismatch f, ids are invalid, or finiteness policy is violated.

Return type:

ndarray

See also

get_item_vectors

Fetch vectors by id selection.

to_numpy

Export vectors as a dense NumPy array.

Notes

This method is deterministic: ids are generated predictably and vectors are added in row order.
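
The id allocation rule for add_items can be sketched as follows. The function allocate_ids is a hypothetical stand-in, not the library's implementation, and the use of ValueError for every invalid case is an assumption based on the Raises section.

```python
def allocate_ids(n_samples, n_items, ids=None, start_id=None, check_unique_ids=True):
    """Hypothetical sketch of the documented id allocation for add_items."""
    if ids is None:
        # Contiguous range starting at start_id, or at the current item count.
        first = n_items if start_id is None else int(start_id)
        if first < 0:
            raise ValueError("start_id must be non-negative")
        return list(range(first, first + n_samples))

    ids = [int(i) for i in ids]
    if len(ids) != n_samples:
        raise ValueError("ids must have one entry per row of X")
    if any(i < 0 for i in ids):
        raise ValueError("ids must be non-negative")
    if check_unique_ids and len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return ids
```

With three rows and five items already stored, allocate_ids(3, 5) yields [5, 6, 7], which is what makes repeated calls append in a predictable order.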

property backend: Annoy#

Public alias for _backend.

Returns:
backendscikitplot.cexternals._annoy.Annoy

Low-level Annoy backend instance.

build(n_trees, n_jobs=-1)#

Build a forest of random projection trees.

Parameters:
n_treesint

Number of trees in the forest. Larger values typically improve recall at the cost of slower build time and higher memory usage.

If n_trees=-1, trees are built dynamically until the index holds approximately twice as many nodes as items (_n_nodes >= 2 * n_items).

Guidelines:

  • Small datasets (<10k samples): 10-20 trees.

  • Medium datasets (10k-1M samples): 20-50 trees.

  • Large datasets (>1M samples): 50-100+ trees.

n_jobsint, optional, default=-1

Number of threads to use while building. -1 means “auto” (use the implementation’s default, typically all available CPU cores).

Returns:
Annoy

This instance (self), enabling method chaining.

See also

fit

Build the index from X (preferred if you already have X available).

add_item

Add vectors before building.

unbuild

Drop trees to add more items.

rebuild

Return a new Annoy index rebuilt from the current index contents.

on_disk_build

Configure on-disk build mode.

get_nns_by_item, get_nns_by_vector

Query nearest neighbours.

save, load

Persist the index to/from disk.

Notes

After build completes, the index is read-only and serves queries only. To add more items, call unbuild, add the items, and then call build again.
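
The n_trees guidelines given for build can be expressed as a simple lookup. The helper suggest_n_trees is hypothetical and only restates those rules of thumb; the upper bound for large datasets is open-ended ("100+") in the guidelines, so the 100 returned here is a starting point rather than a cap.

```python
def suggest_n_trees(n_samples):
    """Return a (low, high) starting range for n_trees (hypothetical helper)."""
    if n_samples < 10_000:
        return 10, 20      # small datasets
    if n_samples <= 1_000_000:
        return 20, 50      # medium datasets
    return 50, 100         # large datasets; more trees may still help recall
```

A reasonable use is to start at the low end of the range and increase n_trees only if measured recall is too low for the application.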

References

[1]

Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f=100
>>> n=1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...    v = [random.gauss(0, 1) for _ in range(f)]
...    idx.add_item(i, v)
>>> idx.build(10)
property compress_mode: Literal['zlib', 'gzip'] | None#

deserialize(byte, prefault=None)#

Restore the index from a serialized byte string.

Parameters:
bytebytes

Byte string produced by serialize. Both native (legacy) blobs and portable blobs (created with serialize(format='portable')) are accepted; portable and canonical blobs are auto-detected. Canonical blobs restore by rebuilding the index deterministically.

prefaultbool or None, optional, default=None

Accepted for API symmetry with load. If None, the stored prefault value is used. Ignored for canonical blobs.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If deserialization fails due to invalid or incompatible data.

RuntimeError

If the index is not initialized.

See also

serialize

Create a binary snapshot of the index.

on_disk_build

Configure on-disk build mode.

Notes

Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
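
To illustrate how such a header lets incompatible binaries fail loudly, here is a minimal sketch using the standard struct module. The byte layout, the magic value, and the function names are all assumptions made for this example; the library's real header format is not shown in this document.

```python
import struct
import sys

# Hypothetical layout, NOT the library's real header:
# magic, version, sizeof(int), sizeof(float), little-endian flag, metric, f
_HEADER = struct.Struct("<4sBBBB16sI")
_MAGIC = b"DEMO"


def _abi():
    """ABI facts of the running interpreter's C types."""
    return struct.calcsize("i"), struct.calcsize("f"), int(sys.byteorder == "little")


def wrap_portable(payload, *, metric, f, version=1):
    """Prefix a native payload with a compatibility header."""
    header = _HEADER.pack(_MAGIC, version, *_abi(), metric.encode("ascii"), f)
    return header + payload


def unwrap_portable(blob, *, metric, f, version=1):
    """Validate the header and return the native payload, or raise ValueError."""
    if len(blob) < _HEADER.size or blob[:4] != _MAGIC:
        raise ValueError("not a portable blob")
    _, ver, int_size, float_size, little, raw_metric, stored_f = _HEADER.unpack_from(blob)
    found = (ver, int_size, float_size, little,
             raw_metric.rstrip(b"\0").decode("ascii"), stored_f)
    expected = (version, *_abi(), metric, f)
    if found != expected:
        raise ValueError(f"incompatible blob: expected {expected}, found {found}")
    return blob[_HEADER.size:]
```

The payload itself is passed through untouched, which matches the note that a portable blob is still a native snapshot rather than a cross-architecture format.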

f#

Vector dimension.

Returns:
int

Dimension of each item vector. 0 means unknown / lazy.

Notes

  • Annoy(f=None, ...) is supported at construction time and is treated as f=0.

  • 0 (or None) means “unknown / lazy”: the first call to add_item will infer f from the input vector length and then fix it.

Changing f after the index has been initialized (items added and/or trees built) is a structural change: the stored items and all tree splits depend on the vector dimension.

For scikit-learn compatibility, assigning a different f (or None) on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.

feature_names_in_#

Input feature names seen during fit (SLEP007). Set only when explicitly provided via fit(…, feature_names=…).

fit(X=None, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None)#

Fit the Annoy index (scikit-learn compatible).

This method supports two deterministic workflows:

  1. Manual add/build: If X is None and y is None, fit() builds the forest using items previously added via add_item().

  2. Array-like X: If X is provided (2D array-like), fit() optionally resets or appends, adds all rows as items, then builds the forest.

Parameters:
Xarray-like of shape (n_samples, n_features), default=None

Vectors to add to the index. If None (and y is None), fit() only builds.

yarray-like of shape (n_samples,), default=None

Optional labels associated with X. Stored as y after successful build.

n_treesint, default=-1

Number of trees to build. Use -1 for Annoy’s internal default.

n_jobsint, default=-1

Number of threads to use during build (-1 means “auto”).

resetbool, default=True

If True, clear existing items before adding X. If False, append.

start_indexint or None, default=None

Item id for the first row of X. If None, uses 0 when reset=True, otherwise uses current n_items when reset=False.

missing_valuefloat or None, default=None

If not None, imputes missing entries in X.

  • Dense rows: replaces None elements with missing_value.

  • Dict rows: fills missing keys (and None values) with missing_value.

If None, missing entries raise an error (strict mode).
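
The missing_value policy can be sketched as below. The helper impute_row is hypothetical; it assumes that dict rows map feature positions 0..f-1 to values and that strict mode raises ValueError, neither of which is stated explicitly in this document.

```python
def impute_row(row, f, missing_value=None):
    """Hypothetical sketch of the documented missing_value handling."""
    if isinstance(row, dict):
        # Assumption: keys are feature positions; absent keys count as missing.
        values = [row.get(j) for j in range(f)]
    else:
        values = list(row)
    if len(values) != f:
        raise ValueError(f"expected {f} values, got {len(values)}")

    if missing_value is None:
        # Strict mode: any missing entry is an error.
        if any(v is None for v in values):
            raise ValueError("missing entries found and missing_value is None")
        return [float(v) for v in values]

    fill = float(missing_value)
    return [fill if v is None else float(v) for v in values]
```

For example, impute_row({0: 1, 2: None}, 3, -1) fills both the absent key 1 and the None at key 2 with -1.0.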

Returns:
Annoy

This instance (self), enabling method chaining.

See also

fit_transform

Estimator-style APIs.

transform

Query the built index.

add_item

Add one item at a time.

build

Build the forest after manual calls to add_item.

on_disk_build

Configure on-disk build mode.

unbuild

Remove trees so items can be appended.

y

Stored labels y (if provided).

get_params, set_params

Estimator parameter API.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx = AnnoyIndex().set_params(metric=m).fit(X)
...     print(m, idx.transform(q))
...
>>> idx = AnnoyIndex().fit(X)
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx_m = idx.rebuild(metric=m)  # rebuild-from-index
...     print(m, idx_m.transform(q))  # no .fit(X) here
fit_transform(X, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None, n_neighbors=None, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None)#

Fit the index and transform X in a single deterministic call.

This is equivalent to:

    self.fit(X, y=y, n_trees=…, n_jobs=…, reset=…, start_index=…, missing_value=…)
    self.transform(X, n_neighbors=…, search_k=…, include_distances=…, return_labels=…, y_fill_value=…, missing_value=…)

See also

fit

Build the index from X (preferred if you already have X available).

transform

Query the built index.

on_disk_build

Configure on-disk build mode.

Examples

>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     print(m, AnnoyIndex().set_params(metric=m).fit_transform(X))
classmethod from_bytes(data, *, f=None, metric=None, prefault=None)[source]#

Construct a new index and load it from serialized bytes.

Parameters:
data

Bytes produced by to_bytes (backend serialize).

f

Vector dimension for construction.

metric

Metric name for construction.

prefault

Forwarded to the backend deserialize if supported.

Returns:
index

Newly constructed index with the data loaded.

Raises:
TypeError

If data is not bytes-like.

ValueError

If f or metric is invalid.

AttributeError

If the backend does not provide deserialize.

Return type:

Self

Notes

Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.

If data was produced by to_bytes(format='native'), the parameters f and metric are required.

classmethod from_json(path, *, load=True)[source]#

Load metadata from JSON and construct an index.

Return type:

Self

classmethod from_low_level(obj, *, prefault=None)[source]#

Create a new Index from a low-level instance.

The new object is rebuilt by round-tripping through Annoy’s native serialize / deserialize to avoid sharing low-level state between two Python objects.

Parameters:
objscikitplot.cexternals._annoy.Annoy

Low-level Annoy instance.

prefaultbool or None, default=None

Prefault override passed to deserialize. If None, the value is taken from obj.get_params(deep=False) when available, otherwise it falls back to obj.prefault / destination defaults.

Returns:
indexIndex

Newly constructed high-level index.

Raises:
TypeError

If obj is not an Annoy instance.

RuntimeError

If serialization or deserialization fails, or required configuration (e.g., f) cannot be determined.

Return type:

Self

Notes

The implementation uses Annoy’s native serialization. It does not attempt to copy internal pointers or C++ state directly.

This method is deterministic. It always constructs a new index from the serialized payload; it does not share low-level state between objects.

classmethod from_metadata(metadata, *, load=True)[source]#

Construct an index from a metadata payload.

Parameters:
metadataMapping[str, Any]

Payload as produced by to_metadata.

loadbool, default=True

If True and params['on_disk_path'] is present, attempt to load the index into the returned object via backend load.

Returns:
indexSelf

Newly constructed index.

Raises:
TypeError

If input types are invalid.

ValueError

If required fields are missing or invalid.

RuntimeError

If schema version is missing on the class.

AttributeError

If backend set_params/load are missing when required.

Return type:

Self

classmethod from_yaml(path, *, load=True)[source]#

Load metadata from YAML and construct an index (requires PyYAML).

Return type:

Self

get_distance(i, j) → float#

Return the distance between two stored items.

Parameters:
i, jint

Item ids (index) of two stored samples.

Returns:
dfloat

Distance between items i and j under the current metric.

Raises:
RuntimeError

If the index is not initialized.

IndexError

If either index is out of range.

get_feature_names_out(input_features=None)#

Get output feature names for the transformer-style API.

Parameters:
input_featuressequence of str or None, optional, default=None

If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.

Returns:
tuple of str

Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.

Raises:
AttributeError

If called before fit/build.

ValueError

If input_features is provided but does not match feature_names_in_.
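
The naming scheme for the output features can be reproduced in one line. The helper neighbor_feature_names is hypothetical and only illustrates the documented pattern.

```python
def neighbor_feature_names(n_neighbors):
    """Build ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    return tuple(f"neighbor_{j}" for j in range(n_neighbors))
```

With n_neighbors=3 this gives ('neighbor_0', 'neighbor_1', 'neighbor_2'), one name per column produced by transform.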

get_item_vector(i) → list[float]#

Return the stored embedding vector for a given item id.

Parameters:
iint

Item id (index) previously passed to add_item.

Returns:
vectorlist[float]

Stored embedding of length f.

Raises:
RuntimeError

If the index is not initialized.

IndexError

If i is out of range.

get_item_vectors(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, return_ids=False, validate_vector_len=True)[source]#

Fetch many vectors as a dense NumPy array.

Parameters:
idssequence of int or iterable of int, optional

Ids to fetch. If None, selects range(start, stop or n_items).

dtypenumpy dtype, default=numpy.float32

Output dtype.

start, stopint, optional

Range selection used when ids is None.

n_rowsint, optional

Required when ids is a non-sized iterable (e.g., generator).

return_idsbool, default=False

If True, also return the realized ids (int64) in row order.

validate_vector_lenbool, default=True

If True, verify every fetched vector has length f.

Returns:
Xnumpy.ndarray of shape (n_rows, f)

Dense matrix of vectors.

ids_outnumpy.ndarray of shape (n_rows,), optional

Returned when return_ids=True.

Raises:
ValueError

If the id selection is inconsistent or vectors have unexpected length.

TypeError

If ids is a non-sized iterable and n_rows is not provided.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

to_numpy

Dense NumPy export alias.

iter_item_vectors

Streaming export without allocating a dense matrix.
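
The id selection rules for get_item_vectors can be sketched as follows. The helper resolve_ids is hypothetical, and treating a length mismatch against n_rows as a ValueError is an assumption drawn from the Raises section.

```python
from itertools import islice


def resolve_ids(n_items, ids=None, *, start=0, stop=None, n_rows=None):
    """Hypothetical sketch of the documented id selection."""
    if ids is None:
        # Documented rule: range(start, stop or n_items).
        return list(range(start, stop or n_items))

    if hasattr(ids, "__len__"):
        out = [int(i) for i in ids]
    else:
        # Generators and other non-sized iterables need an explicit row count.
        if n_rows is None:
            raise TypeError("n_rows is required when ids is a non-sized iterable")
        out = [int(i) for i in islice(ids, n_rows)]

    if n_rows is not None and len(out) != n_rows:
        raise ValueError("id selection is inconsistent with n_rows")
    return out
```

Knowing the row count up front is what allows the dense output matrix of shape (n_rows, f) to be allocated once before any vector is fetched.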

get_n_items() → int#

Return the number of stored items in the index.

Returns:
n_itemsint

Number of items that have been added and are currently addressable.

Raises:
RuntimeError

If the index is not initialized.

get_n_trees() → int#

Return the number of trees in the current forest.

Returns:
n_treesint

Number of trees that have been built.

Raises:
RuntimeError

If the index is not initialized.

get_nns_by_item(i, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a stored item id.

Parameters:
iint

Item id (index) previously passed to add_item(i, embedding).

nint

Number of nearest neighbours to return.

search_kint, optional, default=-1

Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distancesbool, optional, default=False

If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:
indiceslist[int] | tuple[list[int], list[float]]

If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:
RuntimeError

If the index is not initialized or has not been built.

IndexError

If i is out of range.

See also

get_nns_by_vector

Query with an explicit query embedding.

get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#

Return the n nearest neighbours for a query embedding.

Parameters:
vectorsequence of float

Query embedding of length f.

nint

Number of nearest neighbours to return.

search_kint, optional, default=-1

Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.

include_distancesbool, optional, default=False

If True, return a (indices, distances) tuple. Otherwise return only the list of indices.

Returns:
indiceslist[int] | tuple[list[int], list[float]]

If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).

Raises:
RuntimeError

If the index is not initialized or has not been built.

ValueError

If len(vector) != f.

See also

get_nns_by_item

Query by stored item id.
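
The search_k default used by both query methods can be written down as a small sketch. The helper effective_search_k is hypothetical, and the documentation describes the default only as approximately n_trees * n, so the exact product here is an approximation.

```python
def effective_search_k(n_trees, n, search_k=-1):
    """Approximate node budget for a query (hypothetical helper)."""
    # -1 means "use the default", documented as roughly n_trees * n.
    return n_trees * n if search_k == -1 else search_k
```

With 10 trees and n=5 the default budget is about 50 nodes; passing a larger search_k explicitly trades query time for recall.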

get_params(deep=True) → dict#

Return estimator-style parameters (scikit-learn compatibility).

Parameters:
deepbool, optional, default=True

Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.

Returns:
paramsdict

Dictionary of stable, user-facing parameters.

See also

set_params

Set estimator-style parameters.

schema_version

Controls pickle / snapshot strategy.

Notes

This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.

info(include_n_items=True, include_n_trees=True, include_memory=None) → dict#

Return a structured summary of the index.

This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.

Parameters:
include_n_itemsbool, optional, default=True

If True, include n_items.

include_n_treesbool, optional, default=True

If True, include n_trees.

include_memorybool or None, optional, default=None

Controls whether memory usage fields are included.

  • None: include memory usage only if the index is built.

  • True: include memory usage if available (built).

  • False: omit memory usage fields.

Memory usage is computed after build and may be expensive for very large indexes.

Returns:
infodict

Dictionary describing the current index state.

See also

serialize

Create a binary snapshot of the index.

deserialize

Restore from a binary snapshot.

save

Persist the index to disk.

load

Load the index from disk.

Notes

  • Some keys are optional depending on include_* flags.

Keys:

  • fint, default=0

    Dimensionality of the index.

  • metricstr, default=’angular’

    Distance metric name.

  • on_disk_pathstr, default=’’

    Path used for on-disk build, if configured.

  • prefaultbool, default=False

    If True, aggressively fault pages into memory when the index is loaded. Primarily useful on some platforms for very large indexes.

  • schema_versionint, default=0

    Stored schema/version marker on this object (reserved for future use).

  • seedint or None, optional, default=None

    Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

  • verboseint or None, optional, default=None

    Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging level inspired by gradient-boosting libraries:

    • <= 0 : quiet (warnings only)

    • 1 : info (Annoy’s verbose=True)

    • >= 2 : debug (currently same as info, reserved for future use)

Optional Keys:

  • n_itemsint

    Number of items currently stored.

  • n_treesint

    Number of built trees in the forest.

  • memory_usage_byteint

    Approximate memory usage in bytes. Present only when requested and available.

  • memory_usage_mibfloat

    Approximate memory usage in MiB. Present only when requested and available.
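
How the two memory keys relate to each other and to include_memory can be sketched as follows. The helper memory_fields is hypothetical; the key names come from the list above, and 1 MiB is taken as 1024 * 1024 bytes.

```python
def memory_fields(n_bytes, include_memory=None, built=False):
    """Hypothetical sketch of the optional memory keys in info()."""
    # False always omits the fields; None and True both require a built index.
    if include_memory is False or not built or n_bytes is None:
        return {}
    return {
        "memory_usage_byte": int(n_bytes),
        "memory_usage_mib": n_bytes / (1024 * 1024),
    }
```

Callers should therefore check for the presence of these keys rather than assume they exist.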

Examples

>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000
iter_item_vectors(ids=None, *, start=0, stop=None, with_ids=True, dtype=None)[source]#

Iterate vectors without allocating a dense matrix.

Parameters:
ids, start, stop

Selection controls. See get_item_vectors.

with_idsbool, default=True

If True, yield (id, vector). If False, yield vectors only.

dtypenumpy dtype, optional

If provided, cast output vectors to this dtype.

Yields:
(id, vector) or vector

Each vector is returned as a 1D NumPy array.

Return type:

Iterator[ndarray | tuple[int, ndarray]]

See also

get_item_vectors

Dense export.

kneighbors(X, n_neighbors=5, *, search_k=-1, include_distances=True, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='vector')[source]#

Find k nearest neighbors for one or more query vectors.

This is a sklearn-like convenience wrapper that returns rectangular arrays.

Parameters:
Xarray-like of shape (f,) or (n_queries, f)

Query vector(s).

n_neighborsint, default=5

Number of neighbors to return per query.

search_kint, default=-1

Search parameter forwarded to the backend.

include_distancesbool, default=True

If True, return (neighbors, distances). Otherwise return neighbors.

exclude_selfbool, default=False

If True, apply the same deterministic self-exclusion rule as query_by_vector for each query row.

exclude_item_idsiterable of int, optional

Exclude these ids for every query.

ensure_all_finitebool or ‘allow-nan’, default=True

Input validation option forwarded to scikit-learn.

copybool, default=False

Input validation option forwarded to scikit-learn.

output_type{‘item’, ‘vector’}, default=’vector’

If ‘item’, return neighbor ids. If ‘vector’, return neighbor vectors.

Returns:
neighborsnumpy.ndarray

If output_type='item', shape is (n_queries, n_neighbors). If output_type='vector', shape is (n_queries, n_neighbors, f).

distancesnumpy.ndarray of shape (n_queries, n_neighbors)

Neighbor distances. Returned when include_distances=True.

Raises:
sklearn.exceptions.NotFittedError

If the backend reports that the index is unbuilt.

ValueError

If n_neighbors <= 0 or any query yields too few neighbors after exclusions.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

query_by_vector

Per-query 1D interface.

kneighbors_graph

CSR kNN graph.
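
The rectangular-output requirement of kneighbors can be illustrated with a small sketch. The helper rectangular_neighbors is hypothetical and works on plain lists of candidate ids; it shows why a query that loses too many candidates to exclusions must raise rather than return a ragged row.

```python
def rectangular_neighbors(candidates, n_neighbors, exclude_item_ids=()):
    """Filter each query's candidate ids and enforce a fixed row width."""
    if n_neighbors <= 0:
        raise ValueError("n_neighbors must be positive")
    excluded = set(exclude_item_ids)
    rows = []
    for query, ids in enumerate(candidates):
        # Drop excluded ids, then keep the first n_neighbors in ranked order.
        kept = [i for i in ids if i not in excluded][:n_neighbors]
        if len(kept) < n_neighbors:
            raise ValueError(f"query {query} has only {len(kept)} neighbors")
        rows.append(kept)
    return rows
```

In practice this means requesting a few more candidates than n_neighbors from the backend whenever exclusions are in play.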

kneighbors_graph(X, n_neighbors=5, *, search_k=-1, mode='connectivity', exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='item')[source]#

Compute the k-neighbors graph (CSR) for query vectors.

Parameters:
Xarray-like of shape (f,) or (n_queries, f)

Query vector(s).

n_neighborsint, default=5

Number of neighbors per query.

search_kint, default=-1

Search parameter forwarded to the backend.

mode{‘connectivity’, ‘distance’}, default=’connectivity’

If ‘connectivity’, graph entries are 1. If ‘distance’, entries are backend distances.

exclude_selfbool, default=False

If True, apply the same deterministic self-exclusion rule as kneighbors for each query row.

exclude_item_idsiterable of int, optional

Exclude these ids for every query.

ensure_all_finitebool or ‘allow-nan’, default=True

Input validation option forwarded to scikit-learn.

copybool, default=False

Input validation option forwarded to scikit-learn.

output_type{‘item’}, default=’item’

Must be ‘item’ for CSR construction.

Returns:
graphscipy.sparse.csr_matrix

CSR matrix of shape (n_queries, n_items).

Raises:
ImportError

If SciPy is not installed.

ValueError

If mode is invalid or output_type != 'item'.

RuntimeError

If the backend returns an out-of-range neighbor id.

Return type:

Any

See also

kneighbors

Dense kNN results.
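
The CSR layout behind the k-neighbors graph can be sketched in plain Python. The helper knn_to_csr is hypothetical and returns the three CSR arrays as lists; it is not the library's implementation.

```python
def knn_to_csr(neighbors, distances=None, mode="connectivity"):
    """Build CSR (data, indices, indptr) lists from per-query neighbor ids."""
    if mode not in ("connectivity", "distance"):
        raise ValueError(f"invalid mode: {mode!r}")
    if mode == "distance" and distances is None:
        raise ValueError("distances are required when mode='distance'")

    data, indices, indptr = [], [], [0]
    for row, ids in enumerate(neighbors):
        indices.extend(ids)  # column index of each stored entry
        if mode == "connectivity":
            data.extend([1.0] * len(ids))
        else:
            data.extend(distances[row])
        indptr.append(len(indices))  # end offset of this row
    return data, indices, indptr
```

If SciPy is available, the triple can be passed as scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_queries, n_items)) to obtain a matrix of the documented shape.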

load(fn, prefault=None)#

Load (mmap) an index from disk into the current object.

Parameters:
fnstr

Path to a file previously created by save or on_disk_build.

prefaultbool or None, optional, default=None

If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If the file cannot be opened or mapped.

RuntimeError

If the index is not initialized or the file is incompatible.

See also

save

Save the current index to disk.

on_disk_build

Build directly using an on-disk backing file.

unload

Release mmap resources.

Notes

The in-memory index must have been constructed with the same dimension and metric as the on-disk file.

classmethod load_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#

Load a directory bundle created by save_bundle.

Parameters:
manifest_filename

Filename for the metadata manifest inside the directory.

index_filename

Filename for the Annoy index inside the directory.

prefault

Forwarded to load_index.

Returns:
index

Newly constructed index.

Raises:
AttributeError

If from_json is not available (compose with MetaMixin).

TypeError

If from_json returns an unexpected type.

OSError

On filesystem failures.

Parameters:
  • manifest_filename (str)

  • index_filename (str)

  • prefault (bool | None)

Return type:

Self

classmethod load_index(f, metric, path, *, prefault=None)[source]#

Construct a new index and load (mmap) an Annoy index file into it.

Parameters:
f

Vector dimension for construction.

metric

Metric name for construction.

pathstr or os.PathLike

Path to a file previously created by save_index or the backend save.

prefault

Forwarded to the backend. If None, the backend default is used.

Raises:
AttributeError

If the backend does not provide load(path, prefault=...).

OSError

If loading fails (backend or filesystem).

Return type:

Self

memory_usage() → int#

Approximate memory usage of the index in bytes.

Returns:
n_bytesint or None

Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.

Raises:
RuntimeError

If memory usage cannot be computed.

metric#

Distance metric for the index. Valid values:

  • ‘angular’ -> Cosine-like distance on normalized vectors.

  • ‘euclidean’ -> L2 distance.

  • ‘manhattan’ -> L1 distance.

  • ‘dot’ -> Negative dot-product distance (inner product).

  • ‘hamming’ -> Hamming distance for binary vectors.

Aliases (case-insensitive):

  • angular : cosine

  • euclidean : l2, lstsq

  • manhattan : l1, cityblock, taxicab

  • dot : @, ., dotproduct, inner, innerproduct

  • hamming : hamming

Returns:
str or None

Canonical metric name, or None if not configured yet.

Notes

Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.

For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
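
The alias table for metric can be turned into a case-insensitive lookup. The function canonical_metric is a hypothetical sketch built only from the aliases listed in this section; raising ValueError for unknown names is an assumption.

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Invert the table so every alias points at its canonical name.
_LOOKUP = {alias: canon for canon, aliases in _ALIASES.items() for alias in aliases}


def canonical_metric(name):
    """Map a metric name or alias to its canonical form (None stays None)."""
    if name is None:
        return None
    try:
        return _LOOKUP[str(name).strip().lower()]
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None
```

For example, canonical_metric("L2") returns 'euclidean' and canonical_metric("@") returns 'dot'.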

n_features#

Alias of f (dimension), provided for scikit-learn naming parity.

n_features_#

Read-only alias of n_features_in_.

n_features_in_#

Number of features seen during fit (scikit-learn compatible). Alias of f when available.

n_features_out_#

Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.

n_neighbors#

Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).

on_disk_build(fn)#

Configure the index to build using an on-disk backing file.

Parameters:
fnstr

Path to a file that will hold the index during build. The file is created or overwritten as needed.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Build trees after adding items (on-disk backed).

rebuild

Return a new Annoy index rebuilt from the current index contents.

fit

Build the index from X (preferred if you already have X available).

load

Memory-map the built index.

save

Persist the built index to disk.

Notes

This mode is useful for very large datasets that do not fit comfortably in RAM during construction.

on_disk_path#

Path used for on-disk build/load/save operations.

Returns:
str or None

Filesystem path used for on-disk operations, or None if not configured.

Notes

  • Assigning a string/PathLike to on_disk_path configures on-disk build mode (equivalent to calling on_disk_build with the same filename).

  • Annoy core truncates the target file when on-disk build is enabled, so assigning a path that points to an existing file overwrites it, exactly as on_disk_build would.

  • Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.

  • Clearing/changing this while an on-disk index is active is disallowed. Call unload first.

property pickle_mode: Literal['auto', 'disk', 'byte']#


plot_index(labels=None, *, ids=None, projection='pca', dims=(0, 1), center=True, maxabs=False, l2_normalize=False, dtype=<class 'numpy.float32'>, ax=None, title=None, plot_kwargs=None)[source]#

Plot this index as a 2D scatter plot.

This is a thin wrapper around plot_annoy_index that uses _plotting_backend.

Parameters:
labels, ids, projection, dims, center, maxabs, l2_normalize, dtype, ax, title, plot_kwargs

See plot_annoy_index.

Returns:
y2, ids_out, ax

See plot_annoy_index.

Return type:

tuple[ndarray, ndarray, Any]

See also

plot_annoy_index

Low-level plotting helper this method delegates to.

plot_knn_edges

Overlay kNN edges on the returned 2D coordinates.

Notes

  • This method does not mutate the index.

  • Plotting backends (e.g. Matplotlib) are imported lazily and are only required when this method is called.

  • The returned ids_out corresponds to the item id for each row in y2.

Examples

>>> import numpy as np
>>> import scikitplot.annoy as skann
>>> idx = skann.Index(f=10, metric="angular")
>>> # ... add items & build ...
>>> labels = np.zeros(idx.get_n_items(), dtype=int)
>>> y2, ids, ax = idx.plot_index(labels=labels, projection="pca")

plot_knn_edges(y2, *, ids=None, k=10, search_k=-1, ax=None, line_kwargs=None, undirected=True)[source]#

Overlay kNN edges onto an existing 2D index plot.

This is a thin wrapper around plot_annoy_knn_edges that uses _plotting_backend.

Parameters:
y2, ids, k, search_k, ax, line_kwargs, undirected

See plot_annoy_knn_edges.

Returns:
ax

The axes that were drawn on.

Return type:

Any

See also

plot_annoy_knn_edges

Low-level edge overlay helper this method delegates to.

plot_index

Computes the 2D coordinates used as input to this method.

Notes

  • y2 must represent 2D coordinates with shape (n_samples, 2).

  • If ids is provided, it must have length n_samples.

  • This method does not mutate the index; it only performs neighbor queries to draw edges.

Examples

>>> y2, ids, ax = idx.plot_index(labels=np.zeros(idx.get_n_items(), dtype=int))
>>> idx.plot_knn_edges(y2, ids=ids, k=5, line_kwargs={"alpha": 0.15})

prefault#

Default prefault flag stored on the object.

This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.

Returns:
bool

Current prefault flag.

Notes

  • This flag does not retroactively change already-loaded mappings.

query_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#

Query neighbors by stored item id.

Parameters:
itemint

Stored item id.

n_neighborsint

Number of neighbors to return after applying exclusions.

search_kint, default=-1

Search parameter forwarded to the backend.

include_distancesbool, default=False

If True, also return distances.

exclude_selfbool, default=False

If True, exclude item from the returned neighbors.

exclude_item_idsiterable of int, optional

Additional item ids to exclude.

ensure_all_finitebool or ‘allow-nan’, default=True

Input validation option forwarded to scikit-learn.

copybool, default=False

Input validation option forwarded to scikit-learn.

Returns:
indicesnumpy.ndarray of shape (n_neighbors,)

Neighbor ids.

(indices, distances)tuple of numpy.ndarray

Returned when include_distances=True.

Raises:
sklearn.exceptions.NotFittedError

If the backend reports that the index is unbuilt.

ValueError

If n_neighbors <= 0 or not enough neighbors remain after exclusions.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

query_by_vector

Query neighbors by an explicit vector.

kneighbors

Batch neighbor queries (sklearn-like).

Notes

Exclusions are applied deterministically in the order returned by the backend.

query_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#

Query neighbors by an explicit vector.

Parameters:
vectorarray-like of shape (f,)

Query vector.

n_neighborsint

Number of neighbors to return after exclusions.

search_kint, default=-1

Search parameter forwarded to the backend.

include_distancesbool, default=False

If True, also return distances.

exclude_selfbool, default=False

If True, exclude the first returned candidate whose distance is exactly 0.0. This is intended for queries where vector comes from the index itself.

exclude_item_idsiterable of int, optional

Additional item ids to exclude.

ensure_all_finitebool or ‘allow-nan’, default=True

Input validation option forwarded to scikit-learn.

copybool, default=False

Input validation option forwarded to scikit-learn.

Returns:
indicesnumpy.ndarray of shape (n_neighbors,)

Neighbor ids.

(indices, distances)tuple of numpy.ndarray

Returned when include_distances=True.

Raises:
sklearn.exceptions.NotFittedError

If the backend reports that the index is unbuilt.

ValueError

If n_neighbors <= 0, vector dimension mismatches f, or not enough neighbors remain after exclusions.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

query_by_item

Query neighbors by stored item id.

kneighbors

Batch neighbor queries (sklearn-like).

Notes

Exclusions are applied deterministically in the order returned by the backend. If exclude_self=True and no exact 0.0 distance candidate is returned in the first position, no additional self-exclusion is applied.
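The exclusion rules described above can be sketched in plain Python. This is an illustration under stated assumptions, not the wrapper's actual code: `filter_candidates` is a hypothetical helper, the candidate lists stand in for whatever the backend returns, and applying the self-exclusion check before the explicit id filter is an assumption.

```python
def filter_candidates(ids, distances, n_neighbors, *, exclude_self=False,
                      exclude_item_ids=None):
    """Filter backend candidates in backend order (hypothetical helper)."""
    if n_neighbors <= 0:
        raise ValueError("n_neighbors must be a positive integer")

    excluded = set(exclude_item_ids or ())
    pairs = list(zip(ids, distances))

    # Self-exclusion for vector queries: drop only a leading candidate whose
    # distance is exactly 0.0; otherwise leave the candidates untouched.
    if exclude_self and pairs and pairs[0][1] == 0.0:
        pairs = pairs[1:]

    # Explicit exclusions keep the backend's ordering of the survivors.
    kept = [(i, d) for i, d in pairs if i not in excluded]

    if len(kept) < n_neighbors:
        raise ValueError("not enough neighbors remain after exclusions")

    kept = kept[:n_neighbors]
    return [i for i, _ in kept], [d for _, d in kept]
```

A caller following this pattern would need to request more than n_neighbors candidates from the backend so that enough remain after filtering.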

query_vectors_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#

Query neighbor vectors by stored item id.

This is a convenience wrapper over query_by_item that materializes vectors using the backend’s get_item_vector.

Parameters:
item, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids

See query_by_item.

ensure_all_finite, copy

See query_by_vector.

dtypenumpy dtype, default=numpy.float32

Output dtype for the returned vectors.

output_type{‘item’, ‘vector’}, default=’vector’

If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.

Returns:
vectorsnumpy.ndarray of shape (n_neighbors, f)

Neighbor vectors.

(vectors, distances)tuple

Returned when include_distances=True.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

query_vectors_by_vector

Vector query returning vectors (or ids).

query_vectors_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#

Query neighbor vectors by an explicit vector.

Convenience wrapper over query_by_vector. By default it returns vectors; set output_type='item' to return neighbor ids instead.

Parameters:
vector, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids

See query_by_vector.

ensure_all_finite, copy

See query_by_vector.

dtypenumpy dtype, default=numpy.float32

Output dtype for the returned vectors.

output_type{‘item’, ‘vector’}, default=’vector’

If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.

Returns:
neighborsnumpy.ndarray

If output_type='vector', an array of shape (n_neighbors, f). If output_type='item', an array of shape (n_neighbors,).

(neighbors, distances)tuple

Returned when include_distances=True.

Return type:

ndarray | tuple[ndarray, ndarray]

See also

query_vectors_by_item

Item id query returning vectors.

query_by_vector

Per-query id interface.

random_state#

Alias of seed (scikit-learn convention).

rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) → Annoy#

Return a new Annoy index rebuilt from the current index contents.

This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.

Parameters:
metric{‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’} or None, optional

Metric for the new index. If None, reuse the current metric.

on_disk_pathpath-like or None, optional

Optional on-disk build path for the new index.

Safety: the source object’s on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source’s configured path, it is ignored to avoid accidental overwrite/truncation hazards.

n_treesint or None, optional

If provided, build the new index with this number of trees (or -1 for Annoy’s internal auto mode). If None, reuse the source’s tree count only when the source index is already built; otherwise do not build.

n_jobsint, optional, default=-1

Number of threads to use while building (-1 means “auto”).

Returns:
Annoy

A new Annoy instance containing the same items (and y metadata if present).

See also

build

Build trees after adding items (on-disk backed).

on_disk_build

Configure on-disk build mode.

fit

Build the index from X (preferred if you already have X available).

get_params

Read constructor parameters.

set_params

Update estimator parameters (use with fit(X) when refitting from data).

serialize, deserialize

Persist / restore indexes; canonical restores rebuild deterministically.

__sklearn_clone__

Unfitted clone hook (no fitted state).

Notes

rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.

Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).
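The on_disk_path safety rule and the n_trees fallback described for rebuild can be sketched as a small decision function. This is only an illustration of the documented rules: `plan_rebuild` and its arguments are hypothetical names and do not exist in the library.

```python
def plan_rebuild(source_path, source_built, source_n_trees, *,
                 on_disk_path=None, n_trees=None):
    """Return (target_path, trees_to_build) for a rebuild (hypothetical).

    trees_to_build is None when the new index should be left unbuilt.
    """
    # The source path is never inherited. A requested path that is
    # string-equal to the source's path is ignored to avoid truncating it.
    target_path = None
    if on_disk_path is not None:
        if source_path is None or str(on_disk_path) != str(source_path):
            target_path = str(on_disk_path)

    if n_trees is not None:
        trees_to_build = n_trees         # explicit request (-1 means auto)
    elif source_built:
        trees_to_build = source_n_trees  # reuse the source's tree count
    else:
        trees_to_build = None            # source unbuilt: do not build

    return target_path, trees_to_build
```

For example, `plan_rebuild("a.ann", True, 10, on_disk_path="a.ann")` yields no target path and reuses the ten trees of the source.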

repr_info(include_n_items=True, include_n_trees=True, include_memory=None) → str#

Return a dict-like string representation with optional extra fields.

Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory usage is only available once the index has been built.

save(fn, prefault=None)#

Persist the index to a binary file on disk.

Parameters:
fnstr

Path to the output file. Existing files will be overwritten.

prefaultbool or None, optional, default=None

If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
IOError

If the file cannot be written.

RuntimeError

If the index is not initialized or save fails.

See also

load

Load an index from disk.

on_disk_build

Configure on-disk build mode.

serialize

Snapshot to bytes for in-memory persistence.

deserialize

Restore an index from a serialized byte string.

Notes

The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.

save_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#

Save a directory bundle containing metadata + the index file.

The bundle contains:

  • manifest.json: metadata payload produced by to_json

  • index.ann: Annoy index produced by save_index

Parameters:
manifest_filename

Filename for the metadata manifest inside the directory.

index_filename

Filename for the Annoy index inside the directory.

prefault

Forwarded to save_index.

Raises:
AttributeError

If to_json is not available (compose with MetaMixin).

OSError

On filesystem failures.

Parameters:
  • manifest_filename (str)

  • index_filename (str)

  • prefault (bool | None)

Return type:

list[str]

save_index(path, *, prefault=None)[source]#

Persist the Annoy index to disk.

Parameters:
pathstr or os.PathLike

Destination path for the Annoy index file.

prefault

Forwarded to the backend. If None, the backend default is used.

Raises:
AttributeError

If the backend does not provide save(path, prefault=...).

OSError

For filesystem-level failures.

Return type:

Self

schema_version#

Sentinel value that marks the serialization/compatibility strategy.

This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.

Returns:
int

Current schema version marker.

Notes

  • 0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).

  • 2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).

  • >=3: pickle stores both portable and canonical; canonical is used as a fallback.
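The mapping in the notes above can be written down as a small function. This is a sketch of the documented table only: `pickle_strategy` is a hypothetical name, and rejecting negative values is an assumption, since the notes do not say how they are handled.

```python
def pickle_strategy(schema_version):
    """Return the snapshot kinds a pickle would store (hypothetical helper)."""
    if schema_version < 0:
        # Assumption: negative markers are treated as invalid here.
        raise ValueError("schema_version must be non-negative")
    if schema_version <= 1:
        return ("portable",)             # fast restore, ABI-checked
    if schema_version == 2:
        return ("canonical",)            # restores by rebuilding
    return ("portable", "canonical")     # canonical used as a fallback
```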

seed#

Random seed override (scikit-learn compatible). None means use Annoy default seed.

serialize(format=None) → bytes#

Serialize the built in-memory index into a byte string.

Parameters:
format{“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format.

  • “native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.

  • “portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.

  • “canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:
databytes

Opaque binary blob containing the Annoy index.

Raises:
RuntimeError

If the index is not initialized or serialization fails.

OverflowError

If the serialized payload is too large to fit in a Python bytes object.

See also

deserialize

Restore an index from a serialized byte string.

on_disk_build

Configure on-disk build mode.

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.

set_params(**params) → Annoy#

Set estimator-style parameters (scikit-learn compatibility).

Parameters:
**params

Keyword parameters to set. Unknown keys raise ValueError.

Returns:
Annoy

This instance (self), enabling method chaining.

Raises:
ValueError

If an unknown parameter name is provided.

TypeError

If parameter names are not strings or types are invalid.

See also

get_params

Return estimator-style parameters.

Notes

Changing structural parameters (notably metric) on an already initialized index resets the index deterministically (drops all items, trees, and y). Refit/rebuild is required before querying.

This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.

set_seed(seed=None)#

Set the random seed used for tree construction.

Parameters:
seedint or None, optional, default=None

Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.

  • If omitted (or None), the seed is set to Annoy’s default seed.

  • If 0, any pending override is cleared and the seed is reset to Annoy’s default seed (a UserWarning is emitted).

Returns:
Annoy

This instance (self), enabling method chaining.

See also

seed

Parameter attribute (int | None).

Notes

Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.
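The seed rules above can be sketched as a small normalization step. This is an illustration only: `normalize_seed` is a hypothetical helper, the warning text is invented, and the type and sign checks are assumptions about how invalid input might be rejected.

```python
import warnings


def normalize_seed(seed=None):
    """Return the seed override to store, or None for Annoy's default."""
    if seed is None:
        return None  # no override: use Annoy's default seed
    # Assumption: booleans and non-integers are rejected outright.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int or None")
    if seed < 0:
        raise ValueError("seed must be a non-negative integer")
    if seed == 0:
        # 0 clears any pending override and falls back to the default.
        warnings.warn(
            "seed=0 resets to Annoy's default seed",
            UserWarning,
            stacklevel=2,
        )
        return None
    return seed
```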

set_verbose(level=1)#

Set the verbosity level (callable setter).

This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.

Parameters:
levelint, optional, default=1

Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. The levels follow a convention borrowed from gradient-boosting libraries:

  • <= 0 : quiet (warnings only)

  • 1 : info (Annoy’s verbose=True)

  • >= 2 : debug (currently same as info, reserved for future use)

Returns:
Annoy

This instance (self), enabling method chaining.

See also

verbose

Parameter attribute (int | None).

set_verbosity

Alias of set_verbose.

get_params, set_params

Estimator parameter API.
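The clamping and level mapping described for set_verbose can be sketched as follows. `clamp_verbosity` is a hypothetical helper written for illustration; the returned tuple shape is an assumption, not part of the library's API.

```python
def clamp_verbosity(level=1):
    """Return (clamped_level, label, annoy_verbose) (hypothetical helper)."""
    level = max(-2, min(2, int(level)))  # clamp to the range [-2, 2]
    if level <= 0:
        label = "quiet"   # warnings only
    elif level == 1:
        label = "info"    # Annoy's verbose=True
    else:
        label = "debug"   # currently the same as info
    # Annoy's own logging is only switched on for level >= 1.
    return level, label, level >= 1
```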

set_verbosity(level=1)#

Alias of set_verbose.

See also

verbose

Parameter attribute (int | None).

set_verbose

Set the verbosity level (callable setter).

to_bytes(format=None)[source]#

Serialize the built index to bytes (backend serialize).

Parameters:
format{“native”, “portable”, “canonical”} or None, optional, default=None

Serialization format. If None, "canonical" is used.

  • “native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.

  • “portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.

  • “canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.

Returns:
data

Serialized index bytes.

Raises:
AttributeError

If the backend does not provide serialize.

RuntimeError

If serialization fails.

TypeError

If the backend returns non-bytes-like data.

Return type:

bytes

Notes

“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.

“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.

to_json(path=None, *, indent=2, sort_keys=True, ensure_ascii=False, include_info=True, strict=True)[source]#

Serialize to_metadata to JSON.

Parameters:
path

If provided, write the JSON to this path atomically.

indent

Indentation level passed to json.dumps.

sort_keys

If True, sort keys for stable output.

ensure_ascii

If True, escape non-ASCII characters.

include_info, strict

Forwarded to to_metadata.

Returns:
json_str

JSON representation of the metadata.

Raises:
TypeError

If the exported metadata contains non-JSON-serializable values.

Return type:

str
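The path parameter is documented as writing the JSON atomically. One common way to achieve that, shown here as a sketch rather than as the library's implementation, is to write to a temporary file in the same directory and then rename it over the target. `write_json_atomic` is a hypothetical helper name.

```python
import contextlib
import json
import os
import tempfile


def write_json_atomic(payload, path, *, indent=2, sort_keys=True,
                      ensure_ascii=False):
    """Dump payload to JSON and replace path in one step (hypothetical)."""
    # json.dumps raises TypeError for non-serializable values, before any
    # file is touched.
    text = json.dumps(payload, indent=indent, sort_keys=sort_keys,
                      ensure_ascii=ensure_ascii)

    # The temporary file must live on the same filesystem as the target
    # for os.replace to be atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return text
```

Readers of the target path then see either the old file or the complete new one, never a partially written manifest.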

to_metadata(*, include_info=True, strict=True)[source]#

Export a serializable metadata payload.

Parameters:
include_info

If True, include an info() mapping when available.

strict

If True, failures in optional info() propagation raise.

Returns:
metadata

A JSON/YAML-serializable mapping containing configuration parameters and optional info.

Raises:
RuntimeError

If _META_SCHEMA_VERSION is missing on the concrete class.

TypeError

If get_params does not return a mapping.

AttributeError

If neither the instance nor the backend implements get_params.

TypeError

If a persistence knob (e.g., pickle_mode) is not JSON/YAML-serializable.

Return type:

IndexMetadata

See also

to_json

to_yaml

to_numpy(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#

Export vectors to a dense NumPy array.

See also

get_item_vectors

Dense export with optional id output.

iter_item_vectors

Streaming export.

to_scipy_csr

Export as SciPy CSR.

to_pandas

Export as pandas DataFrame.

Notes

This is an alias of get_item_vectors with return_ids=False.

Return type:

ndarray

to_pandas(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, id_location='index', id_name='id', columns=None, validate_vector_len=True)[source]#

Export vectors to a pandas DataFrame.

Parameters:
ids, start, stop, n_rows

Selection controls. See get_item_vectors.

dtypenumpy dtype, default=numpy.float32

Output dtype.

id_location{‘index’, ‘column’, ‘both’, ‘none’}, default=’index’

Where to place ids in the output.

id_namestr, default=’id’

Name used for the id column / index.

columnssequence of str, optional

Column names for vector dimensions. If None, uses feature_names_in_ when present and length matches f; otherwise uses feature_0..feature_{f-1}.

validate_vector_lenbool, default=True

If True, verify every fetched vector has length f.

Returns:
dfpandas.DataFrame

DataFrame with shape (n_rows, f) plus optional id metadata.

Raises:
ImportError

If pandas is not installed.

ValueError

If id_location is invalid or columns length mismatches f.

Return type:

Any

See also

to_numpy

Dense NumPy export.

to_scipy_csr

Export as SciPy CSR.
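The column-naming rule described for the columns parameter of to_pandas can be sketched without pandas. `resolve_columns` is a hypothetical helper; it only mirrors the documented fallback order (explicit columns, then feature_names_in_ when its length matches f, then generated names).

```python
def resolve_columns(f, columns=None, feature_names_in=None):
    """Pick column names for an (n_rows, f) export (hypothetical helper)."""
    if columns is not None:
        columns = [str(c) for c in columns]
        if len(columns) != f:
            raise ValueError(
                f"columns has length {len(columns)}, expected {f}"
            )
        return columns
    # Stored feature names are used only when their length matches f.
    if feature_names_in is not None and len(feature_names_in) == f:
        return [str(name) for name in feature_names_in]
    # Fallback: feature_0 .. feature_{f-1}.
    return [f"feature_{i}" for i in range(f)]
```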

to_scipy_csr(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#

Export vectors as a SciPy CSR matrix.

Returns:
Xscipy.sparse.csr_matrix

CSR matrix with shape (n_rows, f).

Raises:
ImportError

If SciPy is not installed.

Return type:

Any

See also

to_numpy

Dense NumPy export.

to_pandas

Export as pandas DataFrame.

to_yaml(path=None, *, include_info=True, strict=True)[source]#

Serialize to_metadata to YAML (requires PyYAML).

Return type:

str

transform(X, *, n_neighbors=5, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None, input_type='vector', output_type='vector', exclude_self=False, exclude_items=None, missing_value=None)#

Transform queries into nearest-neighbor results (ids or vectors; optional distances / labels).

Parameters:
Xarray-like

Query inputs. The expected shape/type depends on input_type:

  • input_type=’item’ : X must be a 1D sequence of item ids.

  • input_type=’vector’: X must be a 2D array-like of shape (n_queries, f).

n_neighborsint or None, default=5

Number of neighbors to retrieve for each query. For backwards compatibility this keyword is accepted, but it must match the estimator parameter n_neighbors (STRICT schema).

search_kint, default=-1

Search parameter passed to Annoy (-1 uses Annoy’s default).

include_distancesbool, default=False

If True, also return per-neighbor distances.

return_labelsbool, default=False

If True, also return per-neighbor labels resolved from y (as set via fit).

y_fill_valueobject, default=None

Value used when y is unset or missing an entry for a neighbor id.

input_type{‘vector’, ‘item’}, default=’vector’

Controls how X is interpreted.

output_type{‘vector’, ‘item’}, default=’vector’

Controls what neighbors are returned.

  • output_type=’item’: return neighbor ids.

  • output_type=’vector’: return neighbor vectors.

exclude_selfbool, default=False

If True, exclude the query item id from results. Only valid when input_type=’item’.

exclude_itemssequence of int or None, default=None

Explicit neighbor ids to exclude from results.

missing_valuefloat or None, default=None

If not None, imputes missing entries in X (None values in dense rows; missing keys / None values in dict rows). If None, missing entries raise.

Returns:
neighborslist

Neighbor results for each query.

  • output_type=’item’ : list of list of int

  • output_type=’vector’: list of list of list of float

(neighbors, distances)tuple

Returned when include_distances=True.

(neighbors, labels)tuple

Returned when return_labels=True.

(neighbors, distances, labels)tuple

Returned when include_distances=True and return_labels=True.
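Because the return shape depends on two flags, callers need to unpack the result accordingly. The following sketch shows the four documented shapes with a hypothetical helper, `pack_transform_result`; it illustrates the documented contract and is not the library's code.

```python
def pack_transform_result(neighbors, distances, labels, *,
                          include_distances=False, return_labels=False):
    """Assemble the return value in the documented order (hypothetical)."""
    result = [neighbors]
    if include_distances:
        result.append(distances)
    if return_labels:
        result.append(labels)
    # A bare neighbors list is returned when no extras were requested.
    return result[0] if len(result) == 1 else tuple(result)


# Example data: one query with two neighbors.
neighbors = [[3, 7]]
distances = [[0.1, 0.4]]
labels = [["a", "b"]]

only_neighbors = pack_transform_result(neighbors, distances, labels)
with_both = pack_transform_result(
    neighbors, distances, labels,
    include_distances=True, return_labels=True,
)
```

Distances always precede labels in the tuple, so code that enables both flags can unpack three values in that order.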

See also

get_nns_by_item

Neighbor search by item id.

get_nns_by_vector

Neighbor search by query vector.

fit

Build the index from X (preferred if you already have X available).

fit_transform

Estimator-style APIs.

Notes

  • Excluding self is performed by matching neighbor ids to the query id (not by checking distance values).

  • For input_type=’vector’, exclude_self=True is an error; use exclude_items for explicit, deterministic filtering.

  • If exclusions prevent returning exactly n_neighbors results, this method raises ValueError.

Examples

Item queries (exclude the query id itself):

>>> idx.transform([10, 20], input_type='item', output_type='item', n_neighbors=5, exclude_self=True)

Vector queries (exclude explicit ids):

>>> idx.transform(X_query, input_type='vector', output_type='item', n_neighbors=5, exclude_items=[10, 20])

Return neighbor vectors:

>>> idx.transform([10], input_type='item', output_type='vector', n_neighbors=5, exclude_self=True)

unbuild()#

Discard the current forest, allowing new items to be added.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

build

Rebuild the forest after adding new items.

rebuild

Return a new Annoy index rebuilt from the current index contents.

fit

Build the index from X (preferred if you already have X available).

add_item

Add items (only valid when no trees are built).

Notes

After calling unbuild, you must call build again before running nearest-neighbor queries.

unload()#

Unmap any memory-mapped file backing this index.

Returns:
Annoy

This instance (self), enabling method chaining.

See also

load

Memory-map an on-disk index into this object.

on_disk_build

Configure on-disk build mode.

Notes

This releases OS-level resources associated with the mmap, but keeps the Python object alive.

verbose#

Verbosity level in [-2, 2] or None (unset). Callable setter: set_verbose().

y#

Labels / targets associated with the index items.

Notes

If provided to fit(X, y), labels are stored here after a successful build. You may also set this property manually. When possible, the setter enforces that len(y) matches the current number of items (n_items).