Annoy#
- class scikitplot.cexternals._annoy.Annoy#
Compiled with GCC/Clang. Using 512-bit AVX instructions.
Approximate Nearest Neighbors index (Annoy) with a small, lazy C-extension wrapper.
>>> Annoy(
...     f=None,
...     metric=None,
...     *,
...     n_neighbors=5,
...     on_disk_path=None,
...     prefault=None,
...     seed=None,
...     verbose=None,
...     schema_version=None,
... )
- Parameters:
- f : int or None, optional, default=None
Vector dimension. If 0 or None, the dimension may be inferred from the first vector passed to add_item (lazy mode). If None, treated as 0 (reset to default).
- metric : {'angular', 'cosine', 'euclidean', 'l2', 'lstsq', 'manhattan', 'l1', 'cityblock', 'taxicab', 'dot', '@', '.', 'dotproduct', 'inner', 'innerproduct', 'hamming'} or None, optional, default=None
Distance metric (one of 'angular', 'euclidean', 'manhattan', 'dot', 'hamming'). If omitted and f > 0, defaults to 'angular' (cosine-like). If omitted and f == 0, the metric may be set later before construction. If None, behavior depends on f:
If f > 0: defaults to 'angular' (legacy behavior; may emit a FutureWarning).
If f == 0: leaves the metric unset (lazy). You may set metric later before construction, or it will default to 'angular' on first add_item.
- n_neighbors : int, default=5
Number of neighbors to retrieve for each query. Must be a non-negative integer.
- on_disk_path : str or None, optional, default=None
If provided, configures the path for on-disk building. When the underlying index exists, this enables on-disk build mode (equivalent to calling on_disk_build with the same filename).
Note: Annoy core truncates the target file when enabling on-disk build. This wrapper treats on_disk_path as strictly equivalent to calling on_disk_build with the same filename (truncate allowed).
In lazy mode (f == 0 and/or metric is None), activation occurs once the underlying C++ index is created.
- prefault : bool or None, optional, default=None
If True, request page-faulting index pages into memory when loading (when supported by the underlying platform/backing). If None, treated as False (reset to default).
- seed : int or None, optional, default=None
Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created. Seed value 0 is treated as "use Annoy's deterministic default seed" (a UserWarning is emitted when 0 is explicitly provided).
- verbose : int or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy's verbose logging; level <= 0 disables it. Logging levels are inspired by gradient-boosting libraries:
<= 0: quiet (warnings only)
1: info (Annoy's verbose=True)
>= 2: debug (currently same as info, reserved for future use)
- schema_version : int, optional, default=None
Serialization/compatibility strategy marker.
This does not change the Annoy on-disk format, but it does control how the index is snapshotted in pickles.
0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable across ABIs; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical (canonical is used as a fallback if the ABI check fails).
If None, treated as 0 (reset to default).
- Attributes:
- f : int, default=0
Vector dimension.
- metric : {'angular', 'euclidean', 'manhattan', 'dot', 'hamming'}, default='angular'
Distance metric for the index.
- n_neighbors : int, default=5
Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).
- on_disk_path : str or None, optional, default=None
Path used for on-disk build/load/save operations.
- prefault : bool, default=False
Default prefault flag stored on the object.
- seed : int or None, optional, default=None
Random seed override (scikit-learn compatible).
- verbose : int or None, optional, default=None
Verbosity level in [-2, 2] or None (unset).
- schema_version : int, default=0
Serialization/compatibility strategy marker sentinel value.
- n_features : int
Alias of f (dimension), provided for scikit-learn naming parity.
- n_features_out_ : int
Number of output features produced by transform (SLEP013).
- feature_names_in_ : list-like
Input feature names seen during fit (SLEP007).
- y : dict | None, optional, default=None
Labels / targets associated with the index items.
See also
add_item : Add a vector to the index.
build : Build the forest after adding items.
unbuild : Remove trees to allow adding more items.
get_nns_by_item, get_nns_by_vector : Query nearest neighbours.
save, load : Persist the index to/from disk.
serialize, deserialize : Persist the index to/from bytes.
set_seed : Set the random seed deterministically.
verbose : Set verbosity level.
info : Return a structured summary of the current index.
Notes
Once the underlying C++ index is created, f and metric are immutable. This keeps the object consistent and avoids undefined behavior.
The C++ index is created lazily when sufficient information is available: when both f > 0 and metric are known, or when an operation that requires the index is first executed.
If f == 0, the dimensionality is inferred from the first non-empty vector passed to add_item and is then fixed for the lifetime of the index.
Assigning None to f is not supported. Use 0 for lazy inference (this matches Annoy(f=None, ...) at construction time).
If metric is omitted while f > 0, the current behavior defaults to 'angular' and may emit a FutureWarning. To avoid warnings and future behavior changes, always pass metric=... explicitly.
Items must be added before calling build. After build, the index becomes read-only; to add more items, call unbuild, add items again with add_item, then call build again.
Very large indexes can be built directly on disk with on_disk_build and then memory-mapped with load.
info returns a structured summary (dimension, metric, counts, and optional memory usage) suitable for programmatic inspection.
This wrapper stores user configuration (e.g., seed/verbosity) even before the C++ index exists and applies it deterministically upon construction.
Developer Notes:
Source of truth:
f (int) and metric_id (enum) describe configuration. ptr is NULL when the index is not constructed.
Invariant:
ptr != NULL implies f > 0 and metric_id != METRIC_UNKNOWN.
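The lazy-construction rule and the invariant above can be illustrated with a small pure-Python model. This is only a sketch: LazyIndexModel, its _ptr attribute, and the dict used as a stand-in for the C++ index are hypothetical names chosen for illustration and do not mirror the extension's internals.

```python
class LazyIndexModel:
    """Toy model of lazy index construction (hypothetical, not the real wrapper)."""

    def __init__(self, f=None, metric=None):
        self.f = 0 if f is None else int(f)  # None is treated as 0 (lazy)
        self.metric = metric                 # None means "not chosen yet"
        self._ptr = None                     # stand-in for the C++ index pointer
        self._maybe_construct()

    def _maybe_construct(self):
        # Construct only once both the dimension and the metric are known,
        # so that "_ptr is not None" implies f > 0 and a known metric.
        if self._ptr is None and self.f > 0 and self.metric is not None:
            self._ptr = {"f": self.f, "metric": self.metric, "items": {}}

    def add_item(self, i, vector):
        if len(vector) == 0:
            raise ValueError("cannot add an empty vector")
        if self.f == 0:
            self.f = len(vector)             # infer the dimension from the first vector
        if self.metric is None:
            self.metric = "angular"          # documented default on first add_item
        self._maybe_construct()
        if len(vector) != self.f:
            raise ValueError("vector length does not match f")
        self._ptr["items"][i] = [float(x) for x in vector]
        return self


lazy = LazyIndexModel()                      # nothing is known yet
constructed_before = lazy._ptr is not None   # False: still lazy
lazy.add_item(0, [1.0, 0.0, 0.0])            # fixes f=3 and metric='angular'
```

In this model the stand-in pointer only becomes non-None inside _maybe_construct, which is one simple way to keep the stated invariant true by construction.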
Examples
>>> from annoy import Annoy, AnnoyIndex
High-level API:
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index
The lifecycle follows the examples in test.ipynb:
Construct the index
>>> import random; random.seed(0)
>>> # from annoy import AnnoyIndex
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
>>> from scikitplot.annoy import Annoy, AnnoyIndex, Index
>>> idx = Annoy(f=3, metric="angular")
>>> idx.f, idx.metric
(3, 'angular')
If you pass f=0 the dimension can be inferred on the first call to add_item.
Add items
>>> idx.add_item(0, [1.0, 0.0, 0.0])
>>> idx.add_item(1, [0.0, 1.0, 0.0])
>>> idx.add_item(2, [0.0, 0.0, 1.0])
>>> idx.get_n_items()
3
Build the forest
>>> idx.build(n_trees=-1)
>>> idx.get_n_trees()
10
>>> idx.memory_usage()  # bytes
543076
After build the index becomes read-only. You can still query, save, load and serialize it.
Query neighbours
By stored item id:
>>> idx.get_nns_by_item(0, 5)
[0, 1, 2, ...]
With distances:
>>> idx.get_nns_by_item(0, 5, include_distances=True)
([0, 1, 2, ...], [0.0, 1.22, 1.26, ...])
Or by an explicit query vector:
>>> idx.get_nns_by_vector([0.1, 0.2, 0.3], 5, include_distances=True)
([103, 71, 160, 573, 672], [...])
Persistence
To work with memory-mapped indices on disk:
>>> idx.save("annoy_test.annoy")
>>> idx2 = Annoy(f=100, metric="angular")
>>> idx2.load("annoy_test.annoy")
>>> idx2.get_n_items()
1000
Or via raw bytes:
>>> buf = idx.serialize()
>>> new_idx = Annoy(f=100, metric="angular")
>>> new_idx.deserialize(buf)
>>> new_idx.get_n_items()
1000
You can release OS resources with unload and drop the current forest with unbuild.
- add_item(i, vector)#
Add a single embedding vector to the index.
- Parameters:
- i : int
Item id (index); must be non-negative. Ids may be non-contiguous; the index allocates up to max(i) + 1.
- vector : sequence of float
1D embedding of length f. Values are converted to float. If f == 0 and this is the first item, f is inferred from vector and then fixed for the lifetime of this index.
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
build : Build the forest after adding items.
unbuild : Remove trees to allow adding more items.
get_nns_by_item, get_nns_by_vector : Query nearest neighbours.
Notes
Items must be added before calling build. After building the forest, further calls to add_item are not supported.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)
- build(n_trees, n_jobs=-1)#
Build a forest of random projection trees.
- Parameters:
- n_trees : int
Number of trees in the forest. Larger values typically improve recall at the cost of slower build time and higher memory usage.
If set to n_trees=-1, trees are built dynamically until the index reaches approximately twice the number of items (_n_nodes >= 2 * n_items).
Guidelines:
Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.
- n_jobs : int, optional, default=-1
Number of threads to use while building. -1 means "auto" (use the implementation's default, typically all available CPU cores).
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
fit : Build the index from X (preferred if you already have X available).
add_item : Add vectors before building.
unbuild : Drop trees to add more items.
rebuild : Return a new Annoy index rebuilt from the current index contents.
on_disk_build : Configure on-disk build mode.
get_nns_by_item, get_nns_by_vector : Query nearest neighbours.
save, load : Persist the index to/from disk.
Notes
After build completes, the index becomes read-only for queries. To add more items, call unbuild, add items, and then rebuild.
References
[1] Erik Bernhardsson, "Annoy: Approximate Nearest Neighbours in C++/Python".
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)
>>> idx.build(10)
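The tree-count guidelines given for n_trees can be wrapped in a small helper when the dataset size is only known at run time. This is a minimal sketch: suggest_n_trees is a hypothetical name, it is not part of the library, and the upper bound returned for large datasets is only a starting point because the guideline itself is open-ended ("100+").

```python
def suggest_n_trees(n_samples):
    """Return a (low, high) tree-count range following the documented guidelines.

    Hypothetical helper, not part of the Annoy wrapper.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if n_samples < 10_000:
        return (10, 20)        # small datasets
    if n_samples <= 1_000_000:
        return (20, 50)        # medium datasets
    return (50, 100)           # large datasets; more trees may still help recall


low, high = suggest_n_trees(250_000)
```

A value from the suggested range would then be passed as n_trees to build, trading build time and memory for recall.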
- deserialize(byte, prefault=None)#
Restore the index from a serialized byte string.
- Parameters:
- byte : bytes
Byte string produced by serialize. Both native (legacy) blobs and portable blobs (created with serialize(format='portable')) are accepted; portable and canonical blobs are auto-detected. Canonical blobs restore by rebuilding the index deterministically.
- prefault : bool or None, optional, default=None
Accepted for API symmetry with load. If None, the stored prefault value is used. Ignored for canonical blobs.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If deserialization fails due to invalid or incompatible data.
- RuntimeError
If the index is not initialized.
See also
serialize : Create a binary snapshot of the index.
on_disk_build : Configure on-disk build mode.
Notes
Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
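The idea of a guard header that makes incompatible blobs fail loudly can be sketched with the standard struct module. The layout below is an assumption made purely for illustration (the magic bytes, field order, and field widths are hypothetical); it is not the wrapper's actual portable header format.

```python
import struct
import sys

# Hypothetical layout: magic, version, byte-order flag, sizeof(int32),
# sizeof(float32), metric id, f. Little-endian, no padding (14 bytes).
_HEADER = struct.Struct("<4sHBBBBI")
_MAGIC = b"DEMO"


def _byte_order_flag():
    return 1 if sys.byteorder == "little" else 2


def pack_header(metric_id, f, version=1):
    """Build the illustrative header for the current process."""
    return _HEADER.pack(_MAGIC, version, _byte_order_flag(), 4, 4, metric_id, f)


def check_header(blob, metric_id, f):
    """Return the payload after the header, or raise IOError on any mismatch."""
    if len(blob) < _HEADER.size:
        raise IOError("blob is too short to contain a header")
    magic, version, order, int_size, float_size, blob_metric, blob_f = (
        _HEADER.unpack_from(blob)
    )
    if magic != _MAGIC or version != 1:
        raise IOError("unknown header")
    if (order, int_size, float_size) != (_byte_order_flag(), 4, 4):
        raise IOError("ABI mismatch")
    if (blob_metric, blob_f) != (metric_id, f):
        raise IOError("metric or dimension mismatch")
    return blob[_HEADER.size:]  # the native payload follows the header


blob = pack_header(metric_id=0, f=100) + b"payload"
payload = check_header(blob, metric_id=0, f=100)
```

Checking every field before touching the payload is what lets a mismatched binary be rejected up front instead of being misread.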
- f#
Vector dimension.
- Returns:
- int
Dimension of each item vector. 0 means unknown / lazy.
Notes
Annoy(f=None, ...) is supported at construction time and is treated as f=0.
0 (or None) means "unknown / lazy": the first call to add_item will infer f from the input vector length and then fix it.
Changing f after the index has been initialized (items added and/or trees built) is a structural change: the stored items and all tree splits depend on the vector dimension.
For scikit-learn compatibility, assigning a different f (or None) on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
- feature_names_in_#
Input feature names seen during fit (SLEP007). Set only when explicitly provided via fit(…, feature_names=…).
- fit(X=None, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None)#
Fit the Annoy index (scikit-learn compatible).
This method supports two deterministic workflows:
Manual add/build: If X is None and y is None, fit() builds the forest using items previously added via add_item().
Array-like X: If X is provided (2D array-like), fit() optionally resets or appends, adds all rows as items, then builds the forest.
- Parameters:
- X : array-like of shape (n_samples, n_features), default=None
Vectors to add to the index. If None (and y is None), fit() only builds.
- y : array-like of shape (n_samples,), default=None
Optional labels associated with X. Stored as y after successful build.
- n_trees : int, default=-1
Number of trees to build. Use -1 for Annoy's internal default.
- n_jobs : int, default=-1
Number of threads to use during build (-1 means "auto").
- reset : bool, default=True
If True, clear existing items before adding X. If False, append.
- start_index : int or None, default=None
Item id for the first row of X. If None, uses 0 when reset=True, otherwise uses current n_items when reset=False.
- missing_value : float or None, default=None
If not None, imputes missing entries in X.
Dense rows: replaces None elements with missing_value.
Dict rows: fills missing keys (and None values) with missing_value.
If None, missing entries raise an error (strict mode).
- Returns:
- Annoy
This instance (self), enabling method chaining.
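The missing_value rules for dense and dict rows can be made concrete with a small pure-Python sketch. The helper name impute_row is hypothetical, and the assumption that dict rows are keyed by integer column index is made only for this illustration; the wrapper's actual handling may differ.

```python
def impute_row(row, f, missing_value=None):
    """Return a dense list of f floats from a list row or a {column: value} dict row.

    Hypothetical helper; dict rows are assumed to be keyed by column index.
    """
    if isinstance(row, dict):
        values = [row.get(j) for j in range(f)]  # absent keys become None
    else:
        values = list(row)
        if len(values) != f:
            raise ValueError("row length does not match f")
    out = []
    for j, value in enumerate(values):
        if value is None:
            if missing_value is None:
                # Strict mode: a missing entry is an error.
                raise ValueError(f"missing entry at column {j} (strict mode)")
            value = missing_value
        out.append(float(value))
    return out


dense = impute_row([1.0, None, 3.0], f=3, missing_value=0.0)
sparse = impute_row({0: 1.0, 2: None}, f=3, missing_value=-1.0)
```

With missing_value=None the same inputs raise instead of being filled, which corresponds to the strict mode described above.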
See also
fit_transform : Estimator-style APIs.
transform : Query the built index.
add_item : Add one item at a time.
build : Build the forest after manual calls to add_item.
on_disk_build : Configure on-disk build mode.
unbuild : Remove trees so items can be appended.
y : Stored labels y (if provided).
get_params, set_params : Estimator parameter API.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx = AnnoyIndex().set_params(metric=m).fit(X)
...     print(m, idx.transform(q))
...
>>> base = AnnoyIndex().fit(X)
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx_m = base.rebuild(metric=m)  # rebuild-from-index
...     print(m, idx_m.transform(q))  # no .fit(X) here
- fit_transform(X, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None, n_neighbors=None, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None)#
Fit the index and transform X in a single deterministic call.
This is equivalent to:
self.fit(X, y=y, n_trees=…, n_jobs=…, reset=…, start_index=…, missing_value=…)
self.transform(X, n_neighbors=…, search_k=…, include_distances=…, return_labels=…, y_fill_value=…, missing_value=…)
See also
fit : Build the index from X (preferred if you already have X available).
transform : Query the built index.
on_disk_build : Configure on-disk build mode.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     print(m, AnnoyIndex().set_params(metric=m).fit_transform(q))
- get_distance(i, j) -> float#
Return the distance between two stored items.
- Parameters:
- i, j : int
Item ids (index) of two stored samples.
- Returns:
- d : float
Distance between items i and j under the current metric.
- Raises:
- RuntimeError
If the index is not initialized.
- IndexError
If either index is out of range.
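For intuition about the values get_distance reports, three of the metrics can be written out in plain Python. This is a sketch under assumptions: the angular formula follows the upstream Annoy convention of the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos(u, v))), and the handling of zero vectors here is an assumption. The function names are illustrative and not part of the wrapper.

```python
import math


def angular_distance(u, v):
    """Euclidean distance between the normalized vectors: sqrt(2 * (1 - cos))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return math.sqrt(2.0)  # assumption: treat a zero vector as orthogonal
    cos = dot / (norm_u * norm_v)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))  # clamp rounding noise


def euclidean_distance(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


d_orthogonal = angular_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
d_l2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_l1 = manhattan_distance([0.0, 0.0], [3.0, 4.0])
```

Under this convention the angular distance ranges from 0 for parallel vectors to 2 for opposite ones, with orthogonal vectors at sqrt(2).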
- get_feature_names_out(input_features=None)#
Get output feature names for the transformer-style API.
- Parameters:
- input_features : sequence of str or None, optional, default=None
If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.
- Returns:
- tuple of str
Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.
- Raises:
- AttributeError
If called before fit/build.
- ValueError
If input_features is provided but does not match feature_names_in_.
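The naming scheme for output features is simple enough to reproduce directly. The helper below is a hypothetical stand-alone sketch of that scheme, not the method's implementation, and it skips the validation against feature_names_in_.

```python
def neighbor_feature_names(n_neighbors):
    """Build ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    if n_neighbors < 0:
        raise ValueError("n_neighbors must be non-negative")
    return tuple(f"neighbor_{j}" for j in range(n_neighbors))


names = neighbor_feature_names(3)
```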
- get_item_vector(i) -> list[float]#
Return the stored embedding vector for a given item id.
- Parameters:
- i : int
Item id (index) previously passed to add_item.
- Returns:
- vector : list[float]
Stored embedding of length f.
- Raises:
- RuntimeError
If the index is not initialized.
- IndexError
If i is out of range.
- get_n_items() -> int#
Return the number of stored items in the index.
- Returns:
- n_items : int
Number of items that have been added and are currently addressable.
- Raises:
- RuntimeError
If the index is not initialized.
- get_n_trees() -> int#
Return the number of trees in the current forest.
- Returns:
- n_trees : int
Number of trees that have been built.
- Raises:
- RuntimeError
If the index is not initialized.
- get_nns_by_item(i, n, search_k=-1, include_distances=False)#
Return the n nearest neighbours for a stored item id.
- Parameters:
- i : int
Item id (index) previously passed to add_item(i, embedding).
- n : int
Number of nearest neighbours to return.
- search_k : int, optional, default=-1
Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional, default=False
If True, return a (indices, distances) tuple. Otherwise return only the list of indices.
- Returns:
- indices : list[int] | tuple[list[int], list[float]]
If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).
- Raises:
- RuntimeError
If the index is not initialized or has not been built.
- IndexError
If i is out of range.
See also
get_nns_by_vector : Query with an explicit query embedding.
- get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#
Return the n nearest neighbours for a query embedding.
- Parameters:
- vector : sequence of float
Query embedding of length f.
- n : int
Number of nearest neighbours to return.
- search_k : int, optional, default=-1
Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional, default=False
If True, return a (indices, distances) tuple. Otherwise return only the list of indices.
- Returns:
- indices : list[int] | tuple[list[int], list[float]]
If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).
- Raises:
- RuntimeError
If the index is not initialized or has not been built.
- ValueError
If len(vector) != f.
See also
get_nns_by_item : Query by stored item id.
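Because the index returns approximate results, an exact baseline is useful for checking recall and for seeing the two return shapes side by side. The sketch below is a brute-force Euclidean search over a plain dict; it is not Annoy's tree-based algorithm, and brute_force_nns is a hypothetical name.

```python
import math


def brute_force_nns(items, vector, n, include_distances=False):
    """Exact Euclidean nearest neighbours over a {item_id: vector} dict."""
    if not items:
        return ([], []) if include_distances else []
    f = len(next(iter(items.values())))
    if len(vector) != f:
        raise ValueError("len(vector) != f")
    # Score every stored item, then keep the n closest (ties broken by id).
    scored = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(stored, vector))), item_id)
        for item_id, stored in items.items()
    )[:n]
    indices = [item_id for _, item_id in scored]
    if include_distances:
        return indices, [distance for distance, _ in scored]
    return indices


items = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0]}
ids_only = brute_force_nns(items, [0.9, 0.0], 2)
ids, dists = brute_force_nns(items, [0.0, 0.0], 2, include_distances=True)
```

Comparing such exact ids with the ids returned by the index for the same query gives a direct recall estimate when tuning search_k or the number of trees.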
- get_params(deep=True) -> dict#
Return estimator-style parameters (scikit-learn compatibility).
- Parameters:
- deep : bool, optional, default=True
Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.
- Returns:
- params : dict
Dictionary of stable, user-facing parameters.
See also
set_params : Set estimator-style parameters.
schema_version : Controls pickle / snapshot strategy.
Notes
This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.
- info(include_n_items=True, include_n_trees=True, include_memory=None) -> dict#
Return a structured summary of the index.
This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.
- Parameters:
- include_n_items : bool, optional, default=True
If True, include n_items.
- include_n_trees : bool, optional, default=True
If True, include n_trees.
- include_memory : bool or None, optional, default=None
Controls whether memory usage fields are included.
None: include memory usage only if the index is built.
True: include memory usage if available (built).
False: omit memory usage fields.
Memory usage is computed after build and may be expensive for very large indexes.
- Returns:
- info : dict
Dictionary describing the current index state.
See also
serialize : Create a binary snapshot of the index.
deserialize : Restore from a binary snapshot.
save : Persist the index to disk.
load : Load the index from disk.
Notes
Some keys are optional depending on include_* flags.
Keys:
- f : int, default=0
Dimensionality of the index.
- metric : str, default='angular'
Distance metric name.
- on_disk_path : str, default=''
Path used for on-disk build, if configured.
- prefault : bool, default=False
If True, aggressively fault pages into memory during save. Primarily useful on some platforms for very large indexes.
- schema_version : int, default=0
Stored schema/version marker on this object (reserved for future use).
- seed : int or None, optional, default=None
Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.
- verbose : int or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy's verbose logging; level <= 0 disables it. Logging levels are inspired by gradient-boosting libraries:
<= 0: quiet (warnings only)
1: info (Annoy's verbose=True)
>= 2: debug (currently same as info, reserved for future use)
Optional Keys:
- n_items : int
Number of items currently stored.
- n_trees : int
Number of built trees in the forest.
- memory_usage_byte : int
Approximate memory usage in bytes. Present only when requested and available.
- memory_usage_mib : float
Approximate memory usage in MiB. Present only when requested and available.
Examples
>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000
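The two memory keys describe the same quantity in different units. Assuming memory_usage_mib is simply the byte count divided by 2**20 (the definition of a mebibyte; any rounding the wrapper applies is not documented here), the conversion looks like this, using a hypothetical helper name:

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / (1024 * 1024)


# 543076 is the byte count shown in the memory_usage() example for this class.
mib = bytes_to_mib(543076)
```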
- load(fn, prefault=None)#
Load (mmap) an index from disk into the current object.
- Parameters:
- fn : str
Path to a file previously created by save or on_disk_build.
- prefault : bool or None, optional, default=None
If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If the file cannot be opened or mapped.
- RuntimeError
If the index is not initialized or the file is incompatible.
See also
save : Save the current index to disk.
on_disk_build : Build directly using an on-disk backing file.
unload : Release mmap resources.
Notes
The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
- memory_usage() -> int#
Approximate memory usage of the index in bytes.
- Returns:
- n_bytes : int or None
Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.
- Raises:
- RuntimeError
If memory usage cannot be computed.
- metric#
Distance metric for the index. Valid values:
‘angular’ -> Cosine-like distance on normalized vectors.
‘euclidean’ -> L2 distance.
‘manhattan’ -> L1 distance.
‘dot’ -> Negative dot-product distance (inner product).
‘hamming’ -> Hamming distance for binary vectors.
Aliases (case-insensitive):
angular : cosine
euclidean : l2, lstsq
manhattan : l1, cityblock, taxicab
dot : @, ., dotproduct, inner, innerproduct
hamming : hamming
- Returns:
- str or None
Canonical metric name, or None if not configured yet.
Notes
Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.
For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
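The alias table for metric can be expressed as a lookup that maps any accepted spelling to its canonical name. This is a minimal sketch built only from the aliases listed for this property; canonical_metric is a hypothetical helper, and raising ValueError for unknown names is an assumption about error handling rather than documented behavior.

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Invert the table once: alias -> canonical name.
_CANONICAL = {
    alias: name for name, aliases in _ALIASES.items() for alias in aliases
}


def canonical_metric(metric):
    """Map an alias to its canonical name; None stays None (not configured yet)."""
    if metric is None:
        return None
    try:
        return _CANONICAL[str(metric).lower()]  # aliases are case-insensitive
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None


resolved = [canonical_metric(m) for m in ("L2", "@", "cosine", "taxicab")]
```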
- n_features#
Alias of f (dimension), provided for scikit-learn naming parity.
- n_features_#
Read-only alias of n_features_in_.
- n_features_in_#
Number of features seen during fit (scikit-learn compatible). Alias of f when available.
- n_features_out_#
Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.
- n_neighbors#
Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).
- on_disk_build(fn)#
Configure the index to build using an on-disk backing file.
- Parameters:
- fn : str
Path to a file that will hold the index during build. The file is created or overwritten as needed.
- Returns:
- Annoy
This instance (self), enabling method chaining.
Notes
This mode is useful for very large datasets that do not fit comfortably in RAM during construction.
- on_disk_path#
Path used for on-disk build/load/save operations.
- Returns:
- str or None
Filesystem path used for on-disk operations, or None if not configured.
Notes
Assigning a string/PathLike to on_disk_path configures on-disk build mode (equivalent to calling on_disk_build with the same filename).
Note: Annoy core truncates the target file when enabling on-disk build. on_disk_path is strictly equivalent to calling on_disk_build with the same filename (truncate allowed).
Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.
Clearing/changing this while an on-disk index is active is disallowed. Call unload first.
- prefault#
Default prefault flag stored on the object.
This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.
- Returns:
- bool
Current prefault flag.
Notes
This flag does not retroactively change already-loaded mappings.
- random_state#
Alias of seed (scikit-learn convention).
- rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) -> Annoy#
Return a new Annoy index rebuilt from the current index contents.
This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.
- Parameters:
- metric : {'angular', 'euclidean', 'manhattan', 'dot', 'hamming'} or None, optional
Metric for the new index. If None, reuse the current metric.
- on_disk_path : path-like or None, optional
Optional on-disk build path for the new index.
Safety: the source object's on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source's configured path, it is ignored to avoid accidental overwrite/truncation hazards.
- n_trees : int or None, optional
If provided, build the new index with this number of trees (or -1 for Annoy's internal auto mode). If None, reuse the source's tree count only when the source index is already built; otherwise do not build.
- n_jobs : int, optional, default=-1
Number of threads to use while building (-1 means "auto").
- Returns:
- Annoy
A new index rebuilt from the current index contents.
See also
build : Build trees after adding items (on-disk backed).
on_disk_build : Configure on-disk build mode.
fit : Build the index from X (preferred if you already have X available).
get_params : Read constructor parameters.
set_params : Update estimator parameters (use with fit(X) when refitting from data).
serialize, deserialize : Persist / restore indexes; canonical restores rebuild deterministically.
__sklearn_clone__ : Unfitted clone hook (no fitted state).
Notes
rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.
Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).
- repr_info(include_n_items=True, include_n_trees=True, include_memory=None) -> str#
Return a dict-like string representation with optional extra fields.
Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory is calculated after build.
- save(fn, prefault=None)#
Persist the index to a binary file on disk.
- Parameters:
- fn : str
Path to the output file. Existing files will be overwritten.
- prefault : bool or None, optional, default=None
If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If the file cannot be written.
- RuntimeError
If the index is not initialized or save fails.
See also
load : Load an index from disk.
on_disk_build : Configure on-disk build mode.
serialize : Snapshot to bytes for in-memory persistence.
deserialize : Restore an index from a serialized byte string.
Notes
The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.
- schema_version#
Serialization/compatibility strategy marker sentinel value.
This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.
- Returns:
- int
Current schema version marker.
Notes
0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical; canonical is used as a fallback.
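The mapping from schema_version to the snapshots stored in a pickle can be summarized as a small function. This is an illustrative sketch: pickle_strategy is a hypothetical name, treating None as 0 follows the constructor documentation, and rejecting negative values is an assumption.

```python
def pickle_strategy(schema_version=None):
    """Return the snapshot kinds a pickle would store for a schema_version."""
    version = 0 if schema_version is None else int(schema_version)
    if version < 0:
        raise ValueError("schema_version must be non-negative")  # assumption
    if version in (0, 1):
        return ("portable-v1",)                 # fast restore, ABI-checked
    if version == 2:
        return ("canonical-v1",)                # restores by rebuilding
    return ("portable-v1", "canonical-v1")      # canonical is the fallback


strategies = {v: pickle_strategy(v) for v in (None, 1, 2, 3)}
```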
- seed#
Random seed override (scikit-learn compatible). None means use Annoy default seed.
- serialize(format=None) -> bytes#
Serialize the built in-memory index into a byte string.
- Parameters:
- format : {'native', 'portable', 'canonical'} or None, optional, default=None
Serialization format.
“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.
- Returns:
- data : bytes
Opaque binary blob containing the Annoy index.
- Raises:
- RuntimeError
If the index is not initialized or serialization fails.
- OverflowError
If the serialized payload is too large to fit in a Python bytes object.
See also
deserialize : Restore an index from a serialized byte string.
on_disk_build : Configure on-disk build mode.
Notes
"Portable" blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.
"Canonical" blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.
- set_params(**params) -> Annoy#
Set estimator-style parameters (scikit-learn compatibility).
- Parameters:
- **params
Keyword parameters to set. Unknown keys raise ValueError.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- ValueError
If an unknown parameter name is provided.
- TypeError
If parameter names are not strings or types are invalid.
See also
get_params : Return estimator-style parameters.
Notes
Changing structural parameters (notably metric) on an already initialized index resets the index deterministically (drops all items, trees, and y). Refit/rebuild is required before querying.
This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.
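The reset-on-structural-change contract can be illustrated with a toy estimator. This is a pure-Python sketch with hypothetical names (ParamsModel, _STRUCTURAL) and only a subset of the real parameters; it shows the contract, not the wrapper's code.

```python
class ParamsModel:
    """Toy estimator showing the set_params contract (hypothetical)."""

    _PARAMS = ("f", "metric", "n_neighbors", "seed")
    _STRUCTURAL = ("f", "metric")

    def __init__(self, f=0, metric=None, n_neighbors=5, seed=None):
        self.f, self.metric = f, metric
        self.n_neighbors, self.seed = n_neighbors, seed
        self.items = {}  # stands in for stored vectors and trees

    def get_params(self, deep=True):
        # deep is accepted for API compatibility; there are no nested estimators.
        return {name: getattr(self, name) for name in self._PARAMS}

    def set_params(self, **params):
        for name in params:
            if name not in self._PARAMS:
                raise ValueError(f"unknown parameter: {name!r}")
        for name, value in params.items():
            if name in self._STRUCTURAL and value != getattr(self, name):
                self.items.clear()  # structural change: drop fitted state
            setattr(self, name, value)
        return self


model = ParamsModel(f=2, metric="angular")
model.items[0] = [1.0, 0.0]
model.set_params(n_neighbors=3)           # non-structural: items are kept
kept = dict(model.items)
model.set_params(metric="euclidean")      # structural: items are dropped
```

Validating all names before applying any value keeps the object unchanged when an unknown key is passed.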
- set_seed(seed=None)#
Set the random seed used for tree construction.
- Parameters:
- seedint or None, optional, default=None
Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created. Seed value
0resets to Annoy’s core default seed (with aUserWarning).If omitted (or None, NULL), the seed is set to Annoy’s default seed.
If 0, clear any pending override and reset to Annoy’s default seed (a
UserWarningis emitted).
- Returns:
AnnoyThis instance (self), enabling method chaining.
See also
seed : Parameter attribute (int | None).
Notes
Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.
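The seed rules above can be summarized in a few lines of plain Python. This is a minimal sketch, assuming a hypothetical helper `normalize_seed` and a placeholder `DEFAULT_SEED`. The exception types for invalid input are assumptions, since this section only documents the `UserWarning` for `0`.

```python
import warnings

DEFAULT_SEED = None  # placeholder meaning "use Annoy's built-in default seed"


def normalize_seed(seed=None):
    """Map a user-supplied seed to the value that would be applied.

    Hypothetical helper, not part of the wrapper's API.
    """
    if seed is None:
        return DEFAULT_SEED
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int or None")  # assumed error type
    if seed < 0:
        raise ValueError("seed must be non-negative")   # assumed error type
    if seed == 0:
        # Documented behavior: 0 resets to the default and warns.
        warnings.warn("seed=0 resets to the default seed", UserWarning)
        return DEFAULT_SEED
    return seed
```

Calling `normalize_seed(42)` returns `42`, while `normalize_seed(0)` returns the default placeholder and emits a `UserWarning`.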
- set_verbose(level=1)#
Set the verbosity level (callable setter).
This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.
- Parameters:
- level : int, optional, default=1
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. The levels are modeled on gradient-boosting libraries:
<= 0: quiet (warnings only)
1: info (Annoy’s verbose=True)
>= 2: debug (currently same as info, reserved for future use)
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
verbose : Parameter attribute (int | None).
set_verbosity : Alias of set_verbose.
get_params, set_params : Estimator parameter API.
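The clamping rule for set_verbose is easy to check by hand. The sketch below uses a hypothetical helper `clamp_verbosity` (not part of the wrapper) that returns the clamped level together with a flag saying whether Annoy's verbose logging would be enabled.

```python
def clamp_verbosity(level=1):
    """Clamp level to [-2, 2] and report whether verbose logging is on."""
    level = max(-2, min(2, int(level)))
    return level, level >= 1


# Out-of-range inputs are pulled back to the nearest bound.
examples = [clamp_verbosity(v) for v in (-5, 0, 1, 7)]
# examples == [(-2, False), (0, False), (1, True), (2, True)]
```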
- set_verbosity(level=1)#
Alias of set_verbose.
See also
verbose : Parameter attribute (int | None).
set_verbose : Set the verbosity level (callable setter).
- transform(X, *, n_neighbors=5, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None, input_type='vector', output_type='vector', exclude_self=False, exclude_items=None, missing_value=None)#
Transform queries into nearest-neighbor results (ids or vectors; optional distances / labels).
- Parameters:
- X : array-like
Query inputs. The expected shape/type depends on input_type:
input_type='item': X must be a 1D sequence of item ids.
input_type='vector': X must be a 2D array-like of shape (n_queries, f).
- n_neighbors : int or None, default=5
Number of neighbors to retrieve for each query. For backwards compatibility this keyword is accepted, but it must match the estimator parameter n_neighbors (STRICT schema).
- search_k : int, default=-1
Search parameter passed to Annoy (-1 uses Annoy’s default).
- include_distances : bool, default=False
If True, also return per-neighbor distances.
- return_labels : bool, default=False
If True, also return per-neighbor labels resolved from y (as set via fit).
- y_fill_value : object, default=None
Value used when y is unset or missing an entry for a neighbor id.
- input_type : {'vector', 'item'}, default='vector'
Controls how X is interpreted.
- output_type : {'vector', 'item'}, default='vector'
Controls what neighbors are returned.
output_type='item': return neighbor ids.
output_type='vector': return neighbor vectors.
- exclude_self : bool, default=False
If True, exclude the query item id from results. Only valid when input_type='item'.
- exclude_items : sequence of int or None, default=None
Explicit neighbor ids to exclude from results.
- missing_value : float or None, default=None
If not None, imputes missing entries in X (None values in dense rows; missing keys / None values in dict rows). If None, missing entries raise.
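As an illustration of the missing_value rule, the following sketch imputes one query row at a time. It is not the wrapper's code: `impute_row` is a hypothetical helper, dict rows are assumed to map a dimension index to a value, and `ValueError` is an assumed exception type (the text above only says that missing entries raise).

```python
def impute_row(row, f, missing_value=None):
    """Return a dense list of f floats from a dense or dict row."""
    if isinstance(row, dict):
        # Missing keys become None so they are handled like explicit None.
        values = [row.get(i) for i in range(f)]
    else:
        values = list(row)
        if len(values) != f:
            raise ValueError(f"expected {f} values, got {len(values)}")
    out = []
    for i, v in enumerate(values):
        if v is None:
            if missing_value is None:
                raise ValueError(f"missing entry at dimension {i}")
            v = missing_value
        out.append(float(v))
    return out


dense = impute_row([1.0, None, 3.0], f=3, missing_value=0.0)
sparse = impute_row({0: 2.0, 2: None}, f=3, missing_value=-1.0)
```

Here `dense` becomes `[1.0, 0.0, 3.0]` and `sparse` becomes `[2.0, -1.0, -1.0]`; with `missing_value=None` both calls would raise instead.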
- Returns:
- neighbors : list
Neighbor results for each query.
- output_type='item': list of list of int
- output_type='vector': list of list of list of float
- (neighbors, distances) : tuple
Returned when include_distances=True.
- (neighbors, labels) : tuple
Returned when return_labels=True.
- (neighbors, distances, labels) : tuple
Returned when include_distances=True and return_labels=True.
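Because the return type depends on two flags, callers need to unpack accordingly. The sketch below models only the documented ordering of the result. `pack_result` is a hypothetical helper and the neighbor data is made up for illustration.

```python
def pack_result(neighbors, distances, labels,
                include_distances=False, return_labels=False):
    """Assemble the return value in the documented order."""
    out = [neighbors]
    if include_distances:
        out.append(distances)
    if return_labels:
        out.append(labels)
    # A bare list when no extras are requested, otherwise a tuple.
    return out[0] if len(out) == 1 else tuple(out)


# Toy data for a single query with two neighbors.
nbrs, dists, labs = [[4, 7]], [[0.1, 0.3]], [["cat", "dog"]]

only_ids = pack_result(nbrs, dists, labs)
with_dist = pack_result(nbrs, dists, labs, include_distances=True)
all_three = pack_result(nbrs, dists, labs,
                        include_distances=True, return_labels=True)
```

With both flags set, the result unpacks as `neighbors, distances, labels = all_three`.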
See also
get_nns_by_item : Neighbor search by item id.
get_nns_by_vector : Neighbor search by query vector.
fit : Build the index from X (preferred if you already have X available).
fit_transform : Estimator-style APIs.
Notes
Excluding self is performed by matching neighbor ids to the query id (not by checking distance values).
For input_type=’vector’, exclude_self=True is an error; use exclude_items for explicit, deterministic filtering.
If exclusions prevent returning exactly n_neighbors results, this method raises ValueError.
Examples
Item queries (exclude the query id itself):
>>> idx.transform([10, 20], input_type='item', output_type='item', n_neighbors=5, exclude_self=True)
Vector queries (exclude explicit ids):
>>> idx.transform(X_query, input_type='vector', output_type='item', n_neighbors=5, exclude_items=[10, 20])
Return neighbor vectors:
>>> idx.transform([10], input_type='item', output_type='vector', n_neighbors=5, exclude_self=True)
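The exclusion rules in the notes (match by id, fail if fewer than n_neighbors remain) can be sketched without an index. `filter_neighbors` and the ranked id list are hypothetical; how the real method obtains enough candidates before filtering is not covered here.

```python
def filter_neighbors(candidate_ids, n_neighbors, exclude=()):
    """Drop excluded ids (matched by id, not distance) and keep n_neighbors."""
    banned = set(exclude)
    kept = [i for i in candidate_ids if i not in banned]
    if len(kept) < n_neighbors:
        raise ValueError(
            f"only {len(kept)} neighbors left after exclusions, "
            f"need {n_neighbors}"
        )
    return kept[:n_neighbors]


ranked = [10, 4, 7, 20, 3, 15, 9]  # made-up ids, nearest first

# exclude_self for query item 10 is modelled as excluding id 10.
no_self = filter_neighbors(ranked, n_neighbors=5, exclude=[10])
# Explicit exclusions work the same way.
no_pair = filter_neighbors(ranked, n_neighbors=5, exclude=[10, 20])
```

In this toy run `no_self` is `[4, 7, 20, 3, 15]` and `no_pair` is `[4, 7, 3, 15, 9]`; excluding a third id would leave only four candidates and raise `ValueError`.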
- unbuild()#
Discard the current forest, allowing new items to be added.
- Returns:
- Annoy
This instance (self), enabling method chaining.
Notes
After calling unbuild, you must call build again before running nearest-neighbour queries.
- unload()#
Unmap any memory-mapped file backing this index.
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
load : Memory-map an on-disk index into this object.
on_disk_build : Configure on-disk build mode.
Notes
This releases OS-level resources associated with the mmap, but keeps the Python object alive.
- verbose#
Verbosity level in [-2, 2] or None (unset). Callable setter: set_verbose().
- y#
Labels / targets associated with the index items.
Notes
If provided to fit(X, y), labels are stored here after a successful build. You may also set this property manually. When possible, the setter enforces that len(y) matches the current number of items (n_items).
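A length-checked property of this kind might look like the sketch below. `LabelStore` is a hypothetical class, and raising `ValueError` on a mismatch is an assumption; the notes only state that the setter enforces the length when possible.

```python
class LabelStore:
    """Hypothetical holder showing a length-checked y property."""

    def __init__(self, n_items):
        self.n_items = n_items
        self._y = None

    @property
    def y(self):
        return self._y

    @y.setter
    def y(self, value):
        if value is not None and len(value) != self.n_items:
            raise ValueError(
                f"len(y)={len(value)} does not match n_items={self.n_items}"
            )
        self._y = value


store = LabelStore(n_items=3)
store.y = ["a", "b", "c"]  # accepted: one label per item

try:
    store.y = ["a", "b"]   # rejected: wrong length
    rejected = False
except ValueError:
    rejected = True
```

After the failed assignment the previously stored labels are left untouched.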