Index#
- class scikitplot.annoy.Index[source]#
High-level ANNoy index composed from mixins.
- Parameters:
- f : int or None, optional, default=None
Vector dimension. If 0 or None, the dimension may be inferred from the first vector passed to add_item (lazy mode). None is treated as 0 (reset to default).
- metric : {“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”} or None, optional, default=None
Distance metric. The canonical names are ‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, and ‘hamming’; the remaining values are aliases. If omitted or None, behavior depends on f:
* If f > 0: defaults to 'angular' (cosine-like; legacy behavior, may emit a FutureWarning).
* If f == 0: leaves the metric unset (lazy). You may set metric later before construction, or it will default to 'angular' on the first add_item.
- n_neighbors : int, default=5
Number of neighbors to retrieve for each query. Must be a non-negative integer.
- on_disk_path : str or None, optional, default=None
If provided, configures the path for on-disk building. When the underlying index exists, this enables on-disk build mode (equivalent to calling on_disk_build with the same filename).
Note: Annoy core truncates the target file when enabling on-disk build. This wrapper treats on_disk_path as strictly equivalent to calling on_disk_build with the same filename (truncate allowed).
In lazy mode (f == 0 and/or metric is None), activation occurs once the underlying C++ index is created.
- prefault : bool or None, optional, default=None
If True, request page-faulting index pages into memory when loading (when supported by the underlying platform/backing). None is treated as False (reset to default).
- seed : int or None, optional, default=None
Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created. A seed value of 0 is treated as “use Annoy’s deterministic default seed” (a UserWarning is emitted when 0 is explicitly provided).
- verbose : int or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. The levels are inspired by gradient-boosting libraries:
* <= 0: quiet (warnings only)
* 1: info (Annoy’s verbose=True)
* >= 2: debug (currently same as info, reserved for future use)
- schema_version : int, optional, default=None
Serialization/compatibility strategy marker.
This does not change the Annoy on-disk format, but it does control how the index is snapshotted in pickles.
* 0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
* 2: pickle stores canonical-v1 (portable across ABIs; restores by rebuilding deterministically).
* >= 3: pickle stores both portable and canonical (canonical is used as a fallback if the ABI check fails).
None is treated as 0 (reset to default).
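The schema_version rule above can be read as a small lookup. The sketch below is a standalone illustration, not code from scikitplot: the helper name pickle_snapshot_kinds is hypothetical, and rejecting negative values is an assumption, since the text does not say how they are handled.

```python
def pickle_snapshot_kinds(schema_version=None):
    """Map schema_version to the snapshot kinds a pickle would carry.

    Hypothetical helper for illustration only; not part of scikitplot.
    """
    # None is documented as "treated as 0".
    version = 0 if schema_version is None else int(schema_version)
    if version < 0:
        # Assumption: the documentation does not cover negative values.
        raise ValueError("schema_version must be non-negative")
    if version <= 1:
        return ("portable-v1",)
    if version == 2:
        return ("canonical-v1",)
    # >= 3: both snapshots; canonical is the fallback if the ABI check fails.
    return ("portable-v1", "canonical-v1")


for value in (None, 0, 1, 2, 3):
    print(value, pickle_snapshot_kinds(value))
```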
- Attributes:
- f : int, default=0
Vector dimension.
- metric : {‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’}, default=’angular’
Distance metric for the index.
- n_neighbors : int, default=5
Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).
- on_disk_path : str or None, optional, default=None
Path used for on-disk build/load/save operations.
- seed, random_state : int or None, optional, default=None
Non-negative integer seed.
- verbose : int or None, optional, default=None
Verbosity level in [-2, 2] or None (unset).
- prefault : bool, default=False
Default prefault flag stored on the object.
- schema_version : int, default=0
Serialization/compatibility strategy marker sentinel value.
- n_features, n_features_, n_features_in_ : int
Alias of f (dimension), provided for scikit-learn naming parity.
- n_features_out_ : int
Number of output features produced by transform (SLEP013).
- feature_names_in_ : list-like
Input feature names seen during fit (SLEP007).
- y : dict | None, optional, default=None
Labels / targets associated with the index items.
- pickle_mode : PickleMode
Pickle strategy used by PickleMixin.
- compress_mode : CompressMode or None
Optional compression used by PickleMixin when serializing to bytes.
Notes
This class is a direct subclass of the C-extension backend. It does not override __new__ and does not rely on cooperative initialization across mixins. Mixins must be written so that their methods work even if they define no __init__ at all.
- add_item(i, vector)#
Add a single embedding vector to the index.
- Parameters:
- i : int
Item id (index); must be non-negative. Ids may be non-contiguous; the index allocates up to max(i) + 1.
- vector : sequence of float
1D embedding of length f. Values are converted to float. If f == 0 and this is the first item, f is inferred from vector and then fixed for the lifetime of this index.
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
build : Build the forest after adding items.
unbuild : Remove trees to allow adding more items.
get_nns_by_item, get_nns_by_vector : Query nearest neighbours.
Notes
Items must be added before calling build. After building the forest, further calls to add_item are not supported.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)
- add_items(X, ids=None, *, start_id=None, accept_sparse='error', ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, order='C', check_unique_ids=True)[source]#
Add many vectors to the index.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Vectors to add.
- ids : array-like of shape (n_samples,), optional
Explicit integer ids. If omitted, ids are allocated as a contiguous range starting at start_id (or get_n_items() at call time).
- start_id : int, optional
Starting id used when ids is None. If None, defaults to backend.get_n_items() at call time.
- accept_sparse : {‘error’, ‘toarray’}, default=’error’
Sparse input handling. 'toarray' densifies SciPy sparse inputs explicitly. Any other sparse behavior raises.
- ensure_all_finite : bool or ‘allow-nan’, default=True
Finiteness validation policy.
- copy : bool, default=False
If True, copy the validated dense array before adding.
- dtype : numpy dtype, default=numpy.float32
Dtype passed to the backend.
- order : {‘C’, ‘F’, ‘A’, ‘K’}, default=’C’
Memory order used when coercing X.
- check_unique_ids : bool, default=True
If True, require ids to be unique.
- Returns:
- ids_out : numpy.ndarray of shape (n_samples,)
The ids that were added, as int64.
- Raises:
- RuntimeError
If the backend indicates the index is built.
- TypeError
If sparse input is given while accept_sparse='error'.
- ValueError
If X is not 2D, feature dimensions mismatch f, ids are invalid, or finiteness policy is violated.
See also
get_item_vectors : Fetch vectors by id selection.
to_numpy : Export vectors as a dense NumPy array.
Notes
This method is deterministic: ids are generated predictably and vectors are added in row order.
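The id rules described for add_items can be sketched in plain Python. This is an illustration only: allocate_ids is a hypothetical helper, n_items stands in for backend.get_n_items(), and the exact error messages are assumptions.

```python
def allocate_ids(n_samples, n_items, ids=None, start_id=None, check_unique_ids=True):
    """Return the ids that would be assigned to n_samples rows.

    Hypothetical helper mirroring the documented rules; not the library code.
    """
    if ids is None:
        # Contiguous range starting at start_id, or at the current item count.
        first = n_items if start_id is None else int(start_id)
        if first < 0:
            raise ValueError("start_id must be non-negative")
        return list(range(first, first + n_samples))
    ids_out = [int(i) for i in ids]
    if len(ids_out) != n_samples:
        raise ValueError("ids must have one entry per row of X")
    if any(i < 0 for i in ids_out):
        raise ValueError("ids must be non-negative")
    if check_unique_ids and len(set(ids_out)) != len(ids_out):
        raise ValueError("ids must be unique")
    return ids_out


print(allocate_ids(3, n_items=5))               # appended after existing items
print(allocate_ids(2, n_items=5, start_id=10))  # explicit starting id
```

Ids come out in row order in both branches, which matches the determinism note above.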
- property backend: Annoy#
Public alias for _backend.
- Returns:
- backend : scikitplot.cexternals._annoy.Annoy
Low-level Annoy backend instance.
- build(n_trees, n_jobs=-1)#
Build a forest of random projection trees.
- Parameters:
- n_trees : int
Number of trees in the forest. Larger values typically improve recall at the cost of slower build time and higher memory usage.
If n_trees=-1, trees are built dynamically until the number of nodes reaches approximately twice the number of items (_n_nodes >= 2 * n_items).
Guidelines:
* Small datasets (<10k samples): 10-20 trees.
* Medium datasets (10k-1M samples): 20-50 trees.
* Large datasets (>1M samples): 50-100+ trees.
- n_jobs : int, optional, default=-1
Number of threads to use while building. -1 means “auto” (use the implementation’s default, typically all available CPU cores).
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
fit : Build the index from X (preferred if you already have X available).
add_item : Add vectors before building.
unbuild : Drop trees to add more items.
rebuild : Return a new Annoy index rebuilt from the current index contents.
on_disk_build : Configure on-disk build mode.
get_nns_by_item, get_nns_by_vector : Query nearest neighbours.
save, load : Persist the index to/from disk.
Notes
After build completes, the index is read-only and can only be queried. To add more items, call unbuild, add the items, and then call build again.
References
[1] Erik Bernhardsson, “Annoy: Approximate Nearest Neighbours in C++/Python”.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> f = 100
>>> n = 1000
>>> idx = AnnoyIndex(f, metric='l2')
...
>>> for i in range(n):
...     v = [random.gauss(0, 1) for _ in range(f)]
...     idx.add_item(i, v)
>>> idx.build(10)
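The n_trees guidelines listed for build can be turned into a small starting-point helper. The function suggest_n_trees below is hypothetical and only encodes those rules of thumb; treating exactly 1M samples as "medium" and capping the open-ended "100+" at 100 are assumptions.

```python
def suggest_n_trees(n_samples):
    """Return a (low, high) tree-count range from the documented guidelines.

    Hypothetical helper; the ranges are rules of thumb, not hard limits.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if n_samples < 10_000:
        return (10, 20)
    if n_samples <= 1_000_000:
        return (20, 50)
    # "50-100+": 100 is a soft upper value, larger forests are allowed.
    return (50, 100)


for n in (5_000, 250_000, 5_000_000):
    print(n, suggest_n_trees(n))
```

Recall requirements and memory budget still decide the final value; the range is only a first guess.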
- deserialize(byte, prefault=None)#
Restore the index from a serialized byte string.
- Parameters:
- byte : bytes
Byte string produced by serialize. Both native (legacy) blobs and portable blobs (created with serialize(format='portable')) are accepted; portable and canonical blobs are auto-detected. Canonical blobs restore by rebuilding the index deterministically.
- prefault : bool or None, optional, default=None
Accepted for API symmetry with load. If None, the stored prefault value is used. Ignored for canonical blobs.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If deserialization fails due to invalid or incompatible data.
- RuntimeError
If the index is not initialized.
See also
serialize : Create a binary snapshot of the index.
on_disk_build : Configure on-disk build mode.
Notes
Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
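To illustrate why a small header lets incompatible binaries fail loudly, here is a toy sketch using the standard struct module. The byte layout, the DEMO magic tag, and the names pack_header and check_header are all hypothetical; the real portable blob format is not described in this document and will differ.

```python
import struct
import sys

MAGIC = b"DEMO"  # hypothetical tag, not the library's real magic bytes
# magic, version, sizeof(int), sizeof(float), little-endian flag, metric, f
HEADER = struct.Struct("<4sHBBB16sI")


def _is_little():
    return 1 if sys.byteorder == "little" else 0


def pack_header(metric, f, version=1):
    """Build a toy header describing this interpreter's ABI and the index config."""
    return HEADER.pack(
        MAGIC, version,
        struct.calcsize("i"), struct.calcsize("f"),
        _is_little(), metric.encode("ascii"), f,
    )


def check_header(blob, metric, f):
    """Validate the toy header and return the payload, or raise IOError."""
    if len(blob) < HEADER.size:
        raise IOError("blob too short for a header")
    magic, _version, int_size, float_size, little, raw_metric, blob_f = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise IOError("not a portable blob")
    if (int_size, float_size) != (struct.calcsize("i"), struct.calcsize("f")):
        raise IOError("ABI size mismatch")
    if little != _is_little():
        raise IOError("endianness mismatch")
    if raw_metric.rstrip(b"\0").decode("ascii") != metric or blob_f != f:
        raise IOError("metric or dimension mismatch")
    return blob[HEADER.size:]


blob = pack_header("angular", 8) + b"native-payload"
print(check_header(blob, "angular", 8))
```

The payload after the header is passed through untouched, which is why such a blob is a safety check rather than a cross-architecture wire format.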
- f#
Vector dimension.
- Returns:
- int
Dimension of each item vector. 0 means unknown / lazy.
Notes
Annoy(f=None, ...) is supported at construction time and is treated as f=0. 0 (or None) means “unknown / lazy”: the first call to add_item will infer f from the input vector length and then fix it.
Changing f after the index has been initialized (items added and/or trees built) is a structural change: the stored items and all tree splits depend on the vector dimension.
For scikit-learn compatibility, assigning a different f (or None) on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
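The lazy-dimension and reset behaviour described in these notes can be modelled with a toy class. LazyDimIndex below is a hypothetical stand-in written for illustration, not the real Index; it stores vectors in a dict and has no trees. Rejecting an empty first vector is an assumption.

```python
class LazyDimIndex:
    """Toy model of the documented f rules (not the real Index class)."""

    def __init__(self, f=None):
        self._f = 0 if f is None else int(f)  # None is treated as 0 (lazy)
        self._items = {}

    def __len__(self):
        return len(self._items)

    @property
    def f(self):
        return self._f

    @f.setter
    def f(self, value):
        value = 0 if value is None else int(value)
        if value != self._f:
            # Structural change: stored items depend on the dimension.
            self._items.clear()
            self._f = value

    def add_item(self, i, vector):
        vector = [float(x) for x in vector]
        if not vector:
            raise ValueError("vector must not be empty")  # assumption
        if self._f == 0:
            self._f = len(vector)  # inferred once, then fixed
        elif len(vector) != self._f:
            raise ValueError(f"expected length {self._f}, got {len(vector)}")
        self._items[i] = vector
        return self


idx = LazyDimIndex().add_item(0, [1, 2, 3]).add_item(5, [4, 5, 6])
print(idx.f, len(idx))
idx.f = 4  # different dimension: the toy index is reset
print(idx.f, len(idx))
```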
- feature_names_in_#
Input feature names seen during fit (SLEP007). Set only when explicitly provided via fit(…, feature_names=…).
- fit(X=None, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None)#
Fit the Annoy index (scikit-learn compatible).
This method supports two deterministic workflows:
Manual add/build: If X is None and y is None, fit() builds the forest using items previously added via add_item().
Array-like X: If X is provided (2D array-like), fit() optionally resets or appends, adds all rows as items, then builds the forest.
- Parameters:
- X : array-like of shape (n_samples, n_features), default=None
Vectors to add to the index. If None (and y is None), fit() only builds.
- y : array-like of shape (n_samples,), default=None
Optional labels associated with X. Stored as y after a successful build.
- n_trees : int, default=-1
Number of trees to build. Use -1 for Annoy’s internal default.
- n_jobs : int, default=-1
Number of threads to use during build (-1 means “auto”).
- reset : bool, default=True
If True, clear existing items before adding X. If False, append.
- start_index : int or None, default=None
Item id for the first row of X. If None, uses 0 when reset=True, otherwise uses current n_items when reset=False.
- missing_value : float or None, default=None
If not None, imputes missing entries in X.
* Dense rows: replaces None elements with missing_value.
* Dict rows: fills missing keys (and None values) with missing_value.
If None, missing entries raise an error (strict mode).
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
fit_transform : Estimator-style APIs.
transform : Query the built index.
add_item : Add one item at a time.
build : Build the forest after manual calls to add_item.
on_disk_build : Configure on-disk build mode.
unbuild : Remove trees so items can be appended.
y : Stored labels y (if provided).
get_params, set_params : Estimator parameter API.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx = AnnoyIndex().set_params(metric=m).fit(X)
...     print(m, idx.transform(q))
...
>>> base = AnnoyIndex().fit(X)
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     idx_m = base.rebuild(metric=m)  # rebuild-from-index
...     print(m, idx_m.transform(q))  # no .fit(X) here
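The start_index and missing_value rules described for fit can be sketched as two small functions. Both names are hypothetical, and the assumption that dict rows are keyed by column position 0..f-1 is mine; the documentation does not state the key scheme.

```python
def resolve_start_index(start_index, reset, n_items):
    """Pick the id of the first row of X, following the documented defaults."""
    if start_index is not None:
        return start_index
    return 0 if reset else n_items


def impute_row(row, f, missing_value=None):
    """Return a dense list of length f for a dense or dict row.

    Hypothetical helper. Assumes dict rows map column position -> value.
    """
    if isinstance(row, dict):
        values = [row.get(j) for j in range(f)]  # missing keys become None
    else:
        values = list(row)
        if len(values) != f:
            raise ValueError(f"expected {f} values, got {len(values)}")
    if missing_value is None:
        # Strict mode: any missing entry is an error.
        if any(v is None for v in values):
            raise ValueError("missing entry and no missing_value given")
        return [float(v) for v in values]
    return [float(missing_value) if v is None else float(v) for v in values]


print(resolve_start_index(None, reset=False, n_items=42))
print(impute_row([1, None, 3], 3, missing_value=0.0))
print(impute_row({0: 1, 2: None}, 3, missing_value=-1))
```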
- fit_transform(X, y=None, *, n_trees=-1, n_jobs=-1, reset=True, start_index=None, missing_value=None, feature_names=None, n_neighbors=None, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None)#
Fit the index and transform X in a single deterministic call.
- This is equivalent to:
self.fit(X, y=y, n_trees=…, n_jobs=…, reset=…, start_index=…, missing_value=…)
self.transform(X, n_neighbors=…, search_k=…, include_distances=…, return_labels=…, y_fill_value=…, missing_value=…)
See also
fit : Build the index from X (preferred if you already have X available).
transform : Query the built index.
on_disk_build : Configure on-disk build mode.
Examples
>>> import random
>>> from scikitplot.cexternals._annoy import Annoy, AnnoyIndex
...
>>> n, f = 10_000, 1_000
>>> X = [[random.gauss(0, 1) for _ in range(f)] for _ in range(n)]
>>> q = [[random.gauss(0, 1) for _ in range(f)]]
...
>>> for m in ['angular', 'l1', 'l2', '.', 'hamming']:
...     print(m, AnnoyIndex().set_params(metric=m).fit_transform(q))
- classmethod from_bytes(data, *, f=None, metric=None, prefault=None)[source]#
Construct a new index and load it from serialized bytes.
- Parameters:
- data
Bytes produced by to_bytes (backend serialize).
- f
Vector dimension for construction.
- metric
Metric name for construction.
- prefault
Forwarded to the backend deserialize if supported.
- Returns:
- index
Newly constructed index with the data loaded.
- Raises:
- TypeError
If data is not bytes-like.
- ValueError
If f or metric is invalid.
- AttributeError
If the backend does not provide deserialize.
Notes
Portable blobs add a small header (version, ABI sizes, endianness, metric, f) to ensure incompatible binaries fail loudly and safely. They are not a cross-architecture wire format; the payload remains Annoy’s native snapshot.
If data was produced by to_bytes(format='native'), the f and metric parameters are required.
- classmethod from_low_level(obj, *, prefault=None)[source]#
Create a new Index from a low-level instance.
The new object is rebuilt by round-tripping through Annoy’s native serialize/deserialize to avoid sharing low-level state between two Python objects.
- Parameters:
- obj : scikitplot.cexternals._annoy.Annoy
Low-level Annoy instance.
- prefault : bool or None, default=None
Prefault override passed to deserialize. If None, the value is taken from obj.get_params(deep=False) when available, otherwise it falls back to obj.prefault / destination defaults.
- Returns:
- index : Index
Newly constructed high-level index.
- Raises:
- TypeError
If obj is not an Annoy instance.
- RuntimeError
If serialization or deserialization fails, or required configuration (e.g., f) cannot be determined.
Notes
The implementation uses Annoy’s native serialization. It does not attempt to copy internal pointers or C++ state directly.
This method is deterministic. It always constructs a new index from the serialized payload; it does not share low-level state between objects.
- classmethod from_metadata(metadata, *, load=True)[source]#
Construct an index from a metadata payload.
- Parameters:
- metadata : Mapping[str, Any]
Payload as produced by to_metadata.
- load : bool, default=True
If True and params['on_disk_path'] is present, attempt to load the index into the returned object via backend load.
- Returns:
- index : Self
Newly constructed index.
- Raises:
- TypeError
If input types are invalid.
- ValueError
If required fields are missing or invalid.
- RuntimeError
If schema version is missing on the class.
- AttributeError
If backend set_params/load are missing when required.
- classmethod from_yaml(path, *, load=True)[source]#
Load metadata from YAML and construct an index (requires PyYAML).
- get_distance(i, j) float#
Return the distance between two stored items.
- Parameters:
- i, j : int
Item ids (index) of two stored samples.
- Returns:
- d : float
Distance between items i and j under the current metric.
- Raises:
- RuntimeError
If the index is not initialized.
- IndexError
If either index is out of range.
- get_feature_names_out(input_features=None)#
Get output feature names for the transformer-style API.
- Parameters:
- input_features : sequence of str or None, optional, default=None
If provided, validated deterministically against the fitted input feature names (if available) and the expected input dimensionality.
- Returns:
- tuple of str
Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.
- Raises:
- AttributeError
If called before fit/build.
- ValueError
If input_features is provided but does not match feature_names_in_.
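The naming scheme for output features is simple enough to reproduce directly. The helper neighbor_feature_names below is hypothetical and only shows the documented pattern; it performs none of the input_features validation.

```python
def neighbor_feature_names(n_neighbors):
    """Return ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    if n_neighbors < 0:
        raise ValueError("n_neighbors must be non-negative")
    return tuple(f"neighbor_{j}" for j in range(n_neighbors))


print(neighbor_feature_names(3))
```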
- get_item_vector(i) list[float]#
Return the stored embedding vector for a given item id.
- Parameters:
- i : int
Item id (index) previously passed to add_item.
- Returns:
- vector : list[float]
Stored embedding of length f.
- Raises:
- RuntimeError
If the index is not initialized.
- IndexError
If i is out of range.
- get_item_vectors(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, return_ids=False, validate_vector_len=True)[source]#
Fetch many vectors as a dense NumPy array.
- Parameters:
- ids : sequence of int or iterable of int, optional
Ids to fetch. If None, selects range(start, stop or n_items).
- dtype : numpy dtype, default=numpy.float32
Output dtype.
- start, stop : int, optional
Range selection used when ids is None.
- n_rows : int, optional
Required when ids is a non-sized iterable (e.g., generator).
- return_ids : bool, default=False
If True, also return the realized ids (int64) in row order.
- validate_vector_len : bool, default=True
If True, verify every fetched vector has length f.
- Returns:
- X : numpy.ndarray of shape (n_rows, f)
Dense matrix of vectors.
- ids_out : numpy.ndarray of shape (n_rows,), optional
Returned when return_ids=True.
- Raises:
- ValueError
If the id selection is inconsistent or vectors have unexpected length.
- TypeError
If ids is a non-sized iterable and n_rows is not provided.
See also
to_numpy : Dense NumPy export alias.
iter_item_vectors : Streaming export without allocating a dense matrix.
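The id selection rules of get_item_vectors can be sketched as follows. select_ids is a hypothetical helper: n_items stands in for get_n_items(), and the consistency check between n_rows and the realized ids is an assumption about what "inconsistent" means.

```python
def select_ids(n_items, ids=None, start=0, stop=None, n_rows=None):
    """Resolve which item ids would be fetched, in row order.

    Hypothetical helper mirroring the documented selection rules.
    """
    if ids is None:
        # Documented default: range(start, stop or n_items).
        return list(range(start, stop or n_items))
    if not hasattr(ids, "__len__") and n_rows is None:
        # Generators and other non-sized iterables need an explicit row count.
        raise TypeError("n_rows is required when ids is a non-sized iterable")
    out = [int(i) for i in ids]
    if n_rows is not None and len(out) != n_rows:
        raise ValueError("n_rows does not match the number of ids")
    return out


print(select_ids(5))
print(select_ids(5, start=2, stop=4))
print(select_ids(5, ids=(i for i in (4, 0)), n_rows=2))
```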
- get_n_items() int#
Return the number of stored items in the index.
- Returns:
- n_items : int
Number of items that have been added and are currently addressable.
- Raises:
- RuntimeError
If the index is not initialized.
- get_n_trees() int#
Return the number of trees in the current forest.
- Returns:
- n_trees : int
Number of trees that have been built.
- Raises:
- RuntimeError
If the index is not initialized.
- get_nns_by_item(i, n, search_k=-1, include_distances=False)#
Return the n nearest neighbours for a stored item id.
- Parameters:
- i : int
Item id (index) previously passed to add_item(i, embedding).
- n : int
Number of nearest neighbours to return.
- search_k : int, optional, default=-1
Maximum number of nodes to inspect. Larger values usually improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional, default=False
If True, return a (indices, distances) tuple. Otherwise return only the list of indices.
- Returns:
- indices : list[int] | tuple[list[int], list[float]]
If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).
- Raises:
- RuntimeError
If the index is not initialized or has not been built.
- IndexError
If i is out of range.
See also
get_nns_by_vector : Query with an explicit query embedding.
- get_nns_by_vector(vector, n, search_k=-1, include_distances=False)#
Return the n nearest neighbours for a query embedding.
- Parameters:
- vector : sequence of float
Query embedding of length f.
- n : int
Number of nearest neighbours to return.
- search_k : int, optional, default=-1
Maximum number of nodes to inspect. Larger values typically improve recall at the cost of slower queries. If -1, defaults to approximately n_trees * n.
- include_distances : bool, optional, default=False
If True, return a (indices, distances) tuple. Otherwise return only the list of indices.
- Returns:
- indices : list[int] | tuple[list[int], list[float]]
If include_distances=False: list of neighbour item ids. If include_distances=True: (indices, distances).
- Raises:
- RuntimeError
If the index is not initialized or has not been built.
- ValueError
If len(vector) != f.
See also
get_nns_by_item : Query by stored item id.
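The search_k default shared by both query methods can be written out as a one-line rule. effective_search_k is a hypothetical helper; the text says the default is "approximately" n_trees * n, so the exact product here is a simplification.

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Return the node budget a query would use under the documented default."""
    if search_k == -1:
        # Default: roughly n_trees * n nodes are inspected.
        return n_trees * n
    return search_k


print(effective_search_k(n=10, n_trees=20))
print(effective_search_k(n=10, n_trees=20, search_k=5_000))
```

Raising search_k above the default trades query time for recall without rebuilding the forest.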
- get_params(deep=True) dict#
Return estimator-style parameters (scikit-learn compatibility).
- Parameters:
- deep : bool, optional, default=True
Included for scikit-learn API compatibility. Ignored because Annoy does not contain nested estimators.
- Returns:
- params : dict
Dictionary of stable, user-facing parameters.
See also
set_params : Set estimator-style parameters.
schema_version : Controls pickle / snapshot strategy.
Notes
This is intended to make Annoy behave like a scikit-learn estimator for tools such as sklearn.base.clone and parameter grids.
- info(include_n_items=True, include_n_trees=True, include_memory=None) dict#
Return a structured summary of the index.
This method returns a JSON-like Python dictionary that is easier to inspect programmatically than the legacy multi-line string format.
- Parameters:
- include_n_items : bool, optional, default=True
If True, include n_items.
- include_n_trees : bool, optional, default=True
If True, include n_trees.
- include_memory : bool or None, optional, default=None
Controls whether memory usage fields are included.
* None: include memory usage only if the index is built.
* True: include memory usage if available (built).
* False: omit memory usage fields.
Memory usage is computed after build and may be expensive for very large indexes.
- Returns:
- info : dict
Dictionary describing the current index state.
See also
serialize : Create a binary snapshot of the index.
deserialize : Restore from a binary snapshot.
save : Persist the index to disk.
load : Load the index from disk.
Notes
Some keys are optional depending on include_* flags.
Keys:
- f : int, default=0
Dimensionality of the index.
- metric : str, default=’angular’
Distance metric name.
- on_disk_path : str, default=’’
Path used for on-disk build, if configured.
- prefault : bool, default=False
If True, aggressively fault pages into memory during save. Primarily useful on some platforms for very large indexes.
- schema_version : int, default=0
Stored schema/version marker on this object (reserved for future use).
- seed : int or None, optional, default=None
Non-negative integer seed. If set before the index is constructed, the seed is stored and applied when the C++ index is created.
- verbose : int or None, optional, default=None
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. The levels are inspired by gradient-boosting libraries:
* <= 0: quiet (warnings only)
* 1: info (Annoy’s verbose=True)
* >= 2: debug (currently same as info, reserved for future use)
Optional Keys:
- n_items : int
Number of items currently stored.
- n_trees : int
Number of built trees in the forest.
- memory_usage_byte : int
Approximate memory usage in bytes. Present only when requested and available.
- memory_usage_mib : float
Approximate memory usage in MiB. Present only when requested and available.
Examples
>>> info = idx.info()
>>> info['f']
100
>>> info['n_items']
1000
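The two memory keys report the same quantity in different units. The sketch below shows the conversion with a hypothetical helper, memory_fields; that the library derives the MiB value by dividing the byte count by 2**20 is an assumption based on the unit name, and returning an empty dict for None mirrors the "present only when available" wording.

```python
def memory_fields(n_bytes):
    """Return the optional memory keys for an info-style dict.

    Hypothetical helper; an empty dict means memory usage is unavailable.
    """
    if n_bytes is None:
        return {}
    return {
        "memory_usage_byte": int(n_bytes),
        "memory_usage_mib": n_bytes / (1024 * 1024),  # 1 MiB = 2**20 bytes
    }


print(memory_fields(3 * 1024 * 1024))
print(memory_fields(None))
```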
- iter_item_vectors(ids=None, *, start=0, stop=None, with_ids=True, dtype=None)[source]#
Iterate vectors without allocating a dense matrix.
- Parameters:
- ids, start, stop
Selection controls. See get_item_vectors.
- with_ids : bool, default=True
If True, yield (id, vector). If False, yield vectors only.
- dtype : numpy dtype, optional
If provided, cast output vectors to this dtype.
- Yields:
- (id, vector) or vector
Each vector is returned as a 1D NumPy array.
See also
get_item_vectors : Dense export.
- kneighbors(X, n_neighbors=5, *, search_k=-1, include_distances=True, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='vector')[source]#
Find k nearest neighbors for one or more query vectors.
This is a sklearn-like convenience wrapper that returns rectangular arrays.
- Parameters:
- X : array-like of shape (f,) or (n_queries, f)
Query vector(s).
- n_neighbors : int, default=5
Number of neighbors to return per query.
- search_k : int, default=-1
Search parameter forwarded to the backend.
- include_distances : bool, default=True
If True, return (neighbors, distances). Otherwise return neighbors.
- exclude_self : bool, default=False
If True, apply the same deterministic self-exclusion rule as query_by_vector for each query row.
- exclude_item_ids : iterable of int, optional
Exclude these ids for every query.
- ensure_all_finite : bool or ‘allow-nan’, default=True
Input validation option forwarded to scikit-learn.
- copy : bool, default=False
Input validation option forwarded to scikit-learn.
- output_type : {‘item’, ‘vector’}, default=’vector’
If ‘item’, return neighbor ids. If ‘vector’, return neighbor vectors.
- Returns:
- neighbors : numpy.ndarray
If output_type='item', shape is (n_queries, n_neighbors). If output_type='vector', shape is (n_queries, n_neighbors, f).
- distances : numpy.ndarray of shape (n_queries, n_neighbors)
Neighbor distances. Returned when include_distances=True.
- Raises:
- sklearn.exceptions.NotFittedError
If the backend reports that the index is unbuilt.
- ValueError
If n_neighbors <= 0 or any query yields too few neighbors after exclusions.
See also
query_by_vector : Per-query 1D interface.
kneighbors_graph : CSR kNN graph.
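Because kneighbors returns rectangular arrays, every query must keep exactly n_neighbors results after exclusions. The sketch below shows that constraint for one query row. filter_neighbors is a hypothetical helper, candidates stands for an already ranked list of ids from the backend, and the self-exclusion rule is left out because it is not specified in this document.

```python
def filter_neighbors(candidates, n_neighbors, exclude_item_ids=None):
    """Drop excluded ids from a ranked candidate list and keep the first k.

    Hypothetical helper; raises ValueError as the documented failure mode.
    """
    if n_neighbors <= 0:
        raise ValueError("n_neighbors must be positive")
    excluded = set(exclude_item_ids or ())
    kept = [i for i in candidates if i not in excluded]
    if len(kept) < n_neighbors:
        raise ValueError("too few neighbors after exclusions")
    return kept[:n_neighbors]


print(filter_neighbors([7, 3, 9, 1, 4], 3, exclude_item_ids=[3]))
```

One way to leave room for exclusions is to request more candidates than n_neighbors from the backend, at the cost of a slightly larger query.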
- kneighbors_graph(X, n_neighbors=5, *, search_k=-1, mode='connectivity', exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, output_type='item')[source]#
Compute the k-neighbors graph (CSR) for query vectors.
- Parameters:
- X : array-like of shape (f,) or (n_queries, f)
Query vector(s).
- n_neighbors : int, default=5
Number of neighbors per query.
- search_k : int, default=-1
Search parameter forwarded to the backend.
- mode : {‘connectivity’, ‘distance’}, default=’connectivity’
If ‘connectivity’, graph entries are 1. If ‘distance’, entries are backend distances.
- exclude_self : bool, default=False
If True, apply the same deterministic self-exclusion rule as kneighbors for each query row.
- exclude_item_ids : iterable of int, optional
Exclude these ids for every query.
- ensure_all_finite : bool or ‘allow-nan’, default=True
Input validation option forwarded to scikit-learn.
- copy : bool, default=False
Input validation option forwarded to scikit-learn.
- output_type : {‘item’}, default=’item’
Must be ‘item’ for CSR construction.
- Returns:
- graph : scipy.sparse.csr_matrix
CSR matrix of shape (n_queries, n_items).
- Raises:
- ImportError
If SciPy is not installed.
- ValueError
If mode is invalid or output_type != 'item'.
- RuntimeError
If the backend returns an out-of-range neighbor id.
See also
kneighbors : Dense kNN results.
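A CSR kNN graph is fully described by three flat arrays. The sketch below builds them in plain Python from per-query neighbor lists, following the documented connectivity and distance modes. knn_csr_arrays is a hypothetical helper written for illustration, not the library's implementation.

```python
def knn_csr_arrays(neighbor_ids, n_items, distances=None, mode="connectivity"):
    """Build (data, indices, indptr) for a CSR graph of shape (n_queries, n_items)."""
    if mode not in ("connectivity", "distance"):
        raise ValueError(f"invalid mode: {mode!r}")
    if mode == "distance" and distances is None:
        raise ValueError("distances are required when mode='distance'")
    data, indices, indptr = [], [], [0]
    for row, ids in enumerate(neighbor_ids):
        for col, item_id in enumerate(ids):
            if not 0 <= item_id < n_items:
                raise RuntimeError(f"out-of-range neighbor id: {item_id}")
            indices.append(item_id)
            # Connectivity stores 1 per edge; distance stores the backend distance.
            data.append(1.0 if mode == "connectivity" else float(distances[row][col]))
        indptr.append(len(indices))  # row boundary
    return data, indices, indptr


data, indices, indptr = knn_csr_arrays([[1, 2], [0, 2]], n_items=3)
print(data, indices, indptr)
```

With SciPy installed, these arrays can be passed as scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_queries, n_items)).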
- load(fn, prefault=None)#
Load (mmap) an index from disk into the current object.
- Parameters:
- fn : str
Path to a file previously created by save or on_disk_build.
- prefault : bool or None, optional, default=None
If True, fault pages into memory when the file is mapped. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If the file cannot be opened or mapped.
- RuntimeError
If the index is not initialized or the file is incompatible.
See also
save : Save the current index to disk.
on_disk_build : Build directly using an on-disk backing file.
unload : Release mmap resources.
Notes
The in-memory index must have been constructed with the same dimension and metric as the on-disk file.
- classmethod load_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#
Load a directory bundle created by save_bundle.
- Parameters:
- manifest_filename
Filename for the metadata manifest inside the directory.
- index_filename
Filename for the Annoy index inside the directory.
- prefault
Forwarded to load_index.
- Returns:
- index
Newly constructed index.
- classmethod load_index(f, metric, path, *, prefault=None)[source]#
Load (mmap) an Annoy index file into this object.
- Parameters:
- f
Vector dimension for construction.
- metric
Metric name for construction.
- path : str or os.PathLike
Path to a file previously created by save_index or the backend save.
- prefault
Forwarded to the backend. If None, the backend default is used.
- Raises:
- AttributeError
If the backend does not provide load(path, prefault=...).
- OSError
If loading fails (backend or filesystem).
- memory_usage() int#
Approximate memory usage of the index in bytes.
- Returns:
- n_bytes : int or None
Approximate number of bytes used by the index. Returns None if the index is not initialized or the forest has not been built yet.
- Raises:
- RuntimeError
If memory usage cannot be computed.
- metric#
Distance metric for the index. Valid values:
‘angular’ -> Cosine-like distance on normalized vectors.
‘euclidean’ -> L2 distance.
‘manhattan’ -> L1 distance.
‘dot’ -> Negative dot-product distance (inner product).
‘hamming’ -> Hamming distance for binary vectors.
Aliases (case-insensitive):
angular : cosine
euclidean : l2, lstsq
manhattan : l1, cityblock, taxicab
dot : @, ., dotproduct, inner, innerproduct
hamming : hamming
- Returns:
- str or None
Canonical metric name, or None if not configured yet.
Notes
Changing metric after the index has been initialized (items added and/or trees built) is a structural change: the forest and all distances depend on the distance function.
For scikit-learn compatibility, setting a different metric on an already initialized index will deterministically reset the index (drop all items, trees, and y). You must call fit (or add_item + build) again before querying.
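The alias table for metric maps directly onto a case-insensitive lookup. canonical_metric below is a hypothetical helper built only from the aliases listed in this section; raising ValueError for unknown names is an assumption.

```python
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}
# Flatten to alias -> canonical name for O(1) lookups.
_LOOKUP = {alias: canon for canon, aliases in _ALIASES.items() for alias in aliases}


def canonical_metric(name):
    """Return the canonical metric name, or None if the metric is unset."""
    if name is None:
        return None  # lazy: metric not configured yet
    try:
        return _LOOKUP[str(name).lower()]
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None


for alias in ("Cosine", "L2", "taxicab", "@", "hamming"):
    print(alias, "->", canonical_metric(alias))
```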
- n_features#
Alias of f (dimension), provided for scikit-learn naming parity.
- n_features_#
Read-only alias of n_features_in_.
- n_features_in_#
Number of features seen during fit (scikit-learn compatible). Alias of f when available.
- n_features_out_#
Number of output features produced by transform (SLEP013). Equals n_neighbors once fitted.
- n_neighbors#
Number of neighbors returned by transform/fit_transform (SLEP013; strict schema).
- on_disk_build(fn)#
Configure the index to build using an on-disk backing file.
- Parameters:
- fn : str
Path to a file that will hold the index during build. The file is created or overwritten as needed.
- Returns:
- Annoy
This instance (self), enabling method chaining.
Notes
This mode is useful for very large datasets that do not fit comfortably in RAM during construction.
- on_disk_path#
Path used for on-disk build/load/save operations.
- Returns:
- str or None
Filesystem path used for on-disk operations, or None if not configured.
Notes
Assigning a string/PathLike to on_disk_path configures on-disk build mode and is strictly equivalent to calling on_disk_build with the same filename. Note that Annoy core truncates the target file when enabling on-disk build.
Assigning None (or an empty string) clears the configured path, but only when no disk-backed index is currently active.
Clearing or changing this while an on-disk index is active is disallowed. Call unload first.
- plot_index(labels=None, *, ids=None, projection='pca', dims=(0, 1), center=True, maxabs=False, l2_normalize=False, dtype=<class 'numpy.float32'>, ax=None, title=None, plot_kwargs=None)[source]#
Plot this index as a 2D scatter plot.
This is a thin wrapper around plot_annoy_index that uses _plotting_backend.
- Parameters:
- labels, ids, projection, dims, center, maxabs, l2_normalize, dtype, ax, title, plot_kwargs
See plot_annoy_index.
- Returns:
- y2, ids_out, ax
See plot_annoy_index.
See also
plot_annoy_index : Low-level plotting helper this method delegates to.
plot_knn_edges : Overlay kNN edges on the returned 2D coordinates.
Notes
This method does not mutate the index.
Plotting backends (e.g. Matplotlib) are imported lazily and are only required when this method is called.
The returned ids_out corresponds to the item id for each row in y2.
Examples
>>> import numpy as np
>>> import scikitplot.annoy as skann
>>> idx = skann.Index(f=10, metric="angular")
>>> # ... add items & build ...
>>> labels = np.zeros(idx.get_n_items(), dtype=int)
>>> y2, ids, ax = idx.plot_index(labels=labels, projection="pca")
- plot_knn_edges(y2, *, ids=None, k=10, search_k=-1, ax=None, line_kwargs=None, undirected=True)[source]#
Overlay kNN edges onto an existing 2D index plot.
This is a thin wrapper around plot_annoy_knn_edges that uses _plotting_backend.
- Parameters:
- y2, ids, k, search_k, ax, line_kwargs, undirected
See plot_annoy_knn_edges.
- Returns:
- ax
The axes that were drawn on.
See also
plot_annoy_knn_edges : Low-level edge overlay helper this method delegates to.
plot_index : Computes the 2D coordinates used as input to this method.
Notes
y2 must represent 2D coordinates with shape (n_samples, 2).
If ids is provided, it must have length n_samples.
This method does not mutate the index; it only performs neighbor queries to draw edges.
Examples
>>> y2, ids, ax = idx.plot_index(labels=np.zeros(idx.get_n_items(), dtype=int))
>>> idx.plot_knn_edges(y2, ids=ids, k=5, line_kwargs={"alpha": 0.15})
- prefault#
Default prefault flag stored on the object.
This setting is used as the default for per-call prefault arguments when prefault is omitted or set to None in methods like load and save.
- Returns:
- bool
Current prefault flag.
Notes
This flag does not retroactively change already-loaded mappings.
- query_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#
Query neighbors by stored item id.
- Parameters:
- item : int
Stored item id.
- n_neighbors : int
Number of neighbors to return after applying exclusions.
- search_k : int, default=-1
Search parameter forwarded to the backend.
- include_distances : bool, default=False
If True, also return distances.
- exclude_self : bool, default=False
If True, exclude item from the returned neighbors.
- exclude_item_ids : iterable of int, optional
Additional item ids to exclude.
- ensure_all_finite : bool or ‘allow-nan’, default=True
Input validation option forwarded to scikit-learn.
- copy : bool, default=False
Input validation option forwarded to scikit-learn.
- Returns:
- indices : numpy.ndarray of shape (n_neighbors,)
Neighbor ids.
- (indices, distances) : tuple of numpy.ndarray
Returned when include_distances=True.
- Raises:
- sklearn.exceptions.NotFittedError
If the backend reports that the index is unbuilt.
- ValueError
If n_neighbors <= 0 or not enough neighbors remain after exclusions.
See also
query_by_vector : Query neighbors by an explicit vector.
kneighbors : Batch neighbor queries (sklearn-like).
Notes
Exclusions are applied deterministically in the order returned by the backend.
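The order-preserving exclusion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: apply_exclusions is a hypothetical helper, and candidates stands for the id list the backend returned, already sorted by distance.

```python
def apply_exclusions(candidates, n_neighbors, *, exclude_ids=()):
    """Keep the first n_neighbors candidate ids that are not excluded.

    The backend order is preserved, so the result is deterministic for a
    given candidate list.
    """
    if n_neighbors <= 0:
        raise ValueError("n_neighbors must be positive")
    excluded = set(exclude_ids)
    kept = [item_id for item_id in candidates if item_id not in excluded]
    if len(kept) < n_neighbors:
        raise ValueError("not enough neighbors remain after exclusions")
    return kept[:n_neighbors]
```

For an item query with exclude_self=True, the query id would simply be added to exclude_ids before filtering.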
- query_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False)[source]#
Query neighbors by an explicit vector.
- Parameters:
- vector : array-like of shape (f,)
Query vector.
- n_neighbors : int
Number of neighbors to return after exclusions.
- search_k : int, default=-1
Search parameter forwarded to the backend.
- include_distances : bool, default=False
If True, also return distances.
- exclude_self : bool, default=False
If True, exclude the first returned candidate whose distance is exactly 0.0. This is intended for queries where vector comes from the index itself.
- exclude_item_ids : iterable of int, optional
Additional item ids to exclude.
- ensure_all_finite : bool or ‘allow-nan’, default=True
Input validation option forwarded to scikit-learn.
- copy : bool, default=False
Input validation option forwarded to scikit-learn.
- Returns:
- indices : numpy.ndarray of shape (n_neighbors,)
Neighbor ids.
- (indices, distances) : tuple of numpy.ndarray
Returned when include_distances=True.
- Raises:
- sklearn.exceptions.NotFittedError
If the backend reports that the index is unbuilt.
- ValueError
If n_neighbors <= 0, vector dimension mismatches f, or not enough neighbors remain after exclusions.
See also
query_by_item : Query neighbors by stored item id.
kneighbors : Batch neighbor queries (sklearn-like).
Notes
Exclusions are applied deterministically in the order returned by the backend. If exclude_self=True and no exact 0.0 distance candidate is returned in the first position, no additional self-exclusion is applied.
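For vector queries there is no query id to match, so self-exclusion relies on the distance of the leading candidate. A minimal sketch of that rule, using a hypothetical helper drop_self_candidate and plain lists in place of NumPy arrays:

```python
def drop_self_candidate(indices, distances):
    """Drop the leading candidate only when its distance is exactly 0.0.

    If the first candidate is not an exact match, the inputs are returned
    unchanged and no further self-exclusion is attempted.
    """
    if indices and distances[0] == 0.0:
        return indices[1:], distances[1:]
    return indices, distances
```

Because the comparison is exact, a query vector that was perturbed or re-normalized before the call may not trigger the exclusion.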
- query_vectors_by_item(item, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#
Query neighbor vectors by stored item id.
This is a convenience wrapper over query_by_item that materializes vectors using the backend’s get_item_vector.
- Parameters:
- item, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids
See query_by_item.
- ensure_all_finite, copy
See query_by_vector.
- dtype : numpy dtype, default=numpy.float32
Output dtype for the returned vectors.
- output_type : {‘item’, ‘vector’}, default=‘vector’
If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.
- Returns:
- vectors : numpy.ndarray of shape (n_neighbors, f)
Neighbor vectors.
- (vectors, distances) : tuple
Returned when include_distances=True.
See also
query_vectors_by_vector : Vector query returning vectors (or ids).
- query_vectors_by_vector(vector, n_neighbors, *, search_k=-1, include_distances=False, exclude_self=False, exclude_item_ids=None, ensure_all_finite=True, copy=False, dtype=<class 'numpy.float32'>, output_type='vector')[source]#
Query neighbor vectors by an explicit vector.
Convenience wrapper over query_by_vector. By default it returns vectors; set output_type='item' to return neighbor ids instead.
- Parameters:
- vector, n_neighbors, search_k, include_distances, exclude_self, exclude_item_ids
See query_by_vector.
- ensure_all_finite, copy
See query_by_vector.
- dtype : numpy dtype, default=numpy.float32
Output dtype for the returned vectors.
- output_type : {‘item’, ‘vector’}, default=‘vector’
If ‘vector’, return neighbor vectors. If ‘item’, return neighbor ids.
- Returns:
- neighbors : numpy.ndarray
If output_type='vector', an array of shape (n_neighbors, f). If output_type='item', an array of shape (n_neighbors,).
- (neighbors, distances) : tuple
Returned when include_distances=True.
See also
query_vectors_by_item : Item id query returning vectors.
query_by_vector : Per-query id interface.
- random_state#
Alias of seed (scikit-learn convention).
- rebuild(metric=None, *, on_disk_path=None, n_trees=None, n_jobs=-1) -> Annoy#
Return a new Annoy index rebuilt from the current index contents.
This helper is intended for deterministic, explicit rebuilds when changing structural constraints such as the metric (Annoy uses metric-specific C++ index types). The source index is not mutated.
- Parameters:
- metric : {‘angular’, ‘euclidean’, ‘manhattan’, ‘dot’, ‘hamming’} or None, optional
Metric for the new index. If None, reuse the current metric.
- on_disk_path : path-like or None, optional
Optional on-disk build path for the new index.
Safety: the source object’s on_disk_path is never carried over implicitly. If on_disk_path is provided and is string-equal to the source’s configured path, it is ignored to avoid accidental overwrite/truncation hazards.
- n_trees : int or None, optional
If provided, build the new index with this number of trees (or -1 for Annoy’s internal auto mode). If None, reuse the source’s tree count only when the source index is already built; otherwise do not build.
- n_jobs : int, optional, default=-1
Number of threads to use while building (-1 means “auto”).
- Returns:
- Annoy
A new index rebuilt from the current index contents. The source index is not mutated.
See also
build : Build trees after adding items (on-disk backed).
on_disk_build : Configure on-disk build mode.
fit : Build the index from X (preferred if you already have X available).
get_params : Read constructor parameters.
set_params : Update estimator parameters (use with fit(X) when refitting from data).
serialize, deserialize : Persist / restore indexes; canonical restores rebuild deterministically.
__sklearn_clone__ : Unfitted clone hook (no fitted state).
Notes
rebuild(metric=...) is deterministic and preserves item ids (0..n_items-1) by copying item vectors from the current fitted index into a new instance and rebuilding trees.
Use rebuild() when you want to change metric while reusing the already-stored vectors (e.g., you do not want to re-read or re-materialize X, or you loaded an index from disk and only have access to its stored vectors).
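The on_disk_path safety rule for rebuild can be pictured as a small path-resolution step. The sketch below is an assumption-laden illustration: resolve_rebuild_path is a hypothetical helper, and returning None stands for "build the new index in memory".

```python
import os


def resolve_rebuild_path(source_path, requested_path):
    """Return the on-disk path the rebuilt index may use, or None."""
    if requested_path is None:
        # The source's path is never inherited implicitly.
        return None
    requested = os.fspath(requested_path)
    if source_path is not None and requested == os.fspath(source_path):
        # String-equal to the source's path: ignored, because enabling an
        # on-disk build there would truncate the file backing the source.
        return None
    return requested
```

Note that the comparison is on the path strings themselves, so two different spellings of the same file (for example a relative and an absolute path) would not be detected by this check.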
- repr_info(include_n_items=True, include_n_trees=True, include_memory=None) -> str#
Return a dict-like string representation with optional extra fields.
Unlike __repr__, this method can include additional fields on demand. Note that include_memory=True may be expensive for large indexes. Memory is calculated after build.
- save(fn, prefault=None)#
Persist the index to a binary file on disk.
- Parameters:
- fn : str
Path to the output file. Existing files will be overwritten.
- prefault : bool or None, optional, default=None
If True, aggressively fault pages into memory during save. If None, use the stored prefault value. Primarily useful on some platforms for very large indexes.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- IOError
If the file cannot be written.
- RuntimeError
If the index is not initialized or save fails.
See also
load : Load an index from disk.
on_disk_build : Configure on-disk build mode.
serialize : Snapshot to bytes for in-memory persistence.
deserialize : Restore an index from a serialized byte string.
Notes
The output file will be overwritten if it already exists. Use prefault=None to fall back to the stored prefault setting.
- save_bundle(manifest_filename='manifest.json', index_filename='index.ann', *, prefault=None)[source]#
Save a directory bundle containing metadata + the index file.
The bundle contains:
- manifest.json: metadata payload produced by to_json
- index.ann: Annoy index produced by save_index
- Parameters:
- manifest_filename
Filename for the metadata manifest inside the directory.
- index_filename
Filename for the Annoy index inside the directory.
- prefault
Forwarded to save_index.
- save_index(path, *, prefault=None)[source]#
Persist the Annoy index to disk.
- Parameters:
- path : str or os.PathLike
Destination path for the Annoy index file.
- prefault
Forwarded to the backend. If None, the backend default is used.
- Raises:
- AttributeError
If the backend does not provide save(path, prefault=...).
- OSError
For filesystem-level failures.
- schema_version#
Sentinel value that marks the serialization/compatibility strategy.
This does not change the Annoy on-disk format, but it controls how the index is snapshotted in pickles.
- Returns:
- int
Current schema version marker.
Notes
0 or 1: pickle stores a portable-v1 snapshot (fast restore, ABI-checked).
2: pickle stores canonical-v1 (portable; restores by rebuilding deterministically).
>=3: pickle stores both portable and canonical; canonical is used as a fallback.
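The mapping from schema_version to pickle contents can be sketched as a small dispatch function. This is only an illustration: pickle_payloads is a hypothetical helper, and rejecting negative values is an assumption not stated in this reference.

```python
def pickle_payloads(schema_version):
    """Return the snapshot kinds a pickle would carry for schema_version."""
    if schema_version < 0:
        # Assumption: negative markers are treated as invalid.
        raise ValueError("schema_version must be non-negative")
    if schema_version <= 1:
        return ("portable-v1",)
    if schema_version == 2:
        return ("canonical-v1",)
    # >= 3: both snapshots are stored; canonical is the fallback on restore.
    return ("portable-v1", "canonical-v1")
```

Storing both snapshots costs extra space in the pickle but keeps the fast portable restore available whenever the ABI check passes.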
- seed#
Random seed override (scikit-learn compatible). None means use Annoy default seed.
- serialize(format=None) -> bytes#
Serialize the built in-memory index into a byte string.
- Parameters:
- format : {“native”, “portable”, “canonical”} or None, optional, default=None
Serialization format.
“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.
- Returns:
- data : bytes
Opaque binary blob containing the Annoy index.
- Raises:
- RuntimeError
If the index is not initialized or serialization fails.
- OverflowError
If the serialized payload is too large to fit in a Python bytes object.
See also
deserialize : Restore an index from a serialized byte string.
on_disk_build : Configure on-disk build mode.
Notes
“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.
“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.
- set_params(**params) -> Annoy#
Set estimator-style parameters (scikit-learn compatibility).
- Parameters:
- **params
Keyword parameters to set. Unknown keys raise ValueError.
- Returns:
- Annoy
This instance (self), enabling method chaining.
- Raises:
- ValueError
If an unknown parameter name is provided.
- TypeError
If parameter names are not strings or types are invalid.
See also
get_params : Return estimator-style parameters.
Notes
Changing structural parameters (notably metric) on an already initialized index resets the index deterministically (drops all items, trees, and y). Refit/rebuild is required before querying.
This behavior matches scikit-learn expectations: set_params may be called at any time, but parameter changes that affect learned state invalidate the fitted model.
- set_seed(seed=None)#
Set the random seed used for tree construction.
- Parameters:
- seed : int or None, optional, default=None
Non-negative integer seed. If called before the index is constructed, the seed is stored and applied when the C++ index is created.
If omitted (or None, NULL), the seed is set to Annoy’s default seed.
If 0, any pending override is cleared and the seed is reset to Annoy’s core default seed (a UserWarning is emitted).
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
seed : Parameter attribute (int | None).
Notes
Annoy is deterministic by default. Setting an explicit seed is useful for reproducible experiments and debugging.
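The seed rules above (None and 0 both fall back to the default seed, with 0 also warning) can be sketched as a normalization step. normalize_seed is a hypothetical helper; returning None to mean "use the backend default" and rejecting non-int values with TypeError are assumptions of this sketch.

```python
import warnings


def normalize_seed(seed=None):
    """Return the seed override to store, or None for the default seed."""
    if seed is None:
        return None
    # Assumption: bools and non-integers are rejected rather than coerced.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int or None")
    if seed < 0:
        raise ValueError("seed must be non-negative")
    if seed == 0:
        # 0 clears any pending override and warns, as described above.
        warnings.warn(
            "seed=0 resets to the default seed", UserWarning, stacklevel=2
        )
        return None
    return seed
```

Any positive integer is kept as the override; only None and 0 map back to the default.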
- set_verbose(level=1)#
Set the verbosity level (callable setter).
This method exists to preserve a callable interface while keeping the parameter name verbose available as an attribute for scikit-learn compatibility.
- Parameters:
- level : int, optional, default=1
Verbosity level. Values are clamped to the range [-2, 2]. level >= 1 enables Annoy’s verbose logging; level <= 0 disables it. Logging levels are inspired by gradient-boosting libraries:
<= 0: quiet (warnings only)
1: info (Annoy’s verbose=True)
>= 2: debug (currently same as info, reserved for future use)
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
verbose : Parameter attribute (int | None).
set_verbosity : Alias of set_verbose.
get_params, set_params : Estimator parameter API.
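The clamping rule for set_verbose can be sketched as follows. resolve_verbosity is a hypothetical helper that returns both the clamped level and the boolean that would be handed to Annoy's own verbose switch.

```python
def resolve_verbosity(level=1):
    """Clamp level to [-2, 2] and derive Annoy's boolean verbose flag."""
    clamped = max(-2, min(2, int(level)))
    # Levels 1 and 2 both enable Annoy's logging; 0 and below disable it.
    return clamped, clamped >= 1
```

Under this reading, levels 1 and 2 currently behave the same at the backend, and the distinction only matters for the stored attribute value.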
- set_verbosity(level=1)#
Alias of set_verbose.
See also
verbose : Parameter attribute (int | None).
set_verbose : Set the verbosity level (callable setter).
- to_bytes(format=None)[source]#
Serialize the built index to bytes (backend serialize).
- Parameters:
- format : {“native”, “portable”, “canonical”} or None, optional, default=None
Serialization format. If None, "canonical" is used.
“native” (legacy): raw Annoy memory snapshot. Fastest, but only compatible when the ABI matches exactly.
“portable”: prepend a small compatibility header (version, endianness, sizeof checks, metric, f) so deserialization fails loudly on mismatches.
“canonical”: rebuildable wire format storing item vectors + build parameters. Portable across ABIs (within IEEE-754 float32) and restores by rebuilding trees deterministically.
- Returns:
- data
Serialized index bytes.
- Raises:
- AttributeError
If the backend does not provide serialize.
- RuntimeError
If serialization fails.
- TypeError
If the backend returns non-bytes-like data.
Notes
“Portable” blobs are the native snapshot with additional compatibility guards. They are not a cross-architecture wire format.
“Canonical” blobs trade load time for portability: deserialization rebuilds the index with n_jobs=1 for deterministic reconstruction.
- to_json(path=None, *, indent=2, sort_keys=True, ensure_ascii=False, include_info=True, strict=True)[source]#
Serialize to_metadata to JSON.
- Parameters:
- path
If provided, write the JSON to this path atomically.
- indent
Indentation level passed to json.dumps.
- sort_keys
If True, sort keys for stable output.
- ensure_ascii
If True, escape non-ASCII characters.
- include_info, strict
Forwarded to to_metadata.
- Returns:
- json_str
JSON representation of the metadata.
- Raises:
- TypeError
If the exported metadata contains non-JSON-serializable values.
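Writing the JSON "atomically" usually means writing to a temporary file and renaming it over the target. The sketch below shows one common way to do that with the standard library; write_json_atomic is a hypothetical helper and is not claimed to be how to_json is implemented.

```python
import json
import os
import tempfile


def write_json_atomic(payload, path, *, indent=2, sort_keys=True,
                      ensure_ascii=False):
    """Serialize payload and replace path in a single rename step."""
    # json.dumps raises TypeError for values that are not JSON-serializable.
    text = json.dumps(payload, indent=indent, sort_keys=sort_keys,
                      ensure_ascii=ensure_ascii)
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return text
```

Readers of path then see either the old file or the complete new one, never a partially written manifest.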
- to_metadata(*, include_info=True, strict=True)[source]#
Export a serializable metadata payload.
- Parameters:
- include_info
If True, include an info() mapping when available.
- strict
If True, failures in the optional info() call propagate as exceptions.
- Returns:
- metadata
A JSON/YAML-serializable mapping containing configuration parameters and optional info.
- Raises:
- RuntimeError
If _META_SCHEMA_VERSION is missing on the concrete class.
- TypeError
If get_params does not return a mapping.
- AttributeError
If neither the instance nor the backend implements get_params.
- TypeError
If a persistence knob (e.g., pickle_mode) is not JSON/YAML-serializable.
- Return type:
IndexMetadata
- to_numpy(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#
Export vectors to a dense NumPy array.
See also
get_item_vectors : Dense export with optional id output.
iter_item_vectors : Streaming export.
to_scipy_csr : Export as SciPy CSR.
to_pandas : Export as pandas DataFrame.
Notes
This is an alias of get_item_vectors with return_ids=False.
- to_pandas(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, id_location='index', id_name='id', columns=None, validate_vector_len=True)[source]#
Export vectors to a pandas DataFrame.
- Parameters:
- ids, start, stop, n_rows
Selection controls. See get_item_vectors.
- dtype : numpy dtype, default=numpy.float32
Output dtype.
- id_location : {‘index’, ‘column’, ‘both’, ‘none’}, default=‘index’
Where to place ids in the output.
- id_name : str, default=‘id’
Name used for the id column / index.
- columns : sequence of str, optional
Column names for vector dimensions. If None, uses feature_names_in_ when present and length matches f; otherwise uses feature_0..feature_{f-1}.
- validate_vector_len : bool, default=True
If True, verify every fetched vector has length f.
- Returns:
- df : pandas.DataFrame
DataFrame with shape (n_rows, f) plus optional id metadata.
- Raises:
- ImportError
If pandas is not installed.
- ValueError
If id_location is invalid or columns length mismatches f.
See also
to_numpy : Dense NumPy export.
to_scipy_csr : Export as SciPy CSR.
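The column-naming fallback described for the columns parameter can be sketched without pandas. resolve_columns is a hypothetical helper; feature_names_in stands in for the estimator's feature_names_in_ attribute.

```python
def resolve_columns(f, columns=None, feature_names_in=None):
    """Pick column names for an (n_rows, f) vector export."""
    if columns is not None:
        columns = list(columns)
        if len(columns) != f:
            raise ValueError(
                f"columns has length {len(columns)}, expected {f}"
            )
        return columns
    # Stored feature names are used only when their length matches f.
    if feature_names_in is not None and len(feature_names_in) == f:
        return list(feature_names_in)
    return [f"feature_{i}" for i in range(f)]
```

Explicit columns always win, and a length mismatch there is an error rather than a silent fallback.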
- to_scipy_csr(ids=None, *, dtype=<class 'numpy.float32'>, start=0, stop=None, n_rows=None, validate_vector_len=True)[source]#
Export vectors as a SciPy CSR matrix.
- to_yaml(path=None, *, include_info=True, strict=True)[source]#
Serialize to_metadata to YAML (requires PyYAML).
- transform(X, *, n_neighbors=5, search_k=-1, include_distances=False, return_labels=False, y_fill_value=None, input_type='vector', output_type='vector', exclude_self=False, exclude_items=None, missing_value=None)#
Transform queries into nearest-neighbor results (ids or vectors; optional distances / labels).
- Parameters:
- X : array-like
Query inputs. The expected shape/type depends on input_type:
input_type=‘item’: X must be a 1D sequence of item ids.
input_type=‘vector’: X must be a 2D array-like of shape (n_queries, f).
- n_neighbors : int or None, default=5
Number of neighbors to retrieve for each query. For backwards compatibility this keyword is accepted, but it must match the estimator parameter n_neighbors (STRICT schema).
- search_k : int, default=-1
Search parameter passed to Annoy (-1 uses Annoy’s default).
- include_distances : bool, default=False
If True, also return per-neighbor distances.
- return_labels : bool, default=False
If True, also return per-neighbor labels resolved from y (as set via fit).
- y_fill_value : object, default=None
Value used when y is unset or missing an entry for a neighbor id.
- input_type : {‘vector’, ‘item’}, default=‘vector’
Controls how X is interpreted.
- output_type : {‘vector’, ‘item’}, default=‘vector’
Controls what neighbors are returned.
output_type=‘item’: return neighbor ids.
output_type=‘vector’: return neighbor vectors.
- exclude_self : bool, default=False
If True, exclude the query item id from results. Only valid when input_type=‘item’.
- exclude_items : sequence of int or None, default=None
Explicit neighbor ids to exclude from results.
- missing_value : float or None, default=None
If not None, imputes missing entries in X (None values in dense rows; missing keys / None values in dict rows). If None, missing entries raise.
- Returns:
- neighbors : list
Neighbor results for each query.
output_type=‘item’: list of list of int
output_type=‘vector’: list of list of list of float
- (neighbors, distances) : tuple
Returned when include_distances=True.
- (neighbors, labels) : tuple
Returned when return_labels=True.
- (neighbors, distances, labels) : tuple
Returned when include_distances=True and return_labels=True.
See also
get_nns_by_item : Neighbor search by item id.
get_nns_by_vector : Neighbor search by query vector.
fit : Build the index from X (preferred if you already have X available).
fit_transform : Estimator-style APIs.
Notes
Excluding self is performed by matching neighbor ids to the query id (not by checking distance values).
For input_type=‘vector’, exclude_self=True is an error; use exclude_items for explicit, deterministic filtering.
If exclusions prevent returning exactly n_neighbors results, this method raises ValueError.
Examples
Item queries (exclude the query id itself):
>>> idx.transform([10, 20], input_type='item', output_type='item', n_neighbors=5, exclude_self=True)
Vector queries (exclude explicit ids):
>>> idx.transform(X_query, input_type='vector', output_type='item', n_neighbors=5, exclude_items=[10, 20])
Return neighbor vectors:
>>> idx.transform([10], input_type='item', output_type='vector', n_neighbors=5, exclude_self=True)
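When return_labels=True, each neighbor id is resolved to a label, with y_fill_value covering ids that have no label. A minimal sketch of that lookup, assuming y is a sequence indexed by item id (this reference does not spell out the storage layout) and using a hypothetical helper resolve_labels:

```python
def resolve_labels(neighbor_ids, y=None, y_fill_value=None):
    """Map each neighbor id to its label, falling back to y_fill_value."""
    labels = []
    for row in neighbor_ids:
        row_labels = []
        for item_id in row:
            # Assumption: labels are stored positionally, one per item id.
            if y is not None and 0 <= item_id < len(y):
                row_labels.append(y[item_id])
            else:
                row_labels.append(y_fill_value)
        labels.append(row_labels)
    return labels
```

The result has the same nested shape as the neighbor id lists, one label per neighbor.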
- unbuild()#
Discard the current forest, allowing new items to be added.
- Returns:
- Annoy
This instance (self), enabling method chaining.
Notes
After calling unbuild, you must call build again before running nearest-neighbour queries.
- unload()#
Unmap any memory-mapped file backing this index.
- Returns:
- Annoy
This instance (self), enabling method chaining.
See also
load : Memory-map an on-disk index into this object.
on_disk_build : Configure on-disk build mode.
Notes
This releases OS-level resources associated with the mmap, but keeps the Python object alive.
- verbose#
Verbosity level in [-2, 2] or None (unset). Callable setter: set_verbose().
- y#
Labels / targets associated with the index items.
Notes
If provided to fit(X, y), labels are stored here after a successful build. You may also set this property manually. When possible, the setter enforces that len(y) matches the current number of items (n_items).