Index#
- class scikitplot.annoy._annoy.Index(int f: int | None = None, str metric: str | None = None, int n_neighbors: int = 5, *, str on_disk_path: str | None = None, bool prefault: bool = False, int seed: int | None = None, int verbose: int | None = None, int schema_version: int = 0, str dtype: str = 'float32', str index_dtype: str = 'int32', str wrapper_dtype: str = 'uint64', str random_dtype: str = 'uint64', int n_jobs: int | None = None, **kwargs)#
Annoy Approximate Nearest Neighbors Index.
This is a Cython-powered Python wrapper around the Annoy C++ library.
- Parameters:
- fint or None, default=None
Embedding dimension. If 0 or None, the dimension is inferred from the first vector added. Must be positive for immediate index construction.
- metricstr or None, default=None
Distance metric. Supported values:
* "angular", "cosine" → cosine-like distance
* "euclidean", "l2", "lstsq" → L2 distance
* "manhattan", "l1", "cityblock", "taxicab" → L1 distance
* "dot", "@", ".", "dotproduct", "inner", "innerproduct" → negative dot product
* "hamming" → bitwise Hamming distance
If None and f > 0, defaults to "angular" with a FutureWarning.
- n_treesint, default=-1
Number of trees to build. If -1, auto-selects based on dimension. More trees = better accuracy but slower queries and more memory.
- n_neighborsint, default=5
Default number of neighbors for queries (estimator parameter).
- on_disk_pathstr or None, default=None
Path for on-disk building. If provided, enables memory-efficient building for large indices.
- prefaultbool, default=False
Whether to prefault pages when loading (may improve query latency).
- seedint or None, default=None
Random seed for tree construction. If None, uses Annoy’s default. Value 0 is treated as “use default” and emits a UserWarning.
- verboseint or None, default=None
Verbosity level (clamped to [-2, 2]). Level >= 1 enables logging.
- schema_versionint, default=0
Pickle schema version marker (does not affect on-disk format).
- dtypestr, default="float32"
Data type for embeddings. Supported values:
* "float16" / "half" / "fp16" → float16_t (16-bit half precision)
* "float32" / "single" / "fp32" → float (32-bit single precision, default)
* "float64" / "double" / "fp64" → double (64-bit double precision)
* "float128" / "quad" / "fp128" → float128_t (128-bit or long double)
All types are accessed via the double-precision widened bridge. float16 values are narrowed on add_item; float128 gains no input precision but benefits from higher-precision internal arithmetic on GCC/Clang.
- index_dtypestr, default="int32"
Index identifier type. Supported values:
* "int8" → int8_t (max 127 items)
* "uint8" → uint8_t (max 255 items)
* "int16" → int16_t (max 32,767 items)
* "uint16" → uint16_t (max 65,535 items)
* "int32" → int32_t (max 2,147,483,647 items, default)
* "uint32" → uint32_t (max 4,294,967,295 items)
* "int64" → int64_t (max 9,223,372,036,854,775,807 items)
* "uint64" → uint64_t (max 18,446,744,073,709,551,615 items)
- wrapper_dtypestr, default="uint64"
Internal wrapper type (e.g., for Hamming packing). Possible future values: "bool", "uint8", "uint32", etc.
- random_dtypestr, default="uint64"
Random seed type. Currently only "uint64" is supported.
- n_jobsint or None, default=None
Number of threads. If -1, uses all available cores.
- **kwargs
Reserved for future extensibility.
- Attributes:
- fint
Index.f: int
- metricstr or None
Index.metric: Optional[str]
- ptrAnnoyIndexInterface*
Pointer to C++ index (NULL if not constructed).
- # State Indicators (Internal)
- _f_validbool
True if f has been set (> 0)
- _metric_validbool
True if metric has been configured
- _index_constructedbool
True if C++ index exists (ptr != NULL)
Notes
A 32-bit signed integer (4 bytes) can store values from −2**31 to 2**31−1, roughly ±2 billion.
A 64-bit signed integer (8 bytes) can store values from −2**63 to 2**63−1, roughly ±9 quintillion.
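The ranges above determine how many item IDs each index_dtype can address. A minimal sketch of that arithmetic in plain Python follows; the helper name max_item_id and the lookup table are hypothetical and not part of the library.

```python
# Hypothetical helper (not part of the library): the largest item ID
# representable by each integer type accepted for `index_dtype`.
INDEX_DTYPE_BITS = {
    "int8": 8, "uint8": 8,
    "int16": 16, "uint16": 16,
    "int32": 32, "uint32": 32,
    "int64": 64, "uint64": 64,
}


def max_item_id(index_dtype: str) -> int:
    """Return the maximum value of the named integer type."""
    bits = INDEX_DTYPE_BITS[index_dtype]
    if index_dtype.startswith("u"):
        return 2**bits - 1        # unsigned: all bits hold magnitude
    return 2 ** (bits - 1) - 1    # signed: one bit is the sign


if __name__ == "__main__":
    for name in INDEX_DTYPE_BITS:
        print(f"{name:>6}: {max_item_id(name):,}")
```

The printed values can be compared against the maxima listed for the index_dtype parameter.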
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.add_item(1, [0.2] * 128)
>>> index.build(n_trees=10)
>>> neighbors, distances = index.get_nns_by_item(0, n=5, include_distances=True)
Set dtype:
>>> # Standard usage (float32)
>>> index = Index(f=128, metric='angular', dtype='float32')
>>>
>>> # High precision (float64)
>>> index = Index(f=128, metric='euclidean', dtype='float64')
>>>
>>> # Half precision (float16) - future
>>> # index = Index(f=128, metric='angular', dtype='float16')
- add_item(self, item, vector) None#
Add a vector to the index.
- Parameters:
- itemint
Non-negative item identifier. For index_dtype='int32' the maximum valid ID is 2**31 - 1 = 2_147_483_647. For index_dtype='int64' the maximum is 2**63 - 1.
- vectorsequence
Embedding vector of length f
- Raises:
- IndexError
If item is negative
- OverflowError
If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).
- RuntimeError
If index is not constructed or already built
- ValueError
If vector dimension doesn’t match f
- Return type:
None
Notes
Must be called before build()
Item IDs need not be contiguous
After build(), call unbuild() to add more items
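The ID checks listed under Raises can be illustrated with a small stand-alone function. This is only a sketch of the documented behaviour; check_item_id and its max_id argument are hypothetical names, and the library's actual validation may differ in detail.

```python
# Hypothetical stand-in for the documented ID checks of add_item():
# negative IDs raise IndexError, IDs above the dtype maximum raise
# OverflowError. Not the library's implementation.
def check_item_id(item: int, max_id: int = 2**31 - 1) -> int:
    """Validate an item ID against a maximum (default: int32 maximum)."""
    if item < 0:
        raise IndexError(f"item must be non-negative, got {item}")
    if item > max_id:
        raise OverflowError(f"item {item} exceeds maximum ID {max_id}")
    return item


if __name__ == "__main__":
    print(check_item_id(0))        # smallest valid ID
    print(check_item_id(10_000))   # IDs need not be contiguous
    try:
        check_item_id(2**31)       # one past the int32 maximum
    except OverflowError as exc:
        print("rejected:", exc)
```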
- build(self, int n_trees=-1, n_jobs=None) None#
Build the search forest (thread-safe, releases GIL).
- Parameters:
- n_treesint, default=-1
Number of trees to build. If -1, auto-selects based on dimension. More trees = better accuracy but slower queries and more memory.
- n_jobsint or None, default=None
Number of threads. If -1, uses all available cores.
- Raises:
- RuntimeError
If index is not constructed or no items added
- Return type:
None
Notes
Index becomes read-only after build()
Auto n_trees formula: max(10, 2*f)
Call unbuild() to add more items
Releases GIL during C++ build operation
Allows concurrent Python threads to run
The C++ build itself is multi-threaded (n_jobs)
Examples
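>>> # Separate indices can be built concurrently, one per thread.
>>> # `indices` is assumed to be a list of populated, unbuilt Index objects.
>>> from concurrent.futures import ThreadPoolExecutor
>>> def worker(index):
...     index.build(n_trees=10)
>>> with ThreadPoolExecutor(max_workers=4) as executor:
...     futures = [executor.submit(worker, index) for index in indices]
...     for future in futures:
...         future.result()  # re-raises any error from build()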
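The automatic tree count mentioned in the Notes, max(10, 2*f), can be written out as follows. The function names are hypothetical; only the formula comes from the Notes.

```python
# Sketch of the documented auto-selection rule for n_trees.
# `auto_n_trees` and `resolve_n_trees` are illustrative names only.
def auto_n_trees(f: int) -> int:
    """Tree count chosen when n_trees == -1, per the formula max(10, 2*f)."""
    return max(10, 2 * f)


def resolve_n_trees(n_trees: int, f: int) -> int:
    """Return the explicit n_trees, or the automatic value if n_trees is -1."""
    return auto_n_trees(f) if n_trees == -1 else n_trees


if __name__ == "__main__":
    for dim in (3, 5, 128):
        print(dim, resolve_n_trees(-1, dim))
```

For low dimensions the floor of 10 applies; from f = 6 upward the count grows as 2*f.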
- clone(self, **override_params) Self#
Create a copy of the index with optional parameter overrides.
- Parameters:
- **override_paramsdict
Parameters to override in the clone
- Returns:
- indexIndex
New index with same parameters (but no data)
- Return type:
Self
Examples
>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index2 = index1.clone(seed=123)  # Same f and metric, different seed
- classmethod deserialize(cls, dict data: Dict[str, Any]) Self#
Deserialize from dictionary.
- Parameters:
- datadict
Serialized state from serialize()
- Returns:
- indexIndex
Restored index instance
- Raises:
- TypeError
If data is not a dict
- ValueError
If data format is invalid
- Return type:
Self
Examples
>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> json_str = json.dumps(index.serialize(), default=str)
>>> data = json.loads(json_str)
>>> restored = Index.deserialize(data)
- f#
int
Embedding dimension.
- Returns:
- fint
Number of dimensions (0 means “unset / lazy”).
Notes
Immutable after index construction
Setting to 0 after construction raises ValueError
- classmethod from_dict(cls, dict data: Dict[str, Any]) Self#
Alias for deserialize().
- get_distance(self, i, j)#
Compute distance between two stored items.
- Parameters:
- i, jint
Item IDs (must be < n_items). For index_dtype='int32' max is 2**31-1.
- Returns:
- distancefloat
Distance according to index metric
- Raises:
- IndexError
If i or j is negative or >= n_items
- OverflowError
If i or j exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).
- RuntimeError
If index not constructed
Notes
Does not require built index
For Hamming metric, distance is clipped to [0, f]
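For orientation, the metrics named in the constructor can be reproduced in plain Python. This is a sketch under assumptions: the angular form sqrt(2 - 2*cos) follows upstream Annoy's convention and is assumed to apply here, the zero-vector fallback is an illustrative choice, and none of the function names belong to the library.

```python
import math


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def angular(u, v):
    """Cosine-like distance sqrt(2 - 2*cos(u, v)) (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Cosine is undefined for a zero vector; this sketch treats the
        # pair as orthogonal (an assumption, not documented behaviour).
        return math.sqrt(2.0)
    cos = dot / (norm_u * norm_v)
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))


def negative_dot(u, v):
    """Negative dot product, as described for the "dot" metric."""
    return -sum(a * b for a, b in zip(u, v))


def hamming(u, v):
    """Count of differing positions in two 0/1 vectors, clipped to [0, f]."""
    f = len(u)
    mismatches = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    return min(max(mismatches, 0), f)


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(manhattan([0, 0], [3, 4]))
    print(angular([1, 0], [0, 1]))
    print(negative_dot([1, 2], [3, 4]))
    print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```

Values returned by get_distance may differ slightly because of the configured dtype.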
- get_feature_names_out(self, input_features=None)#
Get output feature names for the transformer-style API.
Output feature names are independent of input feature names and follow a stable schema based on n_neighbors: ('neighbor_0', 'neighbor_1', ..., 'neighbor_{k-1}').
- Parameters:
- input_featuressequence of str or None, default=None
If provided, validated deterministically against the fitted input feature names (if feature_names_in_ was set during fit()) and against the expected input dimensionality.
- Returns:
- feature_namestuple of str
Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.
- Raises:
- AttributeError
If called before fit/build.
- ValueError
If input_features is provided but does not match feature_names_in_.
- TypeError
If input_features elements are not strings.
Examples
>>> idx = Index(3, metric='angular').fit([[1,0,0],[0,1,0]])
>>> idx.get_feature_names_out()
('neighbor_0', 'neighbor_1', 'neighbor_2', 'neighbor_3', 'neighbor_4')
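The naming schema described above depends only on n_neighbors, so it can be reproduced without an index. A minimal sketch, with neighbor_feature_names as a hypothetical helper name:

```python
# Sketch of the documented output-name schema; not the library's code.
def neighbor_feature_names(n_neighbors: int) -> tuple:
    """Return ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    return tuple(f"neighbor_{i}" for i in range(n_neighbors))


if __name__ == "__main__":
    print(neighbor_feature_names(5))
```

With the default n_neighbors=5 this yields the same five names as the example above.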
- get_item(self, item)#
Retrieve a stored embedding vector.
- Parameters:
- itemint
Item ID (must be < n_items). For index_dtype='int32' max is 2**31-1.
- Returns:
- vectorlist[float]
Embedding vector of length f
- Raises:
- IndexError
If item is negative or >= n_items
- OverflowError
If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).
- RuntimeError
If index not constructed
- get_n_items(self) int#
Return number of items in the index.
- Returns:
- n_itemsint
Number of items added (may be sparse)
- Return type:
int
- get_n_trees(self) int#
Return number of trees in the index.
- Returns:
- n_treesint
Number of trees (0 if not built)
- Return type:
int
- get_nns_by_item(self, item, int n, int search_k=-1, bool include_distances=False)#
Find nearest neighbors (thread-safe, releases GIL).
- Parameters:
- itemint
Query item ID. For index_dtype='int32' max is 2**31-1.
- nint
Number of neighbors to return
- search_kint, default=-1
Search effort. If -1, uses n_trees * n. Higher values = better accuracy but slower.
- include_distancesbool, default=False
If True, return (neighbors, distances) tuple
- Returns:
- neighborslist[int]
Item IDs of nearest neighbors
- distanceslist[float], optional
Distances to neighbors (only if include_distances=True)
- Raises:
- IndexError
If item >= n_items
- OverflowError
If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).
- ValueError
If n <= 0
- RuntimeError
If index not built
Notes
Releases GIL during query (true parallelism)
Multiple threads can query simultaneously
Near-linear speedup with thread count is possible, bounded by available cores and memory bandwidth
Examples
>>> # Parallel queries from multiple threads:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def query_worker(index, item_id):
...     return index.get_nns_by_item(item_id, n=10)
>>> with ThreadPoolExecutor(max_workers=8) as executor:
...     results = list(executor.map(
...         lambda i: query_worker(index, i),
...         range(1000)
...     ))
>>> # True parallelism - all 8 threads run concurrently!
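The default search effort described for search_k (n_trees * n when -1 is passed) can be spelled out as a small helper. The name effective_search_k is hypothetical; only the rule itself is taken from the parameter description.

```python
# Sketch of the documented default for search_k; illustrative only.
def effective_search_k(search_k: int, n_trees: int, n: int) -> int:
    """Return the explicit search_k, or n_trees * n if search_k is -1."""
    return n_trees * n if search_k == -1 else search_k


if __name__ == "__main__":
    # 10 trees and 5 requested neighbors give a default effort of 50.
    print(effective_search_k(-1, n_trees=10, n=5))
    # An explicit value overrides the default.
    print(effective_search_k(2000, n_trees=10, n=5))
```

Raising search_k above this default trades query time for accuracy, as noted in the parameter description.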
- get_nns_by_vector(self, vector, int n, int search_k=-1, bool include_distances=False)#
Query by vector (thread-safe, releases GIL).
- Parameters:
- vectorsequence
Query vector of length f
- nint
Number of neighbors to return
- search_kint, default=-1
Search effort. If -1, uses n_trees * n.
- include_distancesbool, default=False
If True, return (neighbors, distances) tuple
- Returns:
- neighborslist[int]
Item IDs of nearest neighbors
- distanceslist[float], optional
Distances to neighbors
- Raises:
- RuntimeError
If index not built
- ValueError
If n <= 0, or if vector length does not match index dimension f
- get_params(self, bool deep: bool = True) Dict[str, Any]#
Get parameters (sklearn-style).
- Parameters:
- deepbool, default=True
If True, include nested parameters (reserved for future use)
- Returns:
- paramsdict
Parameter dictionary with all configuration
- Return type:
Dict[str, Any]
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> params = index.get_params()
>>> print(params['f'])
128
>>> print(params['metric'])
angular
- get_state(self) Dict[str, Any]#
Get complete state dictionary.
- Returns:
- statedict
Complete index state including:
* Parameters (f, metric, etc.)
* Index data (if built)
* Configuration
- Return type:
Dict[str, Any]
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> state = index.get_state()
>>> print('f' in state)
True
>>> print('metric' in state)
True
- is_built(self) bool#
Check if index has been built.
- Returns:
- builtbool
True if build() has been called
- Return type:
bool
- is_empty(self) bool#
Check if index has no items.
- Returns:
- emptybool
True if no items added
- Return type:
bool
- load(self, filename, bool prefault=False) None#
Load index from disk file.
- Parameters:
- filenamestr
Input file path
- prefaultbool, default=False
Whether to prefault pages into memory
- Raises:
- RuntimeError
If dimensions don’t match
- IOError
If file cannot be read
- Return type:
None
Notes
Dimension f and metric must match the saved index
prefault=True may improve query latency at cost of load time
- metric#
Optional[str]
Distance metric name.
- Returns:
- metricstr or None
Canonical metric name, or None if not configured.
Notes
Immutable after index construction
Returns canonical name even if alias was used in constructor
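Alias resolution of the kind described in the Notes can be pictured as a lookup table built from the alias groups listed for the metric constructor parameter. This sketch assumes the first name in each group is the canonical one; that choice, the table, and canonical_metric are illustrative assumptions rather than the library's code.

```python
# Alias groups copied from the `metric` parameter description.
# ASSUMPTION: the first entry of each group is the canonical name.
ALIAS_GROUPS = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}

# Invert the groups into a flat alias -> canonical lookup.
CANONICAL = {
    alias: canonical
    for canonical, aliases in ALIAS_GROUPS.items()
    for alias in aliases
}


def canonical_metric(name: str) -> str:
    """Map a metric alias to its (assumed) canonical name."""
    try:
        return CANONICAL[name]
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None


if __name__ == "__main__":
    for alias in ("cosine", "l2", "taxicab", "@"):
        print(alias, "->", canonical_metric(alias))
```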
- n_neighbors#
int
Default number of neighbors for queries.
- on_disk_build(self, fn) Self#
Configure the index to build using an on-disk backing file.
Calling this method explicitly supersedes any on_disk_path set in the constructor. It is safe to call before any add_item() calls; the method ensures the C++ index is constructed first.
- Parameters:
- fnstr
Path to the backing file. The file is created or overwritten.
- Returns:
- selfIndex
This instance (for method chaining).
- Raises:
- TypeError
If fn is not a string.
- ValueError
If fn is empty.
- IOError
If the C++ on_disk_build call fails (bad path, permissions…).
- RuntimeError
If the index cannot be constructed (f or metric not set).
- Return type:
Self
Notes
After calling on_disk_build, the index is backed by the file during add_item/build. Data written to disk is not a finished index file until save() is called or build() completes the tree structure on disk.
For very large datasets that do not fit comfortably in RAM during construction, this is the recommended workflow.
Examples
>>> idx = Index(3, metric='angular').on_disk_build("test.annoy")
>>> for i, v in enumerate([[1,2,3],[4,5,6],[7,8,9]]):
...     idx.add_item(i, v)
>>> idx.build(n_trees=10)
- repr_info(self, bool include_n_items=True, bool include_n_trees=True, include_memory=None) str#
Rich dictionary-like string representation.
- Parameters:
- include_n_itemsbool, default=True
Include item count
- include_n_treesbool, default=True
Include tree count
- include_memorybool or None, default=None
Include memory usage estimate. If None, the estimate is included only if the index is built.
- Returns:
- repr_strstr
Dictionary-style representation
- Return type:
str
Examples
>>> print(index.repr_info())
Annoy(**{'f': 128, 'metric': 'angular', 'n_items': 1000, 'n_trees': 10})
- save(self, filename, bool prefault=False) None#
Save index to disk file.
- Parameters:
- filenamestr
Output file path
- prefaultbool, default=False
Whether to prefault pages during save
- Raises:
- RuntimeError
If index not built
- IOError
If file cannot be written
- Return type:
None
- serialize(self) Dict[str, Any]#
Serialize to JSON-compatible dictionary.
Examples
>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> data = index.serialize()
>>> json_str = json.dumps(data, default=str)  # handle bytes
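Note that json.dumps(..., default=str) stores any bytes value as its text repr, which json.loads returns as a plain string rather than bytes. If a lossless round trip is needed, one option is to base64-encode bytes before dumping. The sketch below uses a made-up state dictionary; the "data" key, the "__bytes__" marker, and both helper names are assumptions, and whether deserialize() expects this layout is not stated here.

```python
import base64
import json


def to_jsonable(obj):
    """Recursively replace bytes with a tagged base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return {"__bytes__": base64.b64encode(bytes(obj)).decode("ascii")}
    if isinstance(obj, dict):
        return {key: to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(value) for value in obj]
    return obj


def from_jsonable(obj):
    """Inverse of to_jsonable: decode tagged base64 strings back to bytes."""
    if isinstance(obj, dict):
        if set(obj) == {"__bytes__"}:
            return base64.b64decode(obj["__bytes__"])
        return {key: from_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [from_jsonable(value) for value in obj]
    return obj


# Hypothetical payload standing in for a serialized index state.
state = {"f": 128, "metric": "angular", "data": b"\x00\x01\xff"}

json_str = json.dumps(to_jsonable(state))
restored = from_jsonable(json.loads(json_str))

if __name__ == "__main__":
    print(restored == state)
```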
- set_params(self, **params) Self#
Set parameters (sklearn-style).
- Parameters:
- **paramsdict
Parameters to update
- Returns:
- selfIndex
Returns self for method chaining
- Raises:
- ValueError
If trying to set immutable parameters after construction
- Return type:
Self
Notes
Cannot modify f or metric after index construction
Can always modify n_neighbors, seed, verbose
Examples
>>> index = Index(f=128, metric='angular')
>>> index.set_params(n_neighbors=10, seed=42)
>>> index.add_item(0, [0.1] * 128)  # build() requires at least one item
>>> index.build()
- set_seed(self, seed) None#
Set random seed for index construction.
- Parameters:
- seedint
Non-negative integer in [0, 2**64 - 1]. 0 uses Annoy’s deterministic default seed and emits a UserWarning.
- Returns:
- selfIndex
- Raises:
- TypeError
If seed is not an integer.
- ValueError
If seed is negative or exceeds uint64_t range [0, 2**64 - 1].
- Return type:
None
Notes
Must be called before build() to take effect.
set_seed(R) is not on AnnoyIndexInterfaceBase; the widened set_seed_w(uint64_t) bridge is used for all concrete index types.
Seed 0 triggers Annoy’s deterministic default (Kiss64Random::default_seed).
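The seed rules above (integer type, range [0, 2**64 - 1], warning on 0) can be sketched as a stand-alone validator. validate_seed is a hypothetical name and the messages are illustrative; the library's own checks may differ, for example in how bool values are treated.

```python
import warnings

UINT64_MAX = 2**64 - 1


def validate_seed(seed):
    """Apply the documented seed rules and return the seed unchanged."""
    if not isinstance(seed, int):
        raise TypeError(f"seed must be an integer, got {type(seed).__name__}")
    if seed < 0 or seed > UINT64_MAX:
        raise ValueError(f"seed must be in [0, 2**64 - 1], got {seed}")
    if seed == 0:
        # Documented behaviour: 0 selects the default seed and warns.
        warnings.warn("seed=0 selects the default seed", UserWarning)
    return seed


if __name__ == "__main__":
    print(validate_seed(42))
    try:
        validate_seed(-1)
    except ValueError as exc:
        print("rejected:", exc)
```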
- set_state(self, dict state: Dict[str, Any]) None#
Restore state from dictionary.
- Parameters:
- statedict
State dictionary from get_state()
- Return type:
None
Examples
>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index1.add_item(0, [0.1] * 128)
>>> index1.build()
>>> state = index1.get_state()
>>>
>>> index2 = Index()
>>> index2.set_state(state)
>>> # index2 now has same data as index1
- set_verbose(self, bool v) None#
Enable/disable verbose logging.
- Parameters:
- vbool
True to enable verbose output
- Return type:
None
Gallery examples#
Approximate Nearest Neighbors with Annoy — A Hamlet Example