Index#
- class scikitplot.annoy._annoy.Index(f: int | None = None, metric: str | None = None, int n_neighbors: int = 5, *, on_disk_path: str | None = None, bool prefault: bool = False, seed: int | None = None, verbose: int | None = None, int schema_version: int = 0, str dtype: str = 'float32', str index_dtype: str = 'int32', str wrapper_dtype: str = 'uint64', str random_dtype: str = 'uint64', **kwargs)#
Annoy Approximate Nearest Neighbors Index.
This is a Cython-powered Python wrapper around the Annoy C++ library.
- Parameters:
- fint or None, default=None
Embedding dimension. If 0 or None, dimension is inferred from first vector added. Must be positive for immediate index construction.
- metricstr or None, default=None
Distance metric. Supported values:
* “angular”, “cosine” → cosine-like distance
* “euclidean”, “l2”, “lstsq” → L2 distance
* “manhattan”, “l1”, “cityblock”, “taxicab” → L1 distance
* “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct” → negative dot product
* “hamming” → bitwise Hamming distance
If None and f > 0, defaults to “angular” with FutureWarning.
- n_neighborsint, default=5
Default number of neighbors for queries (estimator parameter).
- on_disk_pathstr or None, default=None
Path for on-disk building. If provided, enables memory-efficient building for large indices.
- prefaultbool, default=False
Whether to prefault pages when loading (may improve query latency).
- seedint or None, default=None
Random seed for tree construction. If None, uses Annoy’s default. Value 0 is treated as “use default” and emits a UserWarning.
- verboseint or None, default=None
Verbosity level (clamped to [-2, 2]). Level >= 1 enables logging.
- schema_versionint, default=0
Pickle schema version marker (does not affect on-disk format).
- dtypestr, default=’float32’
Data type: float16, float32, float64, float80, float128
- index_dtypestr, default=’int32’
Index type: int32, int64
- wrapper_dtypestr, default=’uint64’
Wrapper type (for Hamming): uint32, uint64
- random_dtypestr, default=’uint64’
Random seed type
- **kwargs
Future extensibility
- Attributes:
- fint
Embedding dimension; see Index.f.
- metricstr or None
Distance metric name; see Index.metric.
- ptrAnnoyIndexInterface*
Pointer to C++ index (NULL if not constructed).
- # State Indicators (Internal)
- _f_validbool
True if f has been set (> 0)
- _metric_validbool
True if metric has been configured
- _index_constructedbool
True if C++ index exists (ptr != NULL)
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.add_item(1, [0.2] * 128)
>>> index.build(n_trees=10)
>>> neighbors, distances = index.get_nns_by_item(0, n=5, include_distances=True)
Set dtype:
>>> # Standard usage (float32)
>>> index = Index(f=128, metric='angular', dtype='float32')
>>>
>>> # High precision (float64)
>>> index = Index(f=128, metric='euclidean', dtype='float64')
>>>
>>> # Half precision (float16) - future
>>> # index = Index(f=128, metric='angular', dtype='float16')
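The alias groups listed for the metric parameter can be read as a lookup table. The following is a minimal pure-Python sketch of that idea, not the wrapper's actual code: the helper name `canonical_metric`, the choice of canonical names, and the case-insensitive matching are all assumptions made for illustration.

```python
# Hypothetical alias table built from the groups listed under `metric`.
# The canonical names used as keys are an assumption for illustration.
_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}

# Invert into alias -> canonical name for direct lookup.
_CANONICAL = {
    alias: name for name, aliases in _ALIASES.items() for alias in aliases
}


def canonical_metric(metric):
    """Map a metric alias to its canonical name (hypothetical helper)."""
    if metric is None:
        return None
    key = metric.strip().lower()  # case-insensitive matching is an assumption
    if key not in _CANONICAL:
        raise ValueError(f"unknown metric: {metric!r}")
    return _CANONICAL[key]


print(canonical_metric("cosine"))   # angular
print(canonical_metric("taxicab"))  # manhattan
print(canonical_metric("@"))        # dot
```

Resolving aliases once, at construction time, is what would let a property such as Index.metric report a single canonical name regardless of which alias was passed.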
- add_item(self, int item, vector) None#
Add a vector to the index.
- Parameters:
- itemint
Non-negative item identifier
- vectorsequence
Embedding vector of length f
- Raises:
- RuntimeError
If index is not constructed or already built
- ValueError
If vector dimension doesn’t match f
- IndexError
If item is negative
- Return type:
None
Notes
Must be called before build()
Item IDs need not be contiguous
After build(), call unbuild() to add more items
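The validation rules described for add_item (non-negative IDs, a fixed vector length, and a dimension inferred from the first vector when f is 0 or None) can be illustrated without the library. The class below is a toy stand-in written for this example; `LazyDimStore` is a hypothetical name and the code makes no claim about how the C++ index stores items.

```python
class LazyDimStore:
    """Toy model of the documented add_item checks (not the real Index)."""

    def __init__(self, f=None):
        self.f = f or 0      # 0 means "unset / lazy"
        self.items = {}      # item id -> vector; IDs need not be contiguous

    def add_item(self, item, vector):
        if item < 0:
            raise IndexError("item must be non-negative")
        vector = [float(x) for x in vector]
        if self.f == 0:
            if not vector:
                raise ValueError("cannot infer dimension from an empty vector")
            self.f = len(vector)  # infer f from the first vector added
        elif len(vector) != self.f:
            raise ValueError(f"expected length {self.f}, got {len(vector)}")
        self.items[item] = vector


store = LazyDimStore()
store.add_item(0, [0.1, 0.2, 0.3])
store.add_item(7, [0.4, 0.5, 0.6])  # gap between IDs 0 and 7 is allowed
print(store.f)                      # 3
```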
- build(self, int n_trees=-1, int n_jobs=-1) None#
Build the search forest (thread-safe, releases GIL).
- Parameters:
- n_treesint, default=-1
Number of trees to build. If -1, auto-selects based on dimension. More trees = better accuracy but slower queries and more memory.
- n_jobsint, default=-1
Number of threads. If -1, uses all available cores.
- Raises:
- RuntimeError
If index is not constructed or no items added
- Return type:
None
Notes
Index becomes read-only after build()
Auto n_trees formula: max(10, 2*f)
Call unbuild() to add more items
Releases GIL during C++ build operation
Allows concurrent Python threads to run
The C++ build itself is multi-threaded (n_jobs)
Examples
>>> # Each thread builds its own index; the GIL is released during the build:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def worker(index):
...     index.build(n_trees=10)
>>> # `indexes` is a list of separate Index objects with items already added
>>> with ThreadPoolExecutor(max_workers=4) as executor:
...     futures = [executor.submit(worker, index) for index in indexes]
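The automatic tree count mentioned in the notes for build() is a one-line rule. As a rough sketch, assuming the rule is exactly max(10, 2*f) as stated, a helper (here called `auto_n_trees`, a hypothetical name) shows what n_trees=-1 would resolve to for a few dimensions.

```python
def auto_n_trees(f):
    """Tree count under the documented rule max(10, 2*f) (hypothetical helper)."""
    if f <= 0:
        raise ValueError("f must be positive")
    return max(10, 2 * f)


for f in (3, 5, 128):
    print(f, auto_n_trees(f))
# 3 10
# 5 10
# 128 256
```

Because the automatic value grows linearly with f, passing n_trees explicitly may be preferable for high-dimensional data when memory or query latency is tight.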
- clone(self, **override_params) Self#
Create a copy of the index with optional parameter overrides.
- Parameters:
- **override_paramsdict
Parameters to override in the clone
- Returns:
- indexIndex
New index with same parameters (but no data)
- Return type:
Self
Examples
>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index2 = index1.clone(seed=123)  # Same f and metric, different seed
- classmethod deserialize(cls, dict data: Dict[str, Any]) Self#
Deserialize from dictionary.
- Parameters:
- datadict
Serialized state from serialize()
- Returns:
- indexIndex
Restored index instance
- Raises:
- TypeError
If data is not a dict
- ValueError
If data format is invalid
- Return type:
Self
Examples
>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> json_str = json.dumps(index.serialize(), default=str)
>>> data = json.loads(json_str)
>>> restored = Index.deserialize(data)
- f#
int
Embedding dimension.
- Returns:
- fint
Number of dimensions (0 means “unset / lazy”).
Notes
Immutable after index construction
Setting to 0 after construction raises ValueError
- classmethod from_dict(cls, dict data: Dict[str, Any]) Self#
Alias for deserialize().
- get_distance(self, int i, int j)#
Compute distance between two stored items.
- Parameters:
- i, jint
Item IDs (must be < n_items)
- Returns:
- distancefloat
Distance according to index metric
- Raises:
- RuntimeError
If index not constructed
- IndexError
If i or j is negative or >= n_items
Notes
Does not require built index
For Hamming metric, distance is clipped to [0, f]
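For checking values returned by get_distance(), plain-Python reference implementations of the listed metrics can be useful. The sketch below is an illustration under stated assumptions: the angular formula sqrt(2 * (1 - cos(u, v))) is the convention used by upstream Annoy and is assumed, not confirmed, to apply to this wrapper; the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def negative_dot(u, v):
    """Negative dot product, as the metric list describes for "dot"."""
    return -sum(a * b for a, b in zip(u, v))


def angular(u, v):
    """Assumed convention: sqrt(2 * (1 - cos(u, v))), as in upstream Annoy."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    if norm == 0.0:
        return math.sqrt(2.0)  # treat a zero vector as orthogonal (assumption)
    return math.sqrt(max(0.0, 2.0 - 2.0 * dot / norm))


def hamming(u, v):
    """Number of positions whose bits differ; always within [0, len(u)]."""
    return sum(1 for a, b in zip(u, v) if bool(a) != bool(b))


print(euclidean([0, 0], [3, 4]))            # 5.0
print(manhattan([0, 0], [3, 4]))            # 7
print(hamming([1, 0, 1, 1], [0, 0, 1, 0]))  # 2
```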
- get_item(self, int item)#
Retrieve a stored embedding vector.
- Parameters:
- itemint
Item ID (must be < n_items)
- Returns:
- vectorlist[float]
Embedding vector of length f
- Raises:
- RuntimeError
If index not constructed
- IndexError
If item is negative or >= n_items
- get_n_items(self) int#
Return number of items in the index.
- Returns:
- n_itemsint
Number of items added (may be sparse)
- Return type:
int
- get_n_trees(self) int#
Return number of trees in the index.
- Returns:
- n_treesint
Number of trees (0 if not built)
- Return type:
int
- get_nns_by_item(self, int item, int n, int search_k=-1, bool include_distances=False)#
Find nearest neighbors (thread-safe, releases GIL).
- Parameters:
- itemint
Query item ID (must be < n_items)
- nint
Number of neighbors to return
- search_kint, default=-1
Search effort. If -1, uses n_trees * n. Higher values = better accuracy but slower.
- include_distancesbool, default=False
If True, return (neighbors, distances) tuple
- Returns:
- neighborslist[int]
Item IDs of nearest neighbors
- distanceslist[float], optional
Distances to neighbors (only if include_distances=True)
- Raises:
- RuntimeError
If index not built
- IndexError
If item >= n_items
Notes
Releases GIL during query (true parallelism)
Multiple threads can query simultaneously
Throughput can scale roughly linearly with thread count, up to the number of available cores
Examples
>>> # Parallel queries from multiple threads:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def query_worker(index, item_id):
...     return index.get_nns_by_item(item_id, n=10)
>>> with ThreadPoolExecutor(max_workers=8) as executor:
...     results = list(executor.map(
...         lambda i: query_worker(index, i),
...         range(1000)
...     ))
>>> # True parallelism - all 8 threads run concurrently!
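The default search effort described for search_k is simple arithmetic. A small sketch, assuming the default is exactly n_trees * n as documented, makes the trade-off concrete; `effective_search_k` is a hypothetical helper, and rejecting other non-positive values is an assumption of this example rather than documented behavior.

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Resolve the documented default: -1 means n_trees * n."""
    if search_k == -1:
        return n_trees * n
    if search_k <= 0:
        # Assumption for this sketch: other non-positive values are invalid.
        raise ValueError("search_k must be positive or -1")
    return search_k


print(effective_search_k(n=10, n_trees=10))                 # 100
print(effective_search_k(n=10, n_trees=10, search_k=5000))  # 5000
```

Under that default, asking for more neighbors or building more trees raises the search effort automatically; an explicit search_k decouples the two.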
- get_nns_by_vector(self, vector, int n, int search_k=-1, bool include_distances=False)#
Query by vector (thread-safe, releases GIL).
- Parameters:
- vectorsequence
Query vector of length f
- nint
Number of neighbors to return
- search_kint, default=-1
Search effort. If -1, uses n_trees * n.
- include_distancesbool, default=False
If True, return (neighbors, distances) tuple
- Returns:
- neighborslist[int]
Item IDs of nearest neighbors
- distanceslist[float], optional
Distances to neighbors
- Raises:
- RuntimeError
If index not built
- ValueError
If vector dimension doesn’t match f
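Since the index returns approximate neighbors, an exact brute-force search over a small sample is a common way to judge whether n_trees and search_k are high enough. The sketch below is library-independent and uses Euclidean distance only; `exact_nns_by_vector` and `recall` are hypothetical helper names, and the item vectors are made up for the example.

```python
import heapq
import math


def exact_nns_by_vector(items, query, n):
    """Exact n nearest neighbors by Euclidean distance.

    `items` maps item id -> vector. Returns (ids, distances), nearest first.
    """
    scored = ((math.dist(vec, query), item_id) for item_id, vec in items.items())
    best = heapq.nsmallest(n, scored)
    return [item_id for _, item_id in best], [d for d, _ in best]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


items = {0: [0.0, 0.0], 5: [1.0, 0.0], 9: [0.0, 3.0], 12: [4.0, 4.0]}
ids, dists = exact_nns_by_vector(items, [0.9, 0.1], n=2)
print(ids)                     # [5, 0]
print(recall([5, 9], ids))     # 0.5
```

In practice the first argument to `recall` would be the ID list returned by get_nns_by_vector() for the same query.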
- get_params(self, bool deep: bool = True) Dict[str, Any]#
Get parameters (sklearn-style).
- Parameters:
- deepbool, default=True
If True, include nested parameters (reserved for future use)
- Returns:
- paramsdict
Parameter dictionary with all configuration
- Return type:
Dict[str, Any]
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> params = index.get_params()
>>> print(params['f'])
128
>>> print(params['metric'])
angular
- get_state(self) Dict[str, Any]#
Get complete state dictionary.
- Returns:
- statedict
Complete index state including:
* Parameters (f, metric, etc.)
* Index data (if built)
* Configuration
- Return type:
Dict[str, Any]
Examples
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> state = index.get_state()
>>> print('f' in state)
True
>>> print('metric' in state)
True
- is_built(self) bool#
Check if index has been built.
- Returns:
- builtbool
True if build() has been called
- Return type:
bool
- is_empty(self) bool#
Check if index has no items.
- Returns:
- emptybool
True if no items added
- Return type:
bool
- load(self, filename, bool prefault=False) None#
Load index from disk file.
- Parameters:
- filenamestr
Input file path
- prefaultbool, default=False
Whether to prefault pages into memory
- Raises:
- RuntimeError
If dimensions don’t match
- IOError
If file cannot be read
- Return type:
None
Notes
Dimension f and metric must match the saved index
prefault=True may improve query latency at cost of load time
- metric#
Optional[str]
Distance metric name.
- Returns:
- metricstr or None
Canonical metric name, or None if not configured.
Notes
Immutable after index construction
Returns canonical name even if alias was used in constructor
- n_neighbors#
int
Default number of neighbors for queries.
- repr_info(self, bool include_n_items=True, bool include_n_trees=True, include_memory=None) str#
Rich dictionary-like string representation.
- Parameters:
- include_n_itemsbool, default=True
Include item count
- include_n_treesbool, default=True
Include tree count
- include_memorybool or None, default=None
Include memory usage estimate. If None, the estimate is included only if the index is built.
- Returns:
- repr_strstr
Dictionary-style representation
- Return type:
str
Examples
>>> print(index.repr_info())
Annoy(**{'f': 128, 'metric': 'angular', 'n_items': 1000, 'n_trees': 10})
- save(self, filename, bool prefault=False) None#
Save index to disk file.
- Parameters:
- filenamestr
Output file path
- prefaultbool, default=False
Whether to prefault pages during save
- Raises:
- RuntimeError
If index not built
- IOError
If file cannot be written
- Return type:
None
- serialize(self) Dict[str, Any]#
Serialize to JSON-compatible dictionary.
Examples
>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> data = index.serialize()
>>> json_str = json.dumps(data, default=str)  # handle bytes
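One detail of `json.dumps(..., default=str)` is worth knowing: any `bytes` value is written as its text repr, so it comes back from `json.loads` as a string rather than as bytes. Whether serialize() emits raw bytes, and how deserialize() treats such strings, is not stated here, so the following is only a sketch of a lossless alternative using a stand-in payload. The `__bytes__` tag and both helper names are assumptions made for the example.

```python
import base64
import json


def to_jsonable(obj):
    """Fallback for json.dumps: encode bytes as tagged base64 (assumed convention)."""
    if isinstance(obj, (bytes, bytearray)):
        return {"__bytes__": base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def from_jsonable(d):
    """object_hook for json.loads: reverse the tagged base64 encoding."""
    if set(d) == {"__bytes__"}:
        return base64.b64decode(d["__bytes__"])
    return d


# Stand-in payload; the real keys returned by serialize() are not assumed here.
payload = {"f": 128, "metric": "angular", "blob": b"\x00\x01\xff"}

lossy = json.loads(json.dumps(payload, default=str))
print(type(lossy["blob"]).__name__)  # str

exact = json.loads(
    json.dumps(payload, default=to_jsonable), object_hook=from_jsonable
)
print(exact == payload)              # True
```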
- set_params(self, **params) Self#
Set parameters (sklearn-style).
- Parameters:
- **paramsdict
Parameters to update
- Returns:
- selfIndex
Returns self for method chaining
- Raises:
- ValueError
If trying to set immutable parameters after construction
- Return type:
Self
Notes
Cannot modify f or metric after index construction
Can always modify n_neighbors, seed, verbose
Examples
>>> index = Index(f=128, metric='angular')
>>> index.set_params(n_neighbors=10, seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
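The rule that f and metric are frozen once the index is constructed, while other parameters stay editable, can be sketched with a small stand-in class. `ParamsSketch` is a hypothetical name and the condition used for "constructed" (f > 0 and a metric set) is an assumption for this example; the real Index may decide this differently.

```python
class ParamsSketch:
    """Toy stand-in for the sklearn-style parameter API (not the real Index)."""

    _NAMES = ("f", "metric", "n_neighbors", "seed", "verbose")
    _FROZEN_ONCE_CONSTRUCTED = ("f", "metric")

    def __init__(self, f=None, metric=None, n_neighbors=5, seed=None, verbose=None):
        self.f = f
        self.metric = metric
        self.n_neighbors = n_neighbors
        self.seed = seed
        self.verbose = verbose

    @property
    def constructed(self):
        # Assumption: the underlying index exists once f > 0 and a metric is set.
        return bool(self.f) and self.metric is not None

    def get_params(self, deep=True):
        return {name: getattr(self, name) for name in self._NAMES}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self._NAMES:
                raise ValueError(f"unknown parameter: {name!r}")
            if self.constructed and name in self._FROZEN_ONCE_CONSTRUCTED:
                raise ValueError(f"{name!r} is immutable after construction")
            setattr(self, name, value)
        return self  # returning self enables method chaining


lazy = ParamsSketch().set_params(f=128, metric="angular").set_params(seed=42)
print(lazy.get_params()["f"])     # 128
print(lazy.get_params()["seed"])  # 42
```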
- set_seed(self, int seed) None#
Set random seed for index construction.
- Parameters:
- seedint
Random seed (0 uses default_seed)
- Return type:
None
Notes
Must be called before build()
Seed is normalized: 0 -> default_seed
Affects tree construction randomness
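The seed normalization described above (0 and None both fall back to the default, with a UserWarning for an explicit 0, as the constructor's seed parameter states) can be sketched as follows. `DEFAULT_SEED` is a placeholder value chosen for this example, not the library's actual default, and `normalize_seed` is a hypothetical helper.

```python
import warnings

DEFAULT_SEED = 1234  # placeholder; the library's real default is not stated here


def normalize_seed(seed):
    """Map None and 0 to the default seed; warn on an explicit 0."""
    if seed is None:
        return DEFAULT_SEED
    if seed == 0:
        warnings.warn(
            "seed=0 is treated as 'use default'", UserWarning, stacklevel=2
        )
        return DEFAULT_SEED
    return seed


print(normalize_seed(42))    # 42
print(normalize_seed(None))  # 1234
```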
- set_state(self, dict state: Dict[str, Any]) None#
Restore state from dictionary.
- Parameters:
- statedict
State dictionary from get_state()
- Return type:
None
Examples
>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index1.add_item(0, [0.1] * 128)
>>> index1.build()
>>> state = index1.get_state()
>>>
>>> index2 = Index()
>>> index2.set_state(state)
>>> # index2 now has same data as index1
- set_verbose(self, bool v) None#
Enable/disable verbose logging.
- Parameters:
- vbool
True to enable verbose output
- Return type:
None
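The constructor describes verbose as an integer level clamped to [-2, 2], with logging enabled at level 1 or above, while set_verbose() takes a bool. A short sketch shows how the two could fit together, assuming True simply maps to level 1; both helper names are hypothetical and this mapping is not confirmed by the reference.

```python
def clamp_verbose(level):
    """Clamp a verbosity level to the documented range [-2, 2]."""
    if level is None:
        return None
    return max(-2, min(2, int(level)))  # int(True) == 1, int(False) == 0


def logging_enabled(level):
    """Logging is on for level >= 1, per the constructor's verbose parameter."""
    return level is not None and level >= 1


print(clamp_verbose(7))                      # 2
print(clamp_verbose(-9))                     # -2
print(logging_enabled(clamp_verbose(True)))  # True
```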