Index#

class scikitplot.annoy._annoy.Index(f: int | None = None, metric: str | None = None, n_neighbors: int = 5, *, on_disk_path: str | None = None, prefault: bool = False, seed: int | None = None, verbose: int | None = None, schema_version: int = 0, dtype: str = 'float32', index_dtype: str = 'int32', wrapper_dtype: str = 'uint64', random_dtype: str = 'uint64', **kwargs)#

Annoy Approximate Nearest Neighbors Index.

This is a Cython-powered Python wrapper around the Annoy C++ library.

Parameters:

f (int or None, default=None): Embedding dimension. If 0 or None, dimension is inferred from first vector added. Must be positive for immediate index construction.
metric (str or None, default=None): Distance metric. Supported values:
* 'angular', 'cosine' → cosine-like distance
* 'euclidean', 'l2', 'lstsq' → L2 distance
* 'manhattan', 'l1', 'cityblock', 'taxicab' → L1 distance
* 'dot', '@', '.', 'dotproduct', 'inner', 'innerproduct' → negative dot product
* 'hamming' → bitwise Hamming distance
If None and f > 0, defaults to 'angular' with FutureWarning.
n_neighbors (int, default=5): Default number of neighbors for queries (estimator parameter).
on_disk_path (str or None, default=None): Path for on-disk building. If provided, enables memory-efficient building for large indices.
prefault (bool, default=False): Whether to prefault pages when loading (may improve query latency).
seed (int or None, default=None): Random seed for tree construction. If None, uses Annoy's default. Value 0 is treated as "use default" and emits a UserWarning.
verbose (int or None, default=None): Verbosity level (clamped to [-2, 2]). Level >= 1 enables logging.
schema_version (int, default=0): Pickle schema version marker (does not affect on-disk format).
dtype (str, default='float32'): Data type: float16, float32, float64, float80, float128
index_dtype (str, default='int32'): Index type: int32, int64
wrapper_dtype (str, default='uint64'): Wrapper type (for Hamming): uint32, uint64
random_dtype (str, default='uint64'): Random seed type
**kwargs: Future extensibility
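
The alias groups listed for metric can be read as a lookup table. The sketch below is a plain-Python illustration and not the library's implementation: the names _METRIC_ALIASES and canonical_metric are hypothetical, and treating the first alias of each group as the canonical name is an assumption.

```python
# Hypothetical alias table built from the documented list of metric names.
# Assumption: the first alias in each group is the canonical name.
_METRIC_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}

# Invert the table so every alias points at its canonical name.
_CANONICAL = {
    alias: canonical
    for canonical, aliases in _METRIC_ALIASES.items()
    for alias in aliases
}


def canonical_metric(name):
    """Map a documented alias to its (assumed) canonical metric name."""
    try:
        return _CANONICAL[name]
    except KeyError:
        raise ValueError(f"unsupported metric: {name!r}") from None


print(canonical_metric("cosine"))   # angular
print(canonical_metric("taxicab"))  # manhattan
```

Whatever the real canonical names are, the metric property is documented to return them even when an alias was passed to the constructor.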

Attributes:

f (int): Embedding dimension.
metric (str or None): Distance metric name.
ptr (AnnoyIndexInterface*): Pointer to C++ index (NULL if not constructed).
State indicators (internal):
_f_valid (bool): True if f has been set (> 0)
_metric_valid (bool): True if metric has been configured
_index_constructed (bool): True if C++ index exists (ptr != NULL)

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.add_item(1, [0.2] * 128)
>>> index.build(n_trees=10)
>>> neighbors, distances = index.get_nns_by_item(0, n=5, include_distances=True)

Set the dtype:

>>> # Standard usage (float32)
>>> index = Index(f=128, metric='angular', dtype='float32')
>>>
>>> # High precision (float64)
>>> index = Index(f=128, metric='euclidean', dtype='float64')
>>>
>>> # Half precision (float16) - future
>>> # index = Index(f=128, metric='angular', dtype='float16')

add_item(self, int item, vector) → None#

Add a vector to the index.

Parameters:

item (int): Non-negative item identifier
vector (sequence): Embedding vector of length f

Raises:

RuntimeError: If index is not constructed or already built
ValueError: If vector dimension doesn’t match f
IndexError: If item is negative

Return type:

None

Notes

Must be called before build()
Item IDs need not be contiguous
After build(), call unbuild() to add more items
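
The ordering rules in these notes amount to a small state machine. The following toy model is only a sketch of the documented rules: LifecycleModel is a hypothetical class, it stores no trees, and the real Index may enforce more conditions than shown here.

```python
class LifecycleModel:
    """Toy model of the documented add/build/unbuild rules (not the real Index)."""

    def __init__(self):
        self.items = {}     # item id -> vector; ids need not be contiguous
        self.built = False

    def add_item(self, item, vector):
        if self.built:
            raise RuntimeError("index already built; call unbuild() first")
        if item < 0:
            raise IndexError("item must be non-negative")
        self.items[item] = list(vector)

    def build(self):
        if not self.items:
            raise RuntimeError("no items added")
        self.built = True   # read-only from here on

    def unbuild(self):
        if not self.built:
            raise RuntimeError("index is not built")
        self.built = False  # items can be added again


model = LifecycleModel()
model.add_item(0, [0.1, 0.2])
model.add_item(7, [0.3, 0.4])      # non-contiguous id is accepted
model.build()
try:
    model.add_item(8, [0.5, 0.6])  # rejected while the model is built
except RuntimeError as exc:
    print("rejected:", exc)
model.unbuild()
model.add_item(8, [0.5, 0.6])      # accepted after unbuild()
model.build()
```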

build(self, int n_trees=-1, int n_jobs=-1) → None#

Build the search forest (thread-safe, releases GIL).

Parameters:

n_trees (int, default=-1): Number of trees to build. If -1, auto-selects based on dimension. More trees = better accuracy but slower queries and more memory.
n_jobs (int, default=-1): Number of threads. If -1, uses all available cores.

Raises:

RuntimeError: If index is not constructed or no items added

Return type:

None

Notes

Index becomes read-only after build()
Auto n_trees formula: max(10, 2*f)
Call unbuild() to add more items
Releases GIL during C++ build operation
Allows concurrent Python threads to run
The C++ build itself is multi-threaded (n_jobs)
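
As a worked instance of the auto n_trees formula in the notes above, the sketch below evaluates max(10, 2*f) for a few dimensions. The helper name auto_n_trees is hypothetical and only restates the documented formula; it does not call the library.

```python
def auto_n_trees(f):
    """Tree count the notes describe for n_trees=-1: max(10, 2 * f)."""
    if f <= 0:
        raise ValueError("f must be positive")
    return max(10, 2 * f)


for f in (3, 5, 128):
    print(f, auto_n_trees(f))
# 3 10
# 5 10
# 128 256
```

Because memory and query time grow with the number of trees, passing an explicit n_trees may be preferable for high-dimensional data.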

Examples

>>> # Each thread builds its own, independent index:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def worker(seed):
...     index = Index(f=8, metric='angular', seed=seed)
...     for i in range(100):
...         index.add_item(i, [float(i + j) for j in range(8)])
...     index.build(n_trees=10)
...     return index
>>> with ThreadPoolExecutor(max_workers=4) as executor:
...     indices = list(executor.map(worker, range(1, 5)))

clone(self, **override_params) → Self#

Create a copy of the index with optional parameter overrides.

Parameters:

**override_params (dict): Parameters to override in the clone

Returns:

index (Index): New index with same parameters (but no data)

Return type:

Self

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index2 = index1.clone(seed=123)  # Same f and metric, different seed

classmethod deserialize(cls, data: Dict[str, Any]) → Self#

Deserialize from dictionary.

Parameters:

data (dict): Serialized state from serialize()

Returns:

index (Index): Restored index instance

Raises:

TypeError: If data is not a dict
ValueError: If data format is invalid

Return type:

Self

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> json_str = json.dumps(index.serialize(), default=str)
>>> data = json.loads(json_str)
>>> restored = Index.deserialize(data)

f#

int

Embedding dimension.

Returns:

f (int): Number of dimensions (0 means "unset / lazy").

Notes

Immutable after index construction
Setting to 0 after construction raises ValueError

Type: int

classmethod from_dict(cls, data: Dict[str, Any]) → Self#

Alias for deserialize().

Parameters:

data (dict): Serialized state

Returns:

Index: Restored instance

Return type:

Self

get_distance(self, int i, int j)#

Compute distance between two stored items.

Parameters:

i, j (int): Item IDs (must be < n_items)

Returns:

distance (float): Distance according to index metric

Raises:

RuntimeError: If index not constructed
IndexError: If i or j is negative or >= n_items

Notes

Does not require built index
For Hamming metric, distance is clipped to [0, f]
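
For checking get_distance results by hand, the metrics can be written out in plain Python. This is a reference sketch under assumptions: the angular formula sqrt(2 * (1 - cos(u, v))) and the zero-vector fallback follow upstream Annoy, and this wrapper is assumed, not confirmed, to match. The function names are hypothetical, and the dot metric is omitted because its sign convention is not spelled out here.

```python
import math


def angular(u, v):
    # Upstream Annoy convention (assumed to apply here): sqrt(2 * (1 - cos)).
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norms == 0.0:
        return math.sqrt(2.0)  # assumed fallback when either vector is all zeros
    return math.sqrt(max(0.0, 2.0 - 2.0 * dot / norms))


def euclidean(u, v):
    return math.dist(u, v)  # L2 distance (Python 3.8+)


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))  # L1 distance


def hamming(u, v):
    # Number of differing positions in two 0/1 vectors; always within [0, f].
    return sum(1 for a, b in zip(u, v) if bool(a) != bool(b))


print(euclidean([0.0, 0.0], [3.0, 4.0]))   # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))   # 7.0
print(hamming([1, 0, 1, 1], [1, 1, 0, 1])) # 2
```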

get_item(self, int item)#

Retrieve a stored embedding vector.

Parameters:

item (int): Item ID (must be < n_items)

Returns:

vector (list[float]): Embedding vector of length f

Raises:

RuntimeError: If index not constructed
IndexError: If item is negative or >= n_items

get_n_items(self) → int#

Return number of items in the index.

Returns:

n_items (int): Number of items added (may be sparse)

Return type:

int

get_n_trees(self) → int#

Return number of trees in the index.

Returns:

n_trees (int): Number of trees (0 if not built)

Return type:

int

get_nns_by_item(self, int item, int n, int search_k=-1, bool include_distances=False)#

Find nearest neighbors (thread-safe, releases GIL).

Parameters:

item (int): Query item ID (must be < n_items)
n (int): Number of neighbors to return
search_k (int, default=-1): Search effort. If -1, uses n_trees * n. Higher values = better accuracy but slower.
include_distances (bool, default=False): If True, return (neighbors, distances) tuple

Returns:

neighbors (list[int]): Item IDs of nearest neighbors
distances (list[float], optional): Distances to neighbors (only if include_distances=True)

Raises:

RuntimeError: If index not built
IndexError: If item >= n_items

Notes

Releases GIL during query (true parallelism)
Multiple threads can query simultaneously
Speedup can approach linear in the thread count, up to the number of available cores

Examples

>>> # Parallel queries from multiple threads:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def query_worker(index, item_id):
...     return index.get_nns_by_item(item_id, n=10)
>>> with ThreadPoolExecutor(max_workers=8) as executor:
...     results = list(executor.map(
...         lambda i: query_worker(index, i),
...         range(1000)
...     ))
>>> # The GIL is released during each query, so up to 8 queries can run concurrently.
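
The search_k default described in the parameter list can be made concrete with a short sketch. The helper name effective_search_k is hypothetical; it only restates the documented rule (n_trees * n when search_k is -1) and does not call the library.

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Search effort the parameter description implies for a query."""
    if search_k == -1:
        return n_trees * n  # documented default
    return search_k         # an explicit value is used as given


print(effective_search_k(n=10, n_trees=20))                 # 200
print(effective_search_k(n=10, n_trees=20, search_k=5000))  # 5000
```

Raising search_k above the default trades query time for accuracy without rebuilding the index.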

get_nns_by_vector(self, vector, int n, int search_k=-1, bool include_distances=False)#

Query by vector (thread-safe, releases GIL).

Parameters:

vector (sequence): Query vector of length f
n (int): Number of neighbors to return
search_k (int, default=-1): Search effort. If -1, uses n_trees * n.
include_distances (bool, default=False): If True, return (neighbors, distances) tuple

Returns:

neighbors (list[int]): Item IDs of nearest neighbors
distances (list[float], optional): Distances to neighbors

Raises:

RuntimeError: If index not built
ValueError: If vector dimension doesn’t match f
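
Because the results are approximate, it can help to compare them against an exact search on a small sample. The sketch below is a plain-Python baseline under assumptions: it uses Euclidean distance, the names exact_neighbors and recall are hypothetical, and the approx list is a hand-written stand-in for a get_nns_by_vector result rather than real library output.

```python
import math


def exact_neighbors(vectors, query, n):
    """Ids of the n vectors closest to query by exact Euclidean distance."""
    ranked = sorted(vectors, key=lambda item: math.dist(vectors[item], query))
    return ranked[:n]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result recovered."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [5.0, 5.0]}
exact = exact_neighbors(vectors, [0.1, 0.0], n=3)
approx = [0, 1, 3]  # stand-in for an approximate query result
print(exact)                  # [0, 1, 2]
print(recall(approx, exact))  # two of the three exact neighbors were found
```

If recall on such a sample is too low, increasing search_k or n_trees are the two knobs this page documents.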

get_params(self, deep: bool = True) → Dict[str, Any]#

Get parameters (sklearn-style).

Parameters:

deep (bool, default=True): If True, include nested parameters (reserved for future use)

Returns:

params (dict): Parameter dictionary with all configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> params = index.get_params()
>>> print(params['f'])
128
>>> print(params['metric'])
angular

get_state(self) → Dict[str, Any]#

Get complete state dictionary.

Returns:

state (dict): Complete index state including:
* Parameters (f, metric, etc.)
* Index data (if built)
* Configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> state = index.get_state()
>>> print('f' in state)
True
>>> print('metric' in state)
True

is_built(self) → bool#

Check if index has been built.

Returns:

built (bool): True if build() has been called

Return type:

bool

is_empty(self) → bool#

Check if index has no items.

Returns:

empty (bool): True if no items added

Return type:

bool

load(self, filename, bool prefault=False) → None#

Load index from disk file.

Parameters:

filename (str): Input file path
prefault (bool, default=False): Whether to prefault pages into memory

Raises:

RuntimeError: If dimensions don’t match
IOError: If file cannot be read

Return type:

None

Notes

Dimension f and metric must match the saved index
prefault=True may improve query latency at cost of load time

metric#

Optional[str]

Distance metric name.

Returns:

metric (str or None): Canonical metric name, or None if not configured.

Notes

Immutable after index construction
Returns canonical name even if alias was used in constructor

Type: Optional[str]

n_neighbors#

int

Default number of neighbors for queries.

Type: int

repr_info(self, bool include_n_items=True, bool include_n_trees=True, include_memory=None) → str#

Rich dictionary-like string representation.

Parameters:

include_n_items (bool, default=True): Include item count
include_n_trees (bool, default=True): Include tree count
include_memory (bool or None, default=None): Include memory usage estimate. If None, includes only if index is built

Returns:

repr_str (str): Dictionary-style representation

Return type:

str

Examples

>>> print(index.repr_info())
Annoy(**{'f': 128, 'metric': 'angular', 'n_items': 1000, 'n_trees': 10})

save(self, filename, bool prefault=False) → None#

Save index to disk file.

Parameters:

filename (str): Output file path
prefault (bool, default=False): Whether to prefault pages during save

Raises:

RuntimeError: If index not built
IOError: If file cannot be written

Return type:

None

serialize(self) → Dict[str, Any]#

Serialize to JSON-compatible dictionary.

Returns:

data (dict): JSON-serializable state

Return type:

Dict[str, Any]

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> data = index.serialize()
>>> json_str = json.dumps(data, default=str)  # handle bytes
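
Passing default=str to json.dumps turns any bytes value into its text repr, which json.loads will not turn back into bytes. If the serialized state needs to survive a JSON round trip, one option is to base64-encode bytes explicitly. The sketch below is an illustration under assumptions: encode_bytes and decode_bytes are hypothetical helpers, the "__bytes__" tag and the "index_data" key are made up for the example, and whether deserialize() accepts the restored dictionary is not confirmed here.

```python
import base64
import json


def encode_bytes(obj):
    """json.dumps `default` hook: wrap bytes in a tagged base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return {"__bytes__": base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def decode_bytes(dct):
    """json.loads `object_hook`: restore values written by encode_bytes."""
    if set(dct) == {"__bytes__"}:
        return base64.b64decode(dct["__bytes__"])
    return dct


# Hypothetical state dictionary; the keys are placeholders for illustration.
state = {"f": 128, "metric": "angular", "index_data": b"\x00\x01\xff"}

text = json.dumps(state, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
print(restored == state)  # True
```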

set_params(self, **params) → Self#

Set parameters (sklearn-style).

Parameters:

**params (dict): Parameters to update

Returns:

self (Index): Returns self for method chaining

Raises:

ValueError: If trying to set immutable parameters after construction

Return type:

Self

Notes

Cannot modify f or metric after index construction
Can always modify n_neighbors, seed, verbose

Examples

>>> index = Index(f=128, metric='angular')
>>> index = index.set_params(n_neighbors=10, seed=42)  # returns self
>>> index.add_item(0, [0.1] * 128)  # build() requires at least one item
>>> index.build()

set_seed(self, int seed) → None#

Set random seed for index construction.

Parameters:

seed (int): Random seed (0 uses default_seed)

Return type:

None

Notes

Must be called before build()
Seed is normalized: 0 -> default_seed
Affects tree construction randomness

set_state(self, state: Dict[str, Any]) → None#

Restore state from dictionary.

Parameters:

state (dict): State dictionary from get_state()

Return type:

None

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index1.add_item(0, [0.1] * 128)
>>> index1.build()
>>> state = index1.get_state()
>>>
>>> index2 = Index()
>>> index2.set_state(state)
>>> # index2 now has same data as index1

set_verbose(self, bool v) → None#

Enable/disable verbose logging.

Parameters:

v (bool): True to enable verbose output

Return type:

None

to_dict(self) → Dict[str, Any]#

Alias for serialize().

Returns:

dict: Serialized state

Return type:

Dict[str, Any]

unbuild(self) → None#

Remove all trees to allow adding more items.

Transitions index back to BUILDING state.

Raises:

RuntimeError: If index is not built

Return type:

None

unload(self) → None#

Unmap memory-mapped files and free memory.

Transitions index to EMPTY state. Safe to call multiple times.

Return type:

None

Gallery examples#

Index (cython) python-api with examples