Index

class scikitplot.annoy._annoy.Index(f: int | None = None, metric: str | None = None, int n_neighbors: int = 5, *, on_disk_path: str | None = None, bool prefault: bool = False, seed: int | None = None, verbose: int | None = None, int schema_version: int = 0, str dtype: str = 'float32', str index_dtype: str = 'int32', str wrapper_dtype: str = 'uint64', str random_dtype: str = 'uint64', **kwargs)

Annoy Approximate Nearest Neighbors Index.

This is a Cython-powered Python wrapper around the Annoy C++ library.

Parameters:
f : int or None, default=None

Embedding dimension. If 0 or None, the dimension is inferred from the first vector added. Must be positive for immediate index construction.

metric : str or None, default=None

Distance metric. Supported values:

  • “angular”, “cosine” → cosine-like distance

  • “euclidean”, “l2”, “lstsq” → L2 distance

  • “manhattan”, “l1”, “cityblock”, “taxicab” → L1 distance

  • “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct” → negative dot product

  • “hamming” → bitwise Hamming distance

If None and f > 0, defaults to “angular” with a FutureWarning.
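The alias groups above can be pictured as a simple lookup table. The following is a minimal pure-Python sketch, not the library's implementation: the helper `canonical_metric` is hypothetical, and treating the first alias of each group as the canonical name is an assumption (this page only states that a canonical name exists).

```python
# Hypothetical lookup table built from the aliases documented above.
# Assumption: the first alias in each group is the canonical name.
_METRIC_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}

# Flatten to alias -> canonical name for O(1) lookups.
_LOOKUP = {
    alias: name
    for name, aliases in _METRIC_ALIASES.items()
    for alias in aliases
}


def canonical_metric(metric):
    """Return the canonical family name for a documented metric alias."""
    try:
        return _LOOKUP[metric]
    except KeyError:
        raise ValueError(f"unsupported metric: {metric!r}") from None


print(canonical_metric("l2"))        # alias of the L2 family
print(canonical_metric("cityblock")) # alias of the L1 family
```

Whether the real constructor also accepts different letter case or surrounding whitespace is not stated here, so the sketch matches aliases exactly.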

n_neighbors : int, default=5

Default number of neighbors for queries (estimator parameter).

on_disk_path : str or None, default=None

Path for on-disk building. If provided, enables memory-efficient building for large indices.

prefault : bool, default=False

Whether to prefault pages when loading (may improve query latency).

seed : int or None, default=None

Random seed for tree construction. If None, uses Annoy’s default. Value 0 is treated as “use default” and emits a UserWarning.

verbose : int or None, default=None

Verbosity level (clamped to [-2, 2]). Level >= 1 enables logging.
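The seed and verbosity rules above amount to two small normalization steps. The sketch below restates them in pure Python. The helper names are hypothetical, and returning None to mean "use the library default" is a choice of this sketch, because the actual default seed value is not given on this page.

```python
import warnings


def normalize_seed(seed):
    """Return None for 'use the library default', otherwise the seed as int."""
    if seed is None:
        return None
    if seed == 0:
        # Documented behavior: 0 means "use default" and emits a UserWarning.
        warnings.warn("seed=0 is treated as 'use default'", UserWarning, stacklevel=2)
        return None
    return int(seed)


def clamp_verbose(verbose):
    """Clamp a verbosity level to the documented range [-2, 2]."""
    return max(-2, min(2, int(verbose)))


def logging_enabled(verbose):
    """Documented rule: a (clamped) level >= 1 enables logging."""
    return verbose is not None and clamp_verbose(verbose) >= 1


print(clamp_verbose(5), clamp_verbose(-7), logging_enabled(1))
```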

schema_version : int, default=0

Pickle schema version marker (does not affect on-disk format).

dtype : str, default='float32'

Data type: float16, float32, float64, float80, float128

index_dtype : str, default='int32'

Index type: int32, int64

wrapper_dtype : str, default='uint64'

Wrapper type (for Hamming): uint32, uint64

random_dtype : str, default='uint64'

Random seed type

**kwargs

Reserved for future extensibility

Attributes:
f : int

Index.f: int

metric : str or None

Index.metric: Optional[str]

ptr : AnnoyIndexInterface*

Pointer to C++ index (NULL if not constructed).

# State Indicators (Internal)

_f_valid : bool

True if f has been set (> 0)

_metric_valid : bool

True if metric has been configured

_index_constructed : bool

True if C++ index exists (ptr != NULL)

Examples

>>> from scikitplot.annoy._annoy import Index  # module path as shown in the class signature
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.add_item(1, [0.2] * 128)
>>> index.build(n_trees=10)
>>> neighbors, distances = index.get_nns_by_item(0, n=5, include_distances=True)

Selecting a dtype:

>>> # Standard usage (float32)
>>> index = Index(f=128, metric='angular', dtype='float32')
>>>
>>> # High precision (float64)
>>> index = Index(f=128, metric='euclidean', dtype='float64')
>>>
>>> # Half precision (float16) - future
>>> # index = Index(f=128, metric='angular', dtype='float16')

add_item(self, int item, vector) → None

Add a vector to the index.

Parameters:
item : int

Non-negative item identifier

vector : sequence

Embedding vector of length f

Raises:
RuntimeError

If index is not constructed or already built

ValueError

If vector dimension doesn’t match f

IndexError

If item is negative

Return type:

None

Notes

  • Must be called before build()

  • Item IDs need not be contiguous

  • After build(), call unbuild() to add more items

build(self, int n_trees=-1, int n_jobs=-1) → None

Build the search forest (thread-safe, releases GIL).

Parameters:
n_trees : int, default=-1

Number of trees to build. If -1, auto-selects based on dimension. More trees give better accuracy but slower queries and more memory use.

n_jobs : int, default=-1

Number of threads. If -1, uses all available cores.

Raises:
RuntimeError

If index is not constructed or no items added

Return type:

None

Notes

  • Index becomes read-only after build()

  • Auto n_trees formula: max(10, 2*f)

  • Call unbuild() to add more items

  • Releases GIL during C++ build operation

  • Allows concurrent Python threads to run

  • The C++ build itself is multi-threaded (n_jobs)
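The defaults in the notes above can be written out explicitly. This is a small sketch of the documented formula `max(10, 2*f)`. The helper names are hypothetical, and mapping `n_jobs=-1` to `os.cpu_count()` is an assumption about what "all available cores" means in the wrapper.

```python
import os


def auto_n_trees(f):
    """Documented default when n_trees=-1: max(10, 2 * f)."""
    if f <= 0:
        raise ValueError("f must be positive to derive n_trees")
    return max(10, 2 * f)


def resolve_n_jobs(n_jobs):
    """Assumption: -1 maps to os.cpu_count(); the wrapper may count cores differently."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    return n_jobs


# Low-dimensional data hits the floor of 10 trees; f=128 yields 256 trees.
for f in (3, 5, 128):
    print(f, auto_n_trees(f))
```

Because the automatic tree count grows linearly with f, passing an explicit n_trees may be preferable for high-dimensional data when memory or build time is a concern.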

Examples

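>>> # Each thread builds its own index; build() releases the GIL:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def worker(index):
...     index.build(n_trees=10)
>>> indexes = [Index(f=128, metric='angular', seed=42) for _ in range(4)]
>>> for idx in indexes:
...     idx.add_item(0, [0.1] * 128)  # build() requires at least one item
>>> with ThreadPoolExecutor(max_workers=4) as executor:
...     futures = [executor.submit(worker, idx) for idx in indexes]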

clone(self, **override_params) → Self

Create a copy of the index with optional parameter overrides.

Parameters:
**override_params : dict

Parameters to override in the clone

Returns:
index : Index

New index with same parameters (but no data)

Return type:

Self

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index2 = index1.clone(seed=123)  # Same f and metric, different seed

classmethod deserialize(cls, dict data: Dict[str, Any]) → Self

Deserialize from dictionary.

Parameters:
data : dict

Serialized state from serialize()

Returns:
index : Index

Restored index instance

Raises:
TypeError

If data is not a dict

ValueError

If data format is invalid

Return type:

Self

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> json_str = json.dumps(index.serialize(), default=str)
>>> data = json.loads(json_str)
>>> restored = Index.deserialize(data)

f

int

Embedding dimension.

Returns:
f : int

Number of dimensions (0 means “unset / lazy”).

Notes

  • Immutable after index construction

  • Setting to 0 after construction raises ValueError

Type:

Index.f

classmethod from_dict(cls, dict data: Dict[str, Any]) → Self

Alias for deserialize().

Parameters:
data : dict

Serialized state

Returns:
Index

Restored instance

Return type:

Self

get_distance(self, int i, int j)

Compute distance between two stored items.

Parameters:
i, j : int

Item IDs (must be < n_items)

Returns:
distance : float

Distance according to index metric

Raises:
RuntimeError

If index not constructed

IndexError

If i or j is negative or >= n_items

Notes

  • Does not require built index

  • For Hamming metric, distance is clipped to [0, f]
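For intuition about the values get_distance() may return, here are pure-Python reference formulas for four of the metric families. This is a sketch under stated assumptions, not the wrapper's code: the angular formula sqrt(2 * (1 - cos)) is the definition used by upstream Annoy, and it is assumed here that this wrapper follows it. The dot metric is omitted because its sign convention is not spelled out beyond "negative dot product".

```python
import math


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def angular(u, v):
    """Upstream Annoy's angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        cos = 0.0  # choice made for this sketch when a vector has zero norm
    else:
        cos = dot / (norm_u * norm_v)
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def hamming(u, v):
    """Number of differing positions in two binary vectors; lies in [0, f]."""
    return sum(1 for a, b in zip(u, v) if bool(a) != bool(b))


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```

Under this definition the angular distance ranges from 0 (same direction) to 2 (opposite direction), which is why it is described as cosine-like rather than as cosine similarity.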

get_item(self, int item)

Retrieve a stored embedding vector.

Parameters:
item : int

Item ID (must be < n_items)

Returns:
vector : list[float]

Embedding vector of length f

Raises:
RuntimeError

If index not constructed

IndexError

If item is negative or >= n_items

get_n_items(self) → int

Return number of items in the index.

Returns:
n_items : int

Number of items added (may be sparse)

Return type:

int

get_n_trees(self) → int

Return number of trees in the index.

Returns:
n_trees : int

Number of trees (0 if not built)

Return type:

int

get_nns_by_item(self, int item, int n, int search_k=-1, bool include_distances=False)

Find nearest neighbors (thread-safe, releases GIL).

Parameters:
item : int

Query item ID (must be < n_items)

n : int

Number of neighbors to return

search_k : int, default=-1

Search effort. If -1, uses n_trees * n. Higher values give better accuracy but slower queries.

include_distances : bool, default=False

If True, return a (neighbors, distances) tuple

Returns:
neighbors : list[int]

Item IDs of nearest neighbors

distances : list[float], optional

Distances to neighbors (only if include_distances=True)

Raises:
RuntimeError

If index not built

IndexError

If item >= n_items

Notes

  • Releases GIL during query (true parallelism)

  • Multiple threads can query simultaneously

  • Throughput can scale with thread count, up to the number of available cores
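The search_k default described in the parameters above is a one-line rule. The sketch below spells it out with a hypothetical helper, `effective_search_k`, to show how the search effort changes with n and n_trees.

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Documented rule: search_k=-1 resolves to n_trees * n."""
    if search_k == -1:
        return n_trees * n
    return search_k


# With 10 trees, asking for 10 neighbors gives a default effort of 100;
# passing search_k explicitly overrides the default.
print(effective_search_k(n=10, n_trees=10))
print(effective_search_k(n=10, n_trees=10, search_k=5000))
```

Raising search_k above the default trades query time for accuracy without rebuilding the index, whereas raising n_trees requires a rebuild.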

Examples

>>> # Parallel queries from multiple threads:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def query_worker(index, item_id):
...     return index.get_nns_by_item(item_id, n=10)
>>> with ThreadPoolExecutor(max_workers=8) as executor:
...     results = list(executor.map(
...         lambda i: query_worker(index, i),
...         range(1000)
...     ))
>>> # True parallelism - all 8 threads run concurrently!

get_nns_by_vector(self, vector, int n, int search_k=-1, bool include_distances=False)

Query by vector (thread-safe, releases GIL).

Parameters:
vector : sequence

Query vector of length f

n : int

Number of neighbors to return

search_k : int, default=-1

Search effort. If -1, uses n_trees * n.

include_distances : bool, default=False

If True, return a (neighbors, distances) tuple

Returns:
neighbors : list[int]

Item IDs of nearest neighbors

distances : list[float], optional

Distances to neighbors (only if include_distances=True)

Raises:
RuntimeError

If index not built

ValueError

If vector dimension doesn’t match f

get_params(self, bool deep: bool = True) → Dict[str, Any]

Get parameters (sklearn-style).

Parameters:
deep : bool, default=True

If True, include nested parameters (reserved for future use)

Returns:
params : dict

Parameter dictionary with all configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> params = index.get_params()
>>> print(params['f'])
128
>>> print(params['metric'])
angular

get_state(self) → Dict[str, Any]

Get complete state dictionary.

Returns:
state : dict

Complete index state, including:

  • Parameters (f, metric, etc.)

  • Index data (if built)

  • Configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> state = index.get_state()
>>> print('f' in state)
True
>>> print('metric' in state)
True

is_built(self) → bool

Check if index has been built.

Returns:
built : bool

True if build() has been called

Return type:

bool

is_empty(self) → bool

Check if index has no items.

Returns:
empty : bool

True if no items added

Return type:

bool

load(self, filename, bool prefault=False) → None

Load index from disk file.

Parameters:
filename : str

Input file path

prefault : bool, default=False

Whether to prefault pages into memory

Raises:
RuntimeError

If dimensions don’t match

IOError

If file cannot be read

Return type:

None

Notes

  • Dimension f and metric must match the saved index

  • prefault=True may improve query latency at cost of load time

metric

Optional[str]

Distance metric name.

Returns:
metric : str or None

Canonical metric name, or None if not configured.

Notes

  • Immutable after index construction

  • Returns canonical name even if alias was used in constructor

Type:

Index.metric

n_neighbors

int

Default number of neighbors for queries.

Type:

Index.n_neighbors

repr_info(self, bool include_n_items=True, bool include_n_trees=True, include_memory=None) → str

Rich dictionary-like string representation.

Parameters:
include_n_items : bool, default=True

Include item count

include_n_trees : bool, default=True

Include tree count

include_memory : bool or None, default=None

Include memory usage estimate. If None, it is included only if the index is built.

Returns:
repr_str : str

Dictionary-style representation

Return type:

str

Examples

>>> print(index.repr_info())
Annoy(**{'f': 128, 'metric': 'angular', 'n_items': 1000, 'n_trees': 10})

save(self, filename, bool prefault=False) → None

Save index to disk file.

Parameters:
filename : str

Output file path

prefault : bool, default=False

Whether to prefault pages during save

Raises:
RuntimeError

If index not built

IOError

If file cannot be written

Return type:

None

serialize(self) → Dict[str, Any]

Serialize to JSON-compatible dictionary.

Returns:
data : dict

JSON-serializable state

Return type:

Dict[str, Any]

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> data = index.serialize()
>>> json_str = json.dumps(data, default=str)  # handle bytes
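
The `default=str` fallback in the example above deserves a caveat: if a payload contains bytes, `str` stores their repr text, which does not decode back to the original bytes. The sketch below demonstrates this with the standard library only. The `state` dictionary and its keys are hypothetical (the real keys returned by serialize() are not listed on this page), and the base64 helpers are one possible lossless alternative, not something the library is known to provide.

```python
import base64
import json

# Hypothetical payload standing in for a serialized state that holds bytes.
state = {"f": 4, "metric": "angular", "blob": b"\x00\x01\xff"}

# default=str keeps json.dumps from failing, but the bytes become repr text.
lossy = json.loads(json.dumps(state, default=str))
print(type(lossy["blob"]).__name__, lossy["blob"])


def encode_bytes(obj):
    """json.dumps fallback: wrap bytes as a tagged base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return {"__b64__": base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def decode_bytes(d):
    """json.loads object_hook: unwrap tagged base64 strings back to bytes."""
    if set(d) == {"__b64__"}:
        return base64.b64decode(d["__b64__"])
    return d


text = json.dumps(state, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
print(restored == state)
```

If serialize() already returns only JSON-native types, as its description suggests, neither fallback is exercised and a plain `json.dumps(data)` is enough.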

set_params(self, **params) → Self

Set parameters (sklearn-style).

Parameters:
**params : dict

Parameters to update

Returns:
self : Index

Returns self for method chaining

Raises:
ValueError

If trying to set immutable parameters after construction

Return type:

Self

Notes

  • Cannot modify f or metric after index construction

  • Can always modify n_neighbors, seed, verbose

Examples

>>> index = Index(f=128, metric='angular')
>>> index.set_params(n_neighbors=10, seed=42)
>>> index.add_item(0, [0.1] * 128)  # build() requires at least one item
>>> index.build()

set_seed(self, int seed) → None

Set random seed for index construction.

Parameters:
seed : int

Random seed (0 uses default_seed)

Return type:

None

Notes

  • Must be called before build()

  • Seed is normalized: 0 -> default_seed

  • Affects tree construction randomness

set_state(self, dict state: Dict[str, Any]) → None

Restore state from dictionary.

Parameters:
state : dict

State dictionary from get_state()

Return type:

None

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index1.add_item(0, [0.1] * 128)
>>> index1.build()
>>> state = index1.get_state()
>>>
>>> index2 = Index()
>>> index2.set_state(state)
>>> # index2 now has same data as index1

set_verbose(self, bool v) → None

Enable/disable verbose logging.

Parameters:
v : bool

True to enable verbose output

Return type:

None

to_dict(self) → Dict[str, Any]

Alias for serialize().

Returns:
dict

Serialized state

Return type:

Dict[str, Any]

unbuild(self) → None

Remove all trees to allow adding more items.

Transitions index back to BUILDING state.

Raises:
RuntimeError

If index is not built

Return type:

None

unload(self) → None

Unmap memory-mapped files and free memory.

Transitions index to EMPTY state. Safe to call multiple times.

Return type:

None
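
The call-ordering rules documented for add_item(), build(), unbuild() and unload() can be summarized as a small state machine. The toy class below is a sketch for illustration only: it stores no vectors, the class name `LifecycleSketch` is hypothetical, and the state name BUILT is an assumption (only EMPTY and BUILDING are named on this page).

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    BUILDING = "building"
    BUILT = "built"  # assumed name; not stated in the reference above


class LifecycleSketch:
    """Toy model of the documented call ordering; not the real Index."""

    def __init__(self):
        self.state = State.EMPTY
        self.n_items = 0

    def add_item(self):
        # Documented: adding after build() raises RuntimeError.
        if self.state is State.BUILT:
            raise RuntimeError("index already built; call unbuild() first")
        self.state = State.BUILDING
        self.n_items += 1

    def build(self):
        # Documented: building with no items raises RuntimeError.
        if self.n_items == 0:
            raise RuntimeError("no items added")
        self.state = State.BUILT

    def unbuild(self):
        # Documented: unbuild() raises RuntimeError if the index is not built.
        if self.state is not State.BUILT:
            raise RuntimeError("index is not built")
        self.state = State.BUILDING

    def unload(self):
        # Documented: safe to call multiple times; returns to EMPTY.
        # Resetting the item count here is an assumption of this sketch.
        self.state = State.EMPTY
        self.n_items = 0


model = LifecycleSketch()
model.add_item()
model.build()
model.unbuild()   # back to BUILDING, so more items can be added
model.add_item()
model.build()
print(model.state, model.n_items)
```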