Index

class scikitplot.annoy._annoy.Index(int f: int | None = None, str metric: str | None = None, int n_neighbors: int = 5, *, str on_disk_path: str | None = None, bool prefault: bool = False, int seed: int | None = None, int verbose: int | None = None, int schema_version: int = 0, str dtype: str = 'float32', str index_dtype: str = 'int32', str wrapper_dtype: str = 'uint64', str random_dtype: str = 'uint64', int n_jobs: int | None = None, **kwargs)

Annoy Approximate Nearest Neighbors Index.

This is a Cython-powered Python wrapper around the Annoy C++ library.

Parameters:
f : int or None, default=None

Embedding dimension. If 0 or None, the dimension is inferred from the first vector added. Must be positive for immediate index construction.

metric : str or None, default=None

Distance metric. Supported values:

  • “angular”, “cosine” → cosine-like distance

  • “euclidean”, “l2”, “lstsq” → L2 distance

  • “manhattan”, “l1”, “cityblock”, “taxicab” → L1 distance

  • “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct” → negative dot product

  • “hamming” → bitwise Hamming distance

If None and f > 0, defaults to “angular” with a FutureWarning.

n_trees : int, default=-1

Number of trees to build. If -1, auto-selects based on dimension. More trees improve accuracy at the cost of slower queries and more memory.

n_neighbors : int, default=5

Default number of neighbors for queries (estimator parameter).

on_disk_path : str or None, default=None

Path for on-disk building. If provided, enables memory-efficient building for large indices.

prefault : bool, default=False

Whether to prefault pages when loading (may improve query latency).

seed : int or None, default=None

Random seed for tree construction. If None, uses Annoy’s default. Value 0 is treated as “use default” and emits a UserWarning.

verbose : int or None, default=None

Verbosity level (clamped to [-2, 2]). Level >= 1 enables logging.

schema_version : int, default=0

Pickle schema version marker (does not affect on-disk format).

dtype : str, default=“float32”

Data type for embeddings. Supported values:

  • “float16” / “half” / “fp16” → float16_t (16-bit half precision)

  • “float32” / “single” / “fp32” → float (32-bit single precision, default)

  • “float64” / “double” / “fp64” → double (64-bit double precision)

  • “float128” / “quad” / “fp128” → float128_t (128-bit or long double)

All types are accessed via the double-precision widened bridge. float16 values are narrowed on add_item; float128 gains no input precision but benefits from higher-precision internal arithmetic on GCC/Clang.

index_dtype : str, default=“int32”

Index identifier type. Supported values:

  • “int8” → int8_t (max 127 items)

  • “uint8” → uint8_t (max 255 items)

  • “int16” → int16_t (max 32,767 items)

  • “uint16” → uint16_t (max 65,535 items)

  • “int32” → int32_t (max 2,147,483,647 items, default)

  • “uint32” → uint32_t (max 4,294,967,295 items)

  • “int64” → int64_t (max 9,223,372,036,854,775,807 items)

  • “uint64” → uint64_t (max 18,446,744,073,709,551,615 items)

wrapper_dtype : str, default=“uint64”

Internal wrapper type (e.g., for Hamming packing). Planned future values include “bool”, “uint8”, and “uint32”.

random_dtype : str, default=“uint64”

Random seed type. Currently only “uint64” is supported.

n_jobs : int or None, default=None

Number of threads. If -1, uses all available cores.

**kwargs

Reserved for future extensibility.

Attributes:
f : int

Index.f: int

metric : str or None

Index.metric: Optional[str]

ptr : AnnoyIndexInterface*

Pointer to C++ index (NULL if not constructed).

# State Indicators (Internal)
_f_valid : bool

True if f has been set (> 0)

_metric_valid : bool

True if metric has been configured

_index_constructed : bool

True if C++ index exists (ptr != NULL)

Notes

  • 32-bit integer (4 bytes) can store values from −2**31 to 2**31−1, roughly ±2 billion.

  • 64-bit integer (8 bytes) can store values from −2**63 to 2**63−1, roughly ±9 quintillion.
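The limits above follow directly from the bit width and signedness of each integer type. As a quick check, the sketch below recomputes the maximum item ID for each index_dtype name listed in the parameters. The helper max_item_id is a hypothetical name used for illustration; it is not part of the library.

```python
def max_item_id(index_dtype):
    """Largest value representable by a dtype name such as 'int32' or 'uint64'.

    Hypothetical helper: it only understands the eight names listed above.
    """
    prefix, sep, bits = index_dtype.partition("int")
    if sep != "int" or prefix not in ("", "u") or bits not in ("8", "16", "32", "64"):
        raise ValueError(f"unsupported index dtype: {index_dtype!r}")
    width = int(bits)
    # Unsigned types use every bit for magnitude; signed types reserve one bit.
    return 2**width - 1 if prefix == "u" else 2 ** (width - 1) - 1


for name in ("int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64"):
    print(f"{name:>6}: {max_item_id(name):,}")
```

Picking the narrowest type that still covers the expected number of items is the usual motivation for changing index_dtype from its default.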

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.add_item(1, [0.2] * 128)
>>> index.build(n_trees=10)
>>> neighbors, distances = index.get_nns_by_item(0, n=5, include_distances=True)

Setting dtype:

>>> # Standard usage (float32)
>>> index = Index(f=128, metric='angular', dtype='float32')
>>>
>>> # High precision (float64)
>>> index = Index(f=128, metric='euclidean', dtype='float64')
>>>
>>> # Half precision (float16) - future
>>> # index = Index(f=128, metric='angular', dtype='float16')
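To get a feel for what narrowing to a smaller float type does to stored values, the sketch below round-trips a number through IEEE 754 half, single, and double precision using only the standard library. It illustrates the float formats themselves and makes no claim about how the wrapper converts values internally; round_trip is a hypothetical helper.

```python
import struct


def round_trip(value, fmt):
    """Pack a Python float into a struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1
half = round_trip(x, "<e")    # IEEE 754 binary16
single = round_trip(x, "<f")  # IEEE 754 binary32
double = round_trip(x, "<d")  # IEEE 754 binary64 (Python's native float)

for label, value in (("float16", half), ("float32", single), ("float64", double)):
    print(f"{label}: {value!r}  abs error = {abs(value - x):.3e}")
```

The half-precision error is several orders of magnitude larger than the single-precision one, which is the trade-off to weigh against the memory savings.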
add_item(self, item, vector) → None

Add a vector to the index.

Parameters:
item : int

Non-negative item identifier. For index_dtype='int32' the maximum valid ID is 2**31 - 1 = 2_147_483_647. For index_dtype='int64' the maximum is 2**63 - 1.

vector : sequence

Embedding vector of length f

Raises:
IndexError

If item is negative

OverflowError

If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).

RuntimeError

If index is not constructed or already built

ValueError

If vector dimension doesn’t match f

Return type:

None

Notes

  • Must be called before build()

  • Item IDs need not be contiguous

  • After build(), call unbuild() to add more items
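The two ID-related errors are easy to confuse: a negative ID raises IndexError, while an ID too large for the configured index_dtype raises OverflowError. The pure-Python sketch below mirrors that documented rule for illustration only; check_item_id and its max_id argument are hypothetical and do not exist in the library.

```python
def check_item_id(item, max_id=2**31 - 1):
    """Validate an item ID following the Raises section above.

    Hypothetical helper; max_id defaults to the int32 limit.
    """
    if item < 0:
        raise IndexError(f"item ID must be non-negative, got {item}")
    if item > max_id:
        raise OverflowError(f"item ID {item} exceeds maximum {max_id}")
    return item


check_item_id(0)          # smallest valid ID
check_item_id(2**31 - 1)  # largest valid ID for int32

for bad in (-1, 2**31):
    try:
        check_item_id(bad)
    except (IndexError, OverflowError) as exc:
        print(type(exc).__name__, "-", exc)
```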

build(self, int n_trees=-1, n_jobs=None) → None

Build the search forest (thread-safe, releases GIL).

Parameters:
n_trees : int, default=-1

Number of trees to build. If -1, auto-selects based on dimension. More trees improve accuracy at the cost of slower queries and more memory.

n_jobs : int or None, default=None

Number of threads. If -1, uses all available cores.

Raises:
RuntimeError

If index is not constructed or no items added

Return type:

None

Notes

  • Index becomes read-only after build()

  • Auto n_trees formula: max(10, 2*f)

  • Call unbuild() to add more items

  • Releases GIL during C++ build operation

  • Allows concurrent Python threads to run

  • The C++ build itself is multi-threaded (n_jobs)
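The auto-selection rule in the notes is simple enough to evaluate by hand. This sketch only applies the stated formula max(10, 2*f); auto_n_trees is a hypothetical name, and the real selection happens inside the library.

```python
def auto_n_trees(f):
    """Evaluate the documented auto rule max(10, 2*f) for dimension f."""
    if f <= 0:
        raise ValueError("f must be positive")
    return max(10, 2 * f)


for f in (3, 5, 128):
    print(f"f={f}: n_trees={auto_n_trees(f)}")
```

For low-dimensional data the floor of 10 trees applies; above f=5 the tree count grows linearly with the dimension, so passing an explicit n_trees may be preferable when memory is tight.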

Examples

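>>> # Multiple threads can build independently, each on its own index:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def worker(seed):
...     index = Index(f=8, metric='angular', seed=seed)
...     for i in range(100):
...         index.add_item(i, [float(i + j) for j in range(8)])
...     index.build(n_trees=10)  # GIL is released while the trees are built
...     return index
>>> with ThreadPoolExecutor(max_workers=4) as executor:
...     indexes = list(executor.map(worker, range(1, 5)))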
clone(self, **override_params) → Self

Create a copy of the index with optional parameter overrides.

Parameters:
**override_params : dict

Parameters to override in the clone

Returns:
index : Index

New index with same parameters (but no data)

Return type:

Self

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index2 = index1.clone(seed=123)  # Same f and metric, different seed
classmethod deserialize(cls, dict data: Dict[str, Any]) → Self

Deserialize from dictionary.

Parameters:
data : dict

Serialized state from serialize()

Returns:
index : Index

Restored index instance

Raises:
TypeError

If data is not a dict

ValueError

If data format is invalid

Return type:

Self

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> json_str = json.dumps(index.serialize(), default=str)
>>> data = json.loads(json_str)
>>> restored = Index.deserialize(data)
f

int

Embedding dimension.

Returns:
f : int

Number of dimensions (0 means “unset / lazy”).

Notes

  • Immutable after index construction

  • Setting to 0 after construction raises ValueError

Type:

Index.f

classmethod from_dict(cls, dict data: Dict[str, Any]) → Self

Alias for deserialize().

Parameters:
data : dict

Serialized state

Returns:
Index

Restored instance

Return type:

Self

get_distance(self, i, j)

Compute distance between two stored items.

Parameters:
i, j : int

Item IDs (must be < n_items). For index_dtype='int32' max is 2**31-1.

Returns:
distance : float

Distance according to index metric

Raises:
IndexError

If i or j is negative or >= n_items

OverflowError

If i or j exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).

RuntimeError

If index not constructed

Notes

  • Does not require built index

  • For Hamming metric, distance is clipped to [0, f]
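For sanity-checking values returned by get_distance, it can help to compute reference distances in plain Python. The sketch below assumes the upstream Annoy convention that angular distance is sqrt(2 * (1 - cos(u, v))); the parameter list only describes it as cosine-like, so treat that formula as an assumption. All function names here are hypothetical.

```python
import math


def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def angular(u, v):
    """Assumed Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angular distance is not defined here for zero vectors")
    cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def hamming(u, v):
    """Number of differing positions in two binary vectors (between 0 and f)."""
    return sum(1 for a, b in zip(u, v) if a != b)


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(angular([1, 0], [0, 1]))
print(hamming([1, 0, 1, 1], [0, 0, 1, 0]))
```

Small differences between these reference values and the index's results are expected when dtype is narrower than float64.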

get_feature_names_out(self, input_features=None)

Get output feature names for the transformer-style API.

Output feature names are independent of input feature names and follow a stable schema based on n_neighbors: ('neighbor_0', 'neighbor_1', ..., 'neighbor_{k-1}').

Parameters:
input_features : sequence of str or None, default=None

If provided, validated deterministically against the fitted input feature names (if feature_names_in_ was set during fit()) and against the expected input dimensionality.

Returns:
feature_names : tuple of str

Output feature names: ('neighbor_0', ..., 'neighbor_{k-1}') where k == n_neighbors.

Raises:
AttributeError

If called before fit / build.

ValueError

If input_features is provided but does not match feature_names_in_.

TypeError

If input_features elements are not strings.

Examples

>>> idx = Index(3, metric='angular').fit([[1,0,0],[0,1,0]])
>>> idx.get_feature_names_out()
('neighbor_0', 'neighbor_1', 'neighbor_2', 'neighbor_3', 'neighbor_4')
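The naming schema is regular enough to reproduce in one line, which can be handy when labeling columns outside the estimator. This sketch generates the same tuple for a given n_neighbors; neighbor_feature_names is a hypothetical helper, not a library function.

```python
def neighbor_feature_names(n_neighbors):
    """Build ('neighbor_0', ..., 'neighbor_{k-1}') for k == n_neighbors."""
    return tuple(f"neighbor_{i}" for i in range(n_neighbors))


print(neighbor_feature_names(5))
```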
get_item(self, item)

Retrieve a stored embedding vector.

Parameters:
item : int

Item ID (must be < n_items). For index_dtype='int32' max is 2**31-1.

Returns:
vector : list[float]

Embedding vector of length f

Raises:
IndexError

If item is negative or >= n_items

OverflowError

If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).

RuntimeError

If index not constructed

get_n_items(self) → int

Return number of items in the index.

Returns:
n_items : int

Number of items added (may be sparse)

Return type:

int

get_n_trees(self) → int

Return number of trees in the index.

Returns:
n_trees : int

Number of trees (0 if not built)

Return type:

int

get_nns_by_item(self, item, int n, int search_k=-1, bool include_distances=False)

Find nearest neighbors (thread-safe, releases GIL).

Parameters:
item : int

Query item ID. For index_dtype='int32' max is 2**31-1.

n : int

Number of neighbors to return

search_k : int, default=-1

Search effort. If -1, uses n_trees * n. Higher values give better accuracy but slower queries.

include_distances : bool, default=False

If True, return (neighbors, distances) tuple

Returns:
neighbors : list[int]

Item IDs of nearest neighbors

distances : list[float], optional

Distances to neighbors (only if include_distances=True)

Raises:
IndexError

If item >= n_items

OverflowError

If item exceeds the maximum for the configured index_dtype (e.g. 2**31-1 for int32, 2**63-1 for int64, 2**64-1 for uint64).

ValueError

If n <= 0

RuntimeError

If index not built

Notes

  • Releases GIL during query (true parallelism)

  • Multiple threads can query simultaneously

  • Query throughput can scale with thread count, up to the number of available cores
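As a rough guide to the search_k trade-off, this sketch computes the default stated in the parameter description (n_trees * n when search_k is -1). The helper effective_search_k is hypothetical and exists only for reasoning about query cost; the actual resolution happens inside the library.

```python
def effective_search_k(n, n_trees, search_k=-1):
    """Resolve the documented default: n_trees * n when search_k == -1."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n_trees * n if search_k == -1 else search_k


# With 10 trees and 10 requested neighbors the default budget is 100;
# an explicit search_k overrides it.
print(effective_search_k(n=10, n_trees=10))
print(effective_search_k(n=10, n_trees=10, search_k=5000))
```

Raising search_k above the default is the usual way to trade query time for recall without rebuilding the index.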

Examples

>>> # Parallel queries from multiple threads:
>>> from concurrent.futures import ThreadPoolExecutor
>>> def query_worker(index, item_id):
...     return index.get_nns_by_item(item_id, n=10)
>>> with ThreadPoolExecutor(max_workers=8) as executor:
...     results = list(executor.map(
...         lambda i: query_worker(index, i),
...         range(1000)
...     ))
>>> # True parallelism - all 8 threads run concurrently!
get_nns_by_vector(self, vector, int n, int search_k=-1, bool include_distances=False)

Query by vector (thread-safe, releases GIL).

Parameters:
vector : sequence

Query vector of length f

n : int

Number of neighbors to return

search_k : int, default=-1

Search effort. If -1, uses n_trees * n.

include_distances : bool, default=False

If True, return (neighbors, distances) tuple

Returns:
neighbors : list[int]

Item IDs of nearest neighbors

distances : list[float], optional

Distances to neighbors

Raises:
RuntimeError

If index not built

ValueError

If n <= 0, or if vector length does not match index dimension f

get_params(self, bool deep: bool = True) → Dict[str, Any]

Get parameters (sklearn-style).

Parameters:
deep : bool, default=True

If True, include nested parameters (reserved for future use)

Returns:
params : dict

Parameter dictionary with all configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> params = index.get_params()
>>> print(params['f'])
128
>>> print(params['metric'])
angular
get_state(self) → Dict[str, Any]

Get complete state dictionary.

Returns:
state : dict

Complete index state including:

  • Parameters (f, metric, etc.)

  • Index data (if built)

  • Configuration

Return type:

Dict[str, Any]

Examples

>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> state = index.get_state()
>>> print('f' in state)
True
>>> print('metric' in state)
True
is_built(self) → bool

Check if index has been built.

Returns:
built : bool

True if build() has been called

Return type:

bool

is_empty(self) → bool

Check if index has no items.

Returns:
empty : bool

True if no items added

Return type:

bool

load(self, filename, bool prefault=False) → None

Load index from disk file.

Parameters:
filename : str

Input file path

prefault : bool, default=False

Whether to prefault pages into memory

Raises:
RuntimeError

If dimensions don’t match

IOError

If file cannot be read

Return type:

None

Notes

  • Dimension f and metric must match the saved index

  • prefault=True may improve query latency at cost of load time

metric

Optional[str]

Distance metric name.

Returns:
metric : str or None

Canonical metric name, or None if not configured.

Notes

  • Immutable after index construction

  • Returns canonical name even if alias was used in constructor

Type:

Index.metric
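The note that aliases resolve to a canonical name can be pictured as a lookup table. The following is a minimal sketch, not the library's implementation: the helper canonical_metric is hypothetical, and the choice of canonical names (the first name in each alias group of the metric parameter) is an assumption made for illustration.

```python
# Hypothetical alias table built from the groups listed for the metric parameter.
# Assumption: the first name of each group is the canonical one.
_METRIC_ALIASES = {
    "angular": ("angular", "cosine"),
    "euclidean": ("euclidean", "l2", "lstsq"),
    "manhattan": ("manhattan", "l1", "cityblock", "taxicab"),
    "dot": ("dot", "@", ".", "dotproduct", "inner", "innerproduct"),
    "hamming": ("hamming",),
}

# Invert the table so that every alias maps to its canonical name.
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in _METRIC_ALIASES.items()
    for alias in aliases
}


def canonical_metric(name):
    """Return the canonical metric name for an alias (illustrative only)."""
    try:
        return _ALIAS_TO_CANONICAL[name]
    except KeyError:
        raise ValueError(f"unsupported metric: {name!r}") from None


for alias in ("cosine", "l2", "taxicab", "@"):
    print(alias, "->", canonical_metric(alias))
```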

n_neighbors

int

Default number of neighbors for queries.

Type:

Index.n_neighbors

on_disk_build(self, fn) → Self

Configure the index to build using an on-disk backing file.

Calling this method explicitly supersedes any on_disk_path set in the constructor. It is safe to call before any add_item() calls; the method ensures the C++ index is constructed first.

Parameters:
fn : str

Path to the backing file. The file is created or overwritten.

Returns:
self : Index

This instance (for method chaining).

Raises:
TypeError

If fn is not a string.

ValueError

If fn is empty.

IOError

If the C++ on_disk_build call fails (bad path, permissions…).

RuntimeError

If the index cannot be constructed (f or metric not set).

Return type:

Self

Notes

  • After calling on_disk_build, the index is backed by the file during add_item / build. Data written to disk is not a finished index file until save() or after build() completes the tree structure on disk.

  • For very large datasets that do not fit comfortably in RAM during construction this is the recommended workflow.

Examples

>>> idx = Index(3, metric='angular').on_disk_build("test.annoy")
>>> for i, v in enumerate([[1,2,3],[4,5,6],[7,8,9]]):
...     idx.add_item(i, v)
>>> idx.build(n_trees=10)
repr_info(self, bool include_n_items=True, bool include_n_trees=True, include_memory=None) → str

Rich dictionary-like string representation.

Parameters:
include_n_items : bool, default=True

Include item count

include_n_trees : bool, default=True

Include tree count

include_memory : bool or None, default=None

Include a memory usage estimate. If None, the estimate is included only if the index is built.

Returns:
repr_str : str

Dictionary-style representation

Return type:

str

Examples

>>> print(index.repr_info())
Annoy(**{'f': 128, 'metric': 'angular', 'n_items': 1000, 'n_trees': 10})
save(self, filename, bool prefault=False) → None

Save index to disk file.

Parameters:
filename : str

Output file path

prefault : bool, default=False

Whether to prefault pages during save

Raises:
RuntimeError

If index not built

IOError

If file cannot be written

Return type:

None

serialize(self) → Dict[str, Any]

Serialize to JSON-compatible dictionary.

Returns:
data : dict

JSON-serializable state

Return type:

Dict[str, Any]

Examples

>>> import json
>>> index = Index(f=128, metric='angular', seed=42)
>>> index.add_item(0, [0.1] * 128)
>>> index.build()
>>> data = index.serialize()
>>> json_str = json.dumps(data, default=str)  # handle bytes
set_params(self, **params) → Self

Set parameters (sklearn-style).

Parameters:
**params : dict

Parameters to update

Returns:
self : Index

Returns self for method chaining

Raises:
ValueError

If trying to set immutable parameters after construction

Return type:

Self

Notes

  • Cannot modify f or metric after index construction

  • Can always modify n_neighbors, seed, verbose

Examples

>>> index = Index(f=128, metric='angular')
>>> index = index.set_params(n_neighbors=10, seed=42)  # returns self
>>> index.add_item(0, [0.1] * 128)  # build() requires at least one item
>>> index.build()
set_seed(self, seed) → None

Set random seed for index construction.

Parameters:
seed : int

Non-negative integer in [0, 2**64 - 1]. 0 uses Annoy’s deterministic default seed and emits a UserWarning.

Returns:
selfIndex
Raises:
TypeError

If seed is not an integer.

ValueError

If seed is negative or exceeds uint64_t range [0, 2**64 - 1].

Return type:

None

Notes

  • Must be called before build() to take effect.

  • set_seed(R) is not on AnnoyIndexInterfaceBase; the widened set_seed_w(uint64_t) bridge is used for all concrete index types.

  • Seed 0 triggers Annoy’s deterministic default (Kiss64Random::default_seed).
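The seed rules above (integer type, uint64 range, warning on 0) can be summarized in a few lines of plain Python. This is a sketch of the documented checks only; validate_seed is a hypothetical helper and the warning text is invented for illustration.

```python
import warnings

UINT64_MAX = 2**64 - 1


def validate_seed(seed):
    """Apply the documented seed checks and return the seed unchanged."""
    if not isinstance(seed, int):
        raise TypeError(f"seed must be an integer, got {type(seed).__name__}")
    if not 0 <= seed <= UINT64_MAX:
        raise ValueError(f"seed must be in [0, 2**64 - 1], got {seed}")
    if seed == 0:
        # Illustrative message; the library's actual wording may differ.
        warnings.warn("seed=0 selects the default seed", UserWarning)
    return seed


print(validate_seed(42))

for bad in ("42", -1, 2**64):
    try:
        validate_seed(bad)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, "-", exc)
```

Passing any non-zero seed avoids the warning while keeping tree construction reproducible.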

set_state(self, dict state: Dict[str, Any]) → None

Restore state from dictionary.

Parameters:
state : dict

State dictionary from get_state()

Return type:

None

Examples

>>> index1 = Index(f=128, metric='angular', seed=42)
>>> index1.add_item(0, [0.1] * 128)
>>> index1.build()
>>> state = index1.get_state()
>>>
>>> index2 = Index()
>>> index2.set_state(state)
>>> # index2 now has same data as index1
set_verbose(self, bool v) → None

Enable/disable verbose logging.

Parameters:
v : bool

True to enable verbose output

Return type:

None

to_dict(self) → Dict[str, Any]

Alias for serialize().

Returns:
dict

Serialized state

Return type:

Dict[str, Any]

unbuild(self) → None

Remove all trees to allow adding more items.

Transitions index back to BUILDING state.

Raises:
RuntimeError

If index is not built

Return type:

None

unload(self) → None

Unmap memory-mapped files and free memory.

Transitions index to EMPTY state. Safe to call multiple times.

Return type:

None
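The lifecycle described across add_item, build, unbuild, and unload can be summarized as a small state machine. The toy model below sketches those documented transitions only: the class LifecycleModel and the state name BUILT are assumptions (the text names EMPTY and BUILDING), resetting the item count on unload is also assumed, and no real index is involved.

```python
class LifecycleModel:
    """Toy model of the documented index states; it stores no vectors."""

    def __init__(self):
        self.state = "EMPTY"
        self.n_items = 0

    def add_item(self):
        # Items can only be added before build() or after unbuild().
        if self.state == "BUILT":
            raise RuntimeError("index already built; call unbuild() first")
        self.state = "BUILDING"
        self.n_items += 1

    def build(self):
        if self.n_items == 0:
            raise RuntimeError("no items added")
        self.state = "BUILT"

    def unbuild(self):
        if self.state != "BUILT":
            raise RuntimeError("index is not built")
        self.state = "BUILDING"

    def unload(self):
        # Safe to call repeatedly; assumed to drop all items.
        self.state = "EMPTY"
        self.n_items = 0


model = LifecycleModel()
model.add_item()
model.build()
model.unbuild()   # back to BUILDING, so more items can be added
model.add_item()
model.build()
model.unload()
print(model.state)
```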