ANNImputer#

class scikitplot.impute._ann.ANNImputer(*, missing_values=nan, backend='annoy', index_access='external', index_store_path=None, index_base_dir=None, on_disk_build=False, n_trees=-1, search_k=-1, n_neighbors=5, weights='uniform', metric='angular', initial_strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False, n_jobs=-1, random_state=None)[source]#

Approximate K-nearest-neighbours (KNN) imputer with pluggable ANN backends.

ANNImputer performs vector-based imputation by querying an approximate nearest-neighbours (ANN) index instead of using exact brute-force distances as in KNNImputer.

Two backends are currently supported:

backend='annoy' (default): uses the Spotify Annoy library and the in-tree Index wrapper.
backend='voyager': uses the optional voyager package (HNSW-based index).

All high-level imputation parameters (n_neighbors, weights, metric, index_access, index_store_path, index_base_dir) are shared across backends. Backend-specific details (such as the Annoy forest size) are handled internally.

This imputer identifies approximate nearest neighbors for samples containing missing values and imputes those values using statistics computed from the retrieved neighbor vectors.

Parameters:

missing_valuesint, float, str, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

backend{‘annoy’, ‘voyager’}, default=’annoy’

Name of the approximate nearest-neighbour backend to use.

'annoy': use the modified Spotify Annoy library (in-tree wrapper).
'voyager': use the optional voyager package.

When backend='voyager' the voyager package must be installed. An ImportError is raised otherwise.

Parameters index_access, index_store_path and index_base_dir behave identically for both backends.

index_access{‘public’, ‘private’, ‘external’}, default=’external’

Controls whether and how the fitted ANN index is exposed or stored.

'public': train_index_ returns the underlying ANN index instance (backwards compatible behaviour).
'private': Any attempt to access train_index_ raises AttributeError. The index is still used internally during transform, but is not directly exposed to user code.
'external': The fitted ANN index is persisted to disk using the backend index’s save method (e.g. AnnoyIndex.save or voyager.Index.save) and only the file name (index_path_) and metadata (index_created_at_) are stored on the estimator. At runtime the index is reloaded from that file as needed. In this mode train_index_ is not available.

For production and privacy-sensitive workloads, it is strongly recommended to keep the default 'external' mode so that the underlying ANN index is not part of the public API by default.
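The access rules above can be illustrated with a small stand-in class. This is a minimal sketch, not the library's implementation: `IndexHolder` and its `_index` attribute are hypothetical names, and only the documented behaviour (access allowed for 'public', AttributeError otherwise) is modelled.

```python
class IndexHolder:
    """Hypothetical stand-in for an estimator; not part of scikitplot."""

    def __init__(self, index, index_access="external"):
        if index_access not in {"public", "private", "external"}:
            raise ValueError(f"Unknown index_access: {index_access!r}")
        self.index_access = index_access
        self._index = index  # kept internal; the property decides exposure

    @property
    def train_index_(self):
        # Only 'public' exposes the index; other modes raise AttributeError.
        if self.index_access != "public":
            raise AttributeError(
                "train_index_ is not available when "
                f"index_access={self.index_access!r}"
            )
        return self._index


holder = IndexHolder(index=object(), index_access="private")
# hasattr() catches the AttributeError raised by the property.
print(hasattr(holder, "train_index_"))  # False
```

Because the property raises AttributeError, generic introspection such as `hasattr` treats the index as absent in the 'private' and 'external' modes.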

index_store_pathstr or path-like, default=None

Target file path used when index_access='external'. The fitted ANN index is saved to this location via the backend index save method, and only the file name and metadata are stored in the estimator.

If index_access='external' and this is None, fit will automatically generate an OS-friendly unique file name of the form "<unix-timestamp>-<uuid>.<ext>" in index_base_dir (or the current working directory) and save the ANN index there.
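A file name of the documented form can be produced with the standard library alone. The sketch below is illustrative only: `make_index_path` is a hypothetical helper, the `"annoy"` extension is an assumption, and the actual UUID formatting used by the library may differ.

```python
import time
import uuid
from pathlib import Path


def make_index_path(base_dir=None, ext="annoy"):
    """Hypothetical helper: '<unix-timestamp>-<uuid>.<ext>' under base_dir."""
    # Fall back to the current working directory when no base dir is given.
    directory = Path(base_dir) if base_dir is not None else Path.cwd()
    name = f"{int(time.time())}-{uuid.uuid4().hex}.{ext}"
    return directory / name


path = make_index_path(base_dir="indexes", ext="annoy")
print(path.name)
```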

index_base_dirstr or path-like, default=None

Base directory used when automatically generating an index file name in index_access='external' mode or when building the index on disk (on_disk_build=True). If None, the current working directory is used.

This parameter is backend-agnostic and is intended to be reused unchanged by other ANN-based imputers (for example Voyager or HNSW variants) so that all of them honour the same index storage configuration.

on_disk_buildbool, default=False

Only used when backend='annoy'. Ignored for other backends.

If True, the underlying Annoy index is built using AnnoyIndex.on_disk_build, which streams the index to a backing file during construction. This can significantly reduce peak RAM usage for very large datasets.

This only affects how the index is built. How the index is stored and accessed at runtime is still controlled by index_access ('public', 'private', 'external') and index_store_path.

n_treesint, default=-1

Number of trees in the Annoy forest. Increasing the number of trees generally improves nearest-neighbor accuracy but increases build time and memory usage.

If set to -1, the value is passed as-is to the backend index implementation. Annoy may then add trees dynamically until the total number of nodes reaches approximately twice the number of items (_n_nodes >= 2 * n_items). In this case the resulting forest, and therefore the imputation result, can be stochastic.

Guidelines:

Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.
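The guidelines above can be encoded as a simple lookup. `suggest_n_trees` is a hypothetical helper, not part of scikitplot, and the handling of the exact boundaries (10k and 1M samples) is an assumption.

```python
def suggest_n_trees(n_samples):
    """Hypothetical helper: (low, high) tree-count range from the guidelines."""
    if n_samples < 10_000:
        return (10, 20)
    if n_samples <= 1_000_000:
        return (20, 50)
    # The guidelines say "50-100+", so the upper end is open; 100 is a start.
    return (50, 100)


print(suggest_n_trees(250_000))  # (20, 50)
```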

search_kint, default=-1

Backend-specific search-depth parameter.

For Annoy, this is passed as search_k to AnnoyIndex.get_nns_by_vector. Larger values inspect more nodes during search and are therefore slower but more accurate. If -1, defaults to n_trees * n_neighbors.

For Voyager, this is passed as query_ef to Index.query(): the depth of search to perform for the query. Up to query_ef candidates are searched to find the k nearest neighbors per query vector.

n_neighborsint, default=5

Number of neighboring samples used for imputation. Higher values produce smoother imputations but may reduce locality.

weights{‘uniform’, ‘distance’} or callable, default=’uniform’

Weighting strategy for neighbor contributions:

'uniform' : all neighbors have equal weight.
'distance' : inverse-distance weighting, where closer neighbors contribute more (w_ik = 1 / (1 + d(x_i, x_k))).
callable : custom function taking an array of distances and returning an array of weights.
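The three weighting options can be sketched in plain Python as follows. `neighbor_weights` is a hypothetical function written for illustration; it follows the formulas listed above but is not the library's code.

```python
def neighbor_weights(distances, weights="uniform"):
    """Sketch of the three weighting options described above."""
    if weights == "uniform":
        return [1.0 for _ in distances]
    if weights == "distance":
        # Inverse-distance weighting: w = 1 / (1 + d)
        return [1.0 / (1.0 + d) for d in distances]
    if callable(weights):
        # A custom callable maps an array of distances to an array of weights.
        return list(weights(distances))
    raise ValueError(f"Unsupported weights: {weights!r}")


distances = [0.0, 1.0, 3.0]
print(neighbor_weights(distances, "distance"))  # [1.0, 0.5, 0.25]
```

The `1 + d` term in the denominator keeps the weight finite when a neighbor is at distance zero.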

metric{“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”}, optional, default=’angular’

Distance metric used for nearest-neighbor search:

'angular' : Cosine similarity (angle only, ignores magnitude). Best for normalized embeddings (e.g., text embeddings, image features).
'euclidean' : L2 distance, defined as √Σ(xᵢ - yᵢ)². Standard geometric distance, sensitive to scale.
'manhattan' : L1 (City-block) distance, defined as Σ|xᵢ - yᵢ|. More robust to outliers than L2, still scale-sensitive.
'hamming' : Fraction or count of differing elements. Suitable for binary or categorical features (e.g., 0/1).
'dot' : Negative inner product (-x·y). Sensitive to both direction and magnitude of vectors.
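The definitions above can be written out directly for reference. This is a plain-Python sketch, not the backends' optimized code. Two points are assumptions: 'hamming' is shown in its count form, and 'angular' uses the Annoy-style distance sqrt(2 * (1 - cos)), which depends only on the angle between the vectors.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    # Count form: number of positions where the vectors differ.
    return sum(a != b for a, b in zip(x, y))


def dot(x, y):
    # Negative inner product, so larger similarity means smaller "distance".
    return -sum(a * b for a, b in zip(x, y))


def angular(x, y):
    # Assumption: Annoy-style angular distance; requires non-zero vectors.
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos = sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
print(angular([1, 0], [2, 0]))    # 0.0 (same direction, different magnitude)
```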

Aliases:

cosine <-> angular
euclidean <- l2, lstsq
manhattan <- l1, cityblock, taxicab
dot <-> innerproduct <- @, ., dotproduct, inner
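The alias table above amounts to a dictionary lookup. In this sketch, `canonical_metric` and `_METRIC_ALIASES` are hypothetical names, and the choice of 'angular' and 'dot' as the canonical spellings is an assumption; the library may normalize differently.

```python
# Hypothetical alias table built from the list above.
_METRIC_ALIASES = {
    "angular": "angular", "cosine": "angular",
    "euclidean": "euclidean", "l2": "euclidean", "lstsq": "euclidean",
    "manhattan": "manhattan", "l1": "manhattan",
    "cityblock": "manhattan", "taxicab": "manhattan",
    "dot": "dot", "innerproduct": "dot", "@": "dot", ".": "dot",
    "dotproduct": "dot", "inner": "dot",
    "hamming": "hamming",
}


def canonical_metric(name):
    """Map a metric name or alias to one canonical spelling."""
    try:
        return _METRIC_ALIASES[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported metric: {name!r}") from None


print(canonical_metric("taxicab"))  # manhattan
```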

Note that the voyager backend does not support all metrics. Passing an unsupported metric (such as "manhattan" or "hamming") with backend='voyager' raises ValueError.

See also

sklearn.impute.KNNImputer: Exact KNN-based multivariate imputer that estimates missing features from the nearest samples using brute-force search.
sklearn_ann.kneighbors.annoy.AnnoyTransformer: Wrapper for using annoy.AnnoyIndex as sklearn’s KNeighborsTransformer.

Notes

For each sample \(x_i\) and feature \(j\), the imputed value is:

\[\hat{x}_{ij} = \frac{\sum_{k \in N_i} w_{ik} x_{kj}}{\sum_{k \in N_i} w_{ik}}\]

where \(N_i\) is the set of K nearest neighbors of \(x_i\), and \(w_{ik}\) is the neighbor weight. With weights='uniform', \(w_{ik} = 1\); with weights='distance':

\(w_{ik} = \frac{1}{1 + d(x_i, x_k)}\)
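As a worked instance of the formula, the sketch below imputes a single feature from three neighbors using inverse-distance weights. `impute_value` is a hypothetical function and the numbers are made up for illustration.

```python
def impute_value(neighbor_values, distances):
    """Weighted mean of neighbor values with w = 1 / (1 + d)."""
    weights = [1.0 / (1.0 + d) for d in distances]
    weighted_sum = sum(w * v for w, v in zip(weights, neighbor_values))
    return weighted_sum / sum(weights)


# Neighbor values 3, 5, 7 at distances 0, 1, 3 get weights 1, 0.5, 0.25,
# so the estimate is (3 + 2.5 + 1.75) / 1.75, roughly 4.14.
print(impute_value([3.0, 5.0, 7.0], [0.0, 1.0, 3.0]))
```

When all distances are equal, the weights cancel and the estimate reduces to the plain mean of the neighbor values.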

ANN provides approximate neighbor search, so imputations are not exact.
Annoy uses random projections to split the vector space at each node in the tree, selecting a random hyperplane defined by two sampled points.
In index_access='public' or 'private' mode the Annoy index is kept in memory after fit for efficient queries. In index_access='external' mode the index is stored on disk and loaded on demand at transform-time.
Index creation is separate from lookup. After calling build(), no additional vectors may be added.
Index files created by Annoy are memory-mapped, allowing multiple processes to share the same data without additional memory overhead.
Annoy is optimized for scenarios with many items in moderate to high dimensional spaces where fast approximate neighbor retrieval is more important than exact results.
Annoy supports specific metrics; 'euclidean' (p=2) and 'manhattan' (p=1) are special cases of the Minkowski distance.
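
For reference, the Minkowski distance of order \(p\), in its standard textbook form (not taken from the library source), is:

\[d_p(x, y) = \left( \sum_{j} |x_j - y_j|^p \right)^{1/p}\]

Setting \(p = 1\) gives \(\sum_{j} |x_j - y_j|\), the 'manhattan' distance, and setting \(p = 2\) gives \(\sqrt{\sum_{j} (x_j - y_j)^2}\), the 'euclidean' distance.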

References

[1]

Bernhardsson, E. (2013). “ANNoy (Approximate Nearest Neighbors Oh Yeah).” Spotify AB. https://github.com/spotify/annoy

Examples

>>> import numpy as np
>>> from scikitplot.experimental import enable_aknn_imputer
>>> from scikitplot.impute import ANNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> # imputer = ANNImputer(backend="voyager", n_neighbors=5, metric="euclidean")
>>> imputer = ANNImputer(n_trees=5, n_neighbors=5)
>>> imputer.fit_transform(X)
array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [4. , 6. , 5. ],
       [8. , 8. , 7. ]])

delete_external_index()[source]#

Delete the external index file referenced by index_path_.

This helper removes the file on disk if index_path_ is set. It does not modify index_path_ itself, so subsequent calls that rely on the file (for example _get_index_for_runtime in index_access='external' mode) will fail until the estimator is re-fitted.

Any OSError raised by os.remove will propagate to the caller.
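
The documented behaviour can be sketched with a stand-in object. `ExternalIndexRef` is a hypothetical class, not the estimator itself, and the `.annoy` suffix is an assumption. The sketch removes the file, leaves `index_path_` unchanged, and lets any OSError from os.remove propagate.

```python
import os
import tempfile


class ExternalIndexRef:
    """Hypothetical stand-in holding only the path of a saved index file."""

    def __init__(self, index_path=None):
        self.index_path_ = index_path

    def delete_external_index(self):
        # Remove the file if a path is recorded; index_path_ is left untouched.
        if self.index_path_ is not None:
            os.remove(self.index_path_)  # any OSError propagates to the caller


# Create a throwaway file to play the role of a saved index.
fd, path = tempfile.mkstemp(suffix=".annoy")
os.close(fd)

ref = ExternalIndexRef(path)
ref.delete_external_index()
print(os.path.exists(path), ref.index_path_ == path)  # False True
```

In this sketch a second call raises FileNotFoundError (a subclass of OSError) because the path is still recorded but the file is gone.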

fit(X, y=None)[source]#

Fit the imputer on X and build the underlying ANN index.

This step:

validates the input data,
records which features are completely empty,
builds the backend-specific ANN index, and
fits the missing-value indicator (if enabled).
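
The second step, recording completely empty features, can be illustrated without any dependencies. `empty_feature_mask` is a hypothetical helper that assumes missing values are encoded as NaN; the estimator's own bookkeeping may differ.

```python
import math


def empty_feature_mask(rows):
    """Hypothetical helper: True for each column whose values are all NaN."""
    n_features = len(rows[0])
    return [
        all(math.isnan(row[j]) for row in rows)
        for j in range(n_features)
    ]


nan = float("nan")
X = [[1.0, nan, 2.0],
     [nan, nan, 4.0]]
print(empty_feature_mask(X))  # [False, True, False]
```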

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

Xarray-like of shape (n_samples, n_features): Input samples.
yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_paramsdict: Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_newndarray array of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)[source]#: Return output feature names, including indicator features if used.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routingMetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

paramsdict: Parameter names mapped to their values.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

selfestimator instance: Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**paramsdict: Estimator parameters.

Returns:

selfestimator instance: Estimator instance.

property train_index_#

Optionally expose the fitted ANN index (Annoy or Voyager).

This attribute is only available when index_access='public'. For other values, AttributeError is raised by OutsourcedIndexMixin._get_index.

transform(X)[source]#: Impute missing values in X using approximate nearest neighbors.
