AnnoyKNNImputer

class scikitplot.impute._annoy_knn.AnnoyKNNImputer(*, missing_values=nan, n_trees=10, search_k=-1, n_neighbors=5, weights='uniform', metric='angular', initial_strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False, n_jobs=-1, random_state=None)

Fast approximate nearest-neighbor imputation using the Annoy library.

This imputer replaces the exact neighbor search of sklearn.impute.KNNImputer with an approximate nearest-neighbor index (Annoy), which scales substantially better on large datasets.

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings designed to search for points in a vector space that are close to a given query point. It originates from Spotify and uses a forest of random projection trees to enable approximate nearest neighbor queries in high-dimensional spaces.

Annoy builds memory-mapped, read-only index files that can be shared across multiple processes without duplicating memory usage. Index construction is performed separately from querying, allowing the index to be built once, saved to disk, and loaded efficiently for fast lookups. This design is well suited for large-scale recommendation systems and imputation tasks where memory efficiency and scalability are critical.

This imputer uses Annoy to identify approximate nearest neighbors for samples containing missing values and imputes those values using statistics computed from the retrieved neighbor vectors.
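The overall flow described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: a brute-force Euclidean search stands in for the Annoy index, the helper names (column_means, knn_impute_sketch) are hypothetical, and both the mean-fill before searching and the restriction to neighbors that observed the feature are assumptions.

```python
import math

NAN = float("nan")


def column_means(X):
    """Mean of each column, ignoring NaN entries."""
    means = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return means


def knn_impute_sketch(X, n_neighbors=2):
    """Mean-fill, rank the other rows by distance, average their values.

    Brute-force search is used here in place of an Annoy index.
    """
    means = column_means(X)
    # Step 1: temporary fill so every row is a complete vector.
    filled = [
        [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in X
    ]
    result = [list(row) for row in filled]
    for i, row in enumerate(X):
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            continue
        # Step 2: rank all other rows by distance to the filled query row.
        ranked = sorted(
            (k for k in range(len(X)) if k != i),
            key=lambda k: math.dist(filled[i], filled[k]),
        )
        for j in missing:
            # Step 3: average feature j over the nearest rows that observed it.
            donors = [X[k][j] for k in ranked if not math.isnan(X[k][j])]
            donors = donors[:n_neighbors]
            if donors:  # otherwise keep the column-mean fill
                result[i][j] = sum(donors) / len(donors)
    return result


X = [[1.0, 2.0, NAN], [3.0, 4.0, 3.0], [NAN, 6.0, 5.0], [8.0, 8.0, 7.0]]
X_imputed = knn_impute_sketch(X, n_neighbors=2)
# Row 0, feature 2 becomes 4.0; row 2, feature 0 becomes 5.5.
```

Because the search here is exact and the donor rule is assumed, the numbers need not match what AnnoyKNNImputer returns for the same input.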

Parameters:

missing_values : int, float, str, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes that contain missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

n_trees : int, default=10

Number of trees in the Annoy forest. Increasing the number of trees generally improves nearest-neighbor accuracy but increases build time and memory usage.

If n_trees=-1, Annoy builds trees dynamically until the index holds approximately twice as many nodes as items (heuristic: _n_nodes >= 2 * n_items).

Guidelines:

Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.

search_k : int or None, default=-1

Number of nodes inspected during neighbor search. Larger values yield more accurate but slower queries. If -1, defaults to n_trees * n_neighbors.
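The n_trees guidelines and the search_k default can be written out as two small helpers. Both function names are hypothetical and exist only for this sketch; the thresholds and the n_trees * n_neighbors rule are taken from the text above, and the upper bound of 100 for large datasets is a simplification of "100+".

```python
def suggest_n_trees(n_samples):
    """Return the (low, high) tree-count range from the guidelines."""
    if n_samples < 10_000:
        return (10, 20)
    if n_samples <= 1_000_000:
        return (20, 50)
    return (50, 100)  # the guideline says "50-100+", so 100 is not a hard cap


def resolve_search_k(search_k, n_trees, n_neighbors):
    """Apply the documented default when search_k is -1."""
    if search_k == -1:
        return n_trees * n_neighbors
    return search_k


low, high = suggest_n_trees(250_000)
effective_search_k = resolve_search_k(-1, n_trees=low, n_neighbors=5)
```

With 250,000 samples the suggested range is 20 to 50 trees, and with 20 trees and 5 neighbors the default search inspects 100 nodes.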

n_neighbors : int, default=5

Number of neighboring samples used for imputation. Higher values produce smoother imputations but may reduce locality.

weights : {'uniform', 'distance'} or callable, default='uniform'

Weighting strategy for neighbor contributions:

'uniform' : all neighbors have equal weight.
'distance' : inverse-distance weighting, where closer neighbors contribute more (w_ik = 1 / (1 + d(x_i, x_k))).
callable : custom function taking an array of distances and returning an array of weights.
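The three weighting options can be sketched as follows. The helper neighbor_weights is hypothetical and uses plain lists where the imputer works with arrays; the 'distance' branch follows the formula given in the list above, and the Gaussian kernel is only an example of a custom callable.

```python
import math


def neighbor_weights(distances, weights="uniform"):
    """Turn neighbor distances into weights for the three documented options."""
    if weights == "uniform":
        return [1.0] * len(distances)
    if weights == "distance":
        # Inverse-distance weighting: w = 1 / (1 + d)
        return [1.0 / (1.0 + d) for d in distances]
    if callable(weights):
        return list(weights(distances))
    raise ValueError(f"unsupported weights: {weights!r}")


def gaussian_kernel(distances):
    """Example custom callable: weights decay as exp(-d^2)."""
    return [math.exp(-d * d) for d in distances]


distances = [0.0, 1.0, 3.0]
uniform_w = neighbor_weights(distances)               # [1.0, 1.0, 1.0]
inverse_w = neighbor_weights(distances, "distance")   # [1.0, 0.5, 0.25]
custom_w = neighbor_weights(distances, gaussian_kernel)
```

The added 1 in the denominator keeps the weight finite when a neighbor sits at distance 0.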

metric : {'angular', 'euclidean', 'manhattan', 'hamming', 'dot'}, default='angular'

Distance metric used for nearest-neighbor search:

'angular' : Cosine similarity (angle only, ignores magnitude). Best for normalized embeddings (e.g., text embeddings, image features).
'euclidean' : L2 distance, defined as √Σ(xᵢ - yᵢ)². Standard geometric distance, sensitive to scale.
'manhattan' : L1 (City-block) distance, defined as Σ|xᵢ - yᵢ|. More robust to outliers than L2, still scale-sensitive.
'hamming' : Fraction or count of differing elements. Suitable for binary or categorical features (e.g., 0/1).
'dot' : Negative inner product (-x·y). Sensitive to both direction and magnitude of vectors.
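For reference, the five metrics can be written as plain-Python functions. This is an illustrative sketch rather than Annoy's C++ code: the angular distance uses the form given in Annoy's README, sqrt(2 * (1 - cos(x, y))), hamming is shown as a count, and dot follows the negative-inner-product convention stated above. The function names are chosen for this example.

```python
import math


def angular(x, y):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def euclidean(x, y):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Count of positions where the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def dot_distance(x, y):
    """Negative inner product, so larger dot products rank as closer."""
    return -sum(a * b for a, b in zip(x, y))


# Same direction, different magnitude: angular sees no difference,
# euclidean does.
same_direction_angular = angular([1.0, 0.0], [2.0, 0.0])
same_direction_euclidean = euclidean([1.0, 0.0], [2.0, 0.0])
```

The last two lines show why 'angular' suits normalized embeddings: scaling a vector leaves its angular distance at 0 while its Euclidean distance changes.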

initial_strategy : {'mean', 'median', 'most_frequent', 'constant'}, default='mean'

Which strategy to use to initialize the missing values. Same as the strategy parameter in SimpleImputer.

fill_value : str or numerical value, default=None

When initial_strategy="constant", fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.

copy : bool, default=True

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.

add_indicator : bool, default=False

If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
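The stacking behavior, including the caveat about features that had no missing values at fit time, can be illustrated with a small plain-Python sketch. The helpers fit_indicator_features and stack_indicator are hypothetical stand-ins for what MissingIndicator does here, and the imputed values in the example are made up for illustration.

```python
import math

NAN = float("nan")


def fit_indicator_features(X_train):
    """Indices of features with at least one missing value at fit time."""
    n_features = len(X_train[0])
    return [
        j for j in range(n_features)
        if any(math.isnan(row[j]) for row in X_train)
    ]


def stack_indicator(X_imputed, X_raw, features):
    """Append one 0/1 missingness column per tracked feature."""
    return [
        list(imputed) + [1.0 if math.isnan(raw[j]) else 0.0 for j in features]
        for imputed, raw in zip(X_imputed, X_raw)
    ]


X_train = [[1.0, NAN], [2.0, 3.0]]
tracked = fit_indicator_features(X_train)   # only feature 1 is tracked

# At transform time feature 0 is missing too, but it gets no indicator column.
X_test_raw = [[NAN, NAN]]
X_test_imputed = [[1.5, 3.0]]               # illustrative imputed values
stacked = stack_indicator(X_test_imputed, X_test_raw, tracked)
```

The output row has three columns: the two imputed features followed by a single indicator for feature 1.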

keep_empty_features : bool, default=False

If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0.

n_jobs : int or None, default=-1

Specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

random_state : int or None, default=None

Seed for Annoy’s random hyperplane generation.

Attributes:

indicator_ : MissingIndicator

Indicator used to add binary indicators for missing values. None if add_indicator is False.

n_features_in_ : int

Number of features seen during fit.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

annoy_index_

Fitted Annoy index.

fill_annoy_vector_

1-D nanmean array used to fill vectors passed to Annoy.

See also

sklearn.impute.KNNImputer: Multivariate imputer that estimates missing features using nearest samples. Exact KNN-based imputer using brute-force search.
sklearn_ann.kneighbors.annoy.AnnoyTransformer: Wrapper for using annoy.AnnoyIndex as sklearn's KNeighborsTransformer.

Notes

For each sample \(x_i\) and feature \(j\), the imputed value is:

\[\hat{x}_{ij} = \frac{\sum_{k \in N_i} w_{ik} x_{kj}}{\sum_{k \in N_i} w_{ik}}\]

where \(N_i\) is the set of K nearest neighbors of \(x_i\), and \(w_{ik}\) is the neighbor weight. With weights='uniform', \(w_{ik} = 1\); with weights='distance':

\[w_{ik} = \frac{1}{1 + d(x_i, x_k)}\]
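
As a worked instance of the formula, take three hypothetical neighbors whose values for feature \(j\) are 3, 5 and 7, at distances 0, 1 and 3 from \(x_i\). With inverse-distance weights \(1 / (1 + d)\), these give weights 1, 0.5 and 0.25, so:

\[\hat{x}_{ij} = \frac{1 \cdot 3 + 0.5 \cdot 5 + 0.25 \cdot 7}{1 + 0.5 + 0.25} = \frac{7.25}{1.75} \approx 4.14\]

With uniform weights the same neighbors give the plain mean, \((3 + 5 + 7) / 3 = 5\), so distance weighting pulls the estimate toward the closest neighbor.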

Annoy provides approximate neighbor search, so imputations are not exact.
Annoy uses random projections to split the vector space at each node in the tree, selecting a random hyperplane defined by two sampled points.
The index remains in memory after fitting for efficient queries.
Index creation is separate from lookup. After calling build(), no additional vectors may be added.
Index files created by Annoy are memory-mapped, allowing multiple processes to share the same data without additional memory overhead.
Annoy is optimized for scenarios with many items in moderate to high dimensional spaces where fast approximate neighbor retrieval is more important than exact results.
Annoy supports specific metrics; 'euclidean' (p=2) and 'manhattan' (p=1) are special cases of the Minkowski distance.
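
The hyperplane split mentioned in the notes can be sketched in a few lines. This is a simplified illustration of a single split, not Annoy's actual tree-building code: it samples two points, takes the hyperplane equidistant from them, and sends each point to one side. The function name split_by_random_hyperplane is hypothetical.

```python
import random


def split_by_random_hyperplane(points, rng):
    """Split points by the hyperplane equidistant from two sampled points."""
    a, b = rng.sample(points, 2)
    # The normal points from b to a; the plane passes through their midpoint.
    normal = [ai - bi for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left, right = [], []
    for p in points:
        side = sum(n * pi for n, pi in zip(normal, p)) - offset
        (left if side > 0 else right).append(p)
    return left, right


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
          [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
left, right = split_by_random_hyperplane(points, random.Random(0))
```

A tree applies such splits recursively until each leaf holds few points, and a forest repeats this with different random choices, which is why more trees generally improve recall.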

References

[1] Bernhardsson, E. (2013). "Annoy: Approximate Nearest Neighbors Oh Yeah." Spotify AB. https://github.com/spotify/annoy

Examples

>>> import numpy as np
>>> from scikitplot.experimental import enable_annoyknn_imputer
>>> from scikitplot.impute import AnnoyKNNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> imputer = AnnoyKNNImputer(n_trees=5, n_neighbors=5)
>>> imputer.fit_transform(X)
array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [4. , 6. , 5. ],
       [8. , 8. , 7. ]])

fit(X, y=None): Fit the imputer on X and build the Annoy index.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X : array-like of shape (n_samples, n_features)

Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:

X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names_out(input_features=None): Return output feature names, including indicator features if used.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.

transform(X): Impute missing values in X using approximate nearest neighbors.
