AnnoyKNNImputer
- class scikitplot.impute._annoy_knn.AnnoyKNNImputer(*, missing_values=nan, n_trees=10, search_k=-1, n_neighbors=5, weights='uniform', metric='angular', initial_strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False, n_jobs=-1, random_state=None)
Fast imputation based on approximate nearest-neighbor vector search, using the Annoy library.
This imputer replaces the exact neighbor search of sklearn.impute.KNNImputer with an approximate nearest-neighbor index (Annoy), which scales considerably better on large datasets.
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings designed to search for points in a vector space that are close to a given query point. It originates from Spotify and uses a forest of random projection trees to enable approximate nearest neighbor queries in high-dimensional spaces.
Annoy builds memory-mapped, read-only index files that can be shared across multiple processes without duplicating memory usage. Index construction is performed separately from querying, allowing the index to be built once, saved to disk, and loaded efficiently for fast lookups. This design is well suited for large-scale recommendation systems and imputation tasks where memory efficiency and scalability are critical.
This imputer uses Annoy to identify approximate nearest neighbors for samples containing missing values and imputes those values using statistics computed from the retrieved neighbor vectors.
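The flow described above can be sketched with NumPy alone. This is a minimal illustration, not the class's actual implementation: an exact brute-force Euclidean search stands in for the Annoy index, the initial fill is assumed to be the column mean, and `knn_impute_sketch` is a hypothetical helper name.

```python
import numpy as np


def knn_impute_sketch(X, n_neighbors=2):
    """Illustrative only: exact brute-force search stands in for Annoy."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    # Step 1: fill missing entries with column means so every row is a complete vector.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(mask, col_means, X)
    X_out = X_filled.copy()
    for i in np.flatnonzero(mask.any(axis=1)):
        # Step 2: Euclidean distance from row i to every row of the filled matrix.
        dists = np.linalg.norm(X_filled - X_filled[i], axis=1)
        dists[i] = np.inf  # a row is never its own neighbor
        neighbors = np.argsort(dists)[:n_neighbors]
        # Step 3: replace each missing entry with the neighbors' mean for that feature.
        # (A neighbor's value may itself be an initial fill; a real imputer may handle this differently.)
        for j in np.flatnonzero(mask[i]):
            X_out[i, j] = X_filled[neighbors, j].mean()
    return X_out


X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
X_imputed = knn_impute_sketch(X, n_neighbors=2)
```

Observed entries pass through unchanged; only the positions that were NaN in `X` are overwritten.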
- Parameters:
- missing_values : int, float, str, np.nan or None, default=np.nan
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes that contain missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
- n_trees : int, default=10
Number of trees in the Annoy forest. Increasing the number of trees generally improves nearest-neighbor accuracy but increases build time and memory usage.
If n_trees=-1, Annoy builds trees dynamically until the index reaches approximately twice the number of items (heuristic: _n_nodes >= 2 * n_items).
Guidelines:
Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.
- search_k : int or None, default=-1
Number of nodes inspected during neighbor search. Larger values yield more accurate but slower queries. If -1, defaults to n_trees * n_neighbors.
- n_neighbors : int, default=5
Number of neighboring samples used for imputation. Higher values produce smoother imputations but may reduce locality.
- weights : {'uniform', 'distance'} or callable, default='uniform'
Weighting strategy for neighbor contributions:
'uniform': all neighbors have equal weight.
'distance': inverse-distance weighting, where closer neighbors contribute more (w_ik = 1 / (1 + d(x_i, x_k))).
callable: custom function taking an array of distances and returning an array of weights.
- metric : {'angular', 'euclidean', 'manhattan', 'hamming', 'dot'}, default='angular'
Distance metric used for nearest-neighbor search:
'angular': Cosine-based distance (depends on the angle only, ignores magnitude). Best for normalized embeddings (e.g., text embeddings, image features).
'euclidean': L2 distance, defined as √(Σ(xᵢ - yᵢ)²). Standard geometric distance, sensitive to scale.
'manhattan': L1 (city-block) distance, defined as Σ|xᵢ - yᵢ|. More robust to outliers than L2, still scale-sensitive.
'hamming': Fraction or count of differing elements. Suitable for binary or categorical features (e.g., 0/1).
'dot': Negative inner product (-x·y). Sensitive to both direction and magnitude of vectors.
- initial_strategy : {'mean', 'median', 'most_frequent', 'constant'}, default='mean'
Which strategy to use to initialize the missing values. Same as the strategy parameter in SimpleImputer.
- fill_value : str or numerical value, default=None
When initial_strategy="constant", fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
- copy : bool, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
- add_indicator : bool, default=False
If True, a MissingIndicator transform will stack onto the output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.
- keep_empty_features : bool, default=False
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0.
- n_jobs : int or None, default=-1
Specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
- random_state : int or None, default=None
Seed for Annoy’s random hyperplane generation.
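As a rough guide to what the metric options measure, four of them can be sketched in NumPy. The function names are hypothetical. The angular form follows the convention documented by Annoy, sqrt(2 * (1 - cos(x, y))), and the Hamming variant shown is the count form. None of this is the imputer's own code.

```python
import numpy as np


def angular(x, y):
    # Annoy documents its angular distance as sqrt(2 * (1 - cos(x, y))).
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def euclidean(x, y):
    # L2 distance: square root of the summed squared differences.
    return np.sqrt(np.sum((x - y) ** 2))


def manhattan(x, y):
    # L1 distance: sum of absolute differences.
    return np.sum(np.abs(x - y))


def hamming_count(x, y):
    # Number of positions where the two vectors differ.
    return int(np.sum(x != y))


x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
```

Scaling either vector leaves `angular` unchanged but changes `euclidean` and `manhattan`, which is why the angular metric suits normalized embeddings.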
- Attributes:
- indicator_ : MissingIndicator
Indicator used to add binary indicators for missing values. None if add_indicator is False.
- n_features_in_ : int
Number of features seen during fit.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
- annoy_index_
Fitted Annoy index.
- fill_annoy_vector_
1-D array of nanmean values used to fill vectors for the Annoy index.
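What add_indicator and indicator_ amount to can be illustrated with plain NumPy: a binary mask is stacked next to the imputed matrix, with one column per feature that had missing values at fit time. This is only a sketch of the idea. The column-mean fill is a stand-in for the real imputation, and the variable names are illustrative.

```python
import numpy as np

X_fit = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan],
                  [7.0, 8.0, 9.0]])

# Features that had at least one missing value at fit time.
features_with_missing = np.flatnonzero(np.isnan(X_fit).any(axis=0))

# Stand-in for the imputed output (column means here, purely for illustration).
X_imputed = np.where(np.isnan(X_fit), np.nanmean(X_fit, axis=0), X_fit)

# Binary indicator columns, one per feature that was missing during fit.
indicator = np.isnan(X_fit)[:, features_with_missing].astype(float)
X_with_indicator = np.hstack([X_imputed, indicator])
```

Feature 0 has no missing values, so it contributes no indicator column and the stacked result has five columns instead of six.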
See also
sklearn.impute.KNNImputer
Multivariate imputer that estimates missing features using nearest samples. Exact KNN-based imputer using brute-force search.
sklearn_ann.kneighbors.annoy.AnnoyTransformer
Wrapper for using annoy.AnnoyIndex as sklearn's KNeighborsTransformer.
Notes
For each sample \(x_i\) and feature \(j\), the imputed value is:
\[\hat{x}_{ij} = \frac{\sum_{k \in N_i} w_{ik} x_{kj}}{\sum_{k \in N_i} w_{ik}}\]
where \(N_i\) is the set of K nearest neighbors of \(x_i\), and \(w_{ik}\) is the neighbor weight. With weights='uniform', \(w_{ik} = 1\); with weights='distance':
\(w_{ik} = \frac{1}{1 + d(x_i, x_k)}\)
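A small NumPy sketch of this weighted average for a single feature may help. `weighted_impute` is a hypothetical helper, not part of the library. It takes the neighbor values \(x_{kj}\) and distances \(d(x_i, x_k)\) as plain arrays.

```python
import numpy as np


def weighted_impute(neighbor_values, distances, weights="uniform"):
    """Weighted mean of one feature over the K retrieved neighbors."""
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    if weights == "uniform":
        w = np.ones_like(distances)
    elif weights == "distance":
        w = 1.0 / (1.0 + distances)  # w_ik = 1 / (1 + d(x_i, x_k))
    else:
        w = np.asarray(weights(distances), dtype=float)  # callable case
    return float(np.sum(w * neighbor_values) / np.sum(w))


values = [2.0, 4.0]
dists = [0.0, 1.0]
uniform_estimate = weighted_impute(values, dists)               # (2 + 4) / 2 = 3.0
distance_estimate = weighted_impute(values, dists, "distance")  # (1*2 + 0.5*4) / 1.5 = 8/3
```

The distance-weighted estimate is pulled toward the closer neighbor's value, 2.0.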
Annoy provides approximate neighbor search, so imputations are not exact.
Annoy uses random projections to split the vector space at each node in the tree, selecting a random hyperplane defined by two sampled points.
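One such split can be sketched as follows, assuming Euclidean geometry: the hyperplane is taken as equidistant from the two sampled points, and each point goes to one side according to the sign of its projection. This illustrates the idea only and does not reproduce Annoy's internal C++ code.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 8))

# Sample two distinct points; the splitting hyperplane is equidistant from them.
a, b = points[rng.choice(len(points), size=2, replace=False)]
normal = a - b
midpoint = (a + b) / 2.0

# The sign of the projection onto the normal decides which child a point goes to.
side = (points - midpoint) @ normal > 0
left, right = points[~side], points[side]
```

Applying the same split recursively to `left` and `right` yields one tree; a forest repeats this with different random samples.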
The index remains in memory after fitting for efficient queries.
Index creation is separate from lookup. After calling build(), no additional vectors may be added.
Index files created by Annoy are memory-mapped, allowing multiple processes to share the same data without additional memory overhead.
Annoy is optimized for scenarios with many items in moderate to high dimensional spaces where fast approximate neighbor retrieval is more important than exact results.
Annoy supports specific metrics; 'euclidean' (p=2) and 'manhattan' (p=1) are special cases of the Minkowski distance.
Examples
>>> import numpy as np
>>> from scikitplot.experimental import enable_annoyknn_imputer
>>> from scikitplot.impute import AnnoyKNNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> imputer = AnnoyKNNImputer(n_trees=5, n_neighbors=5)
>>> imputer.fit_transform(X)
array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [4. , 6. , 5. ],
       [8. , 8. , 7. ]])
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)
Return output feature names, including indicator features if used.
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas", "polars"}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
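The <component>__<parameter> form is easiest to see on a small pipeline. The sketch below uses scikit-learn's SimpleImputer as a stand-in for AnnoyKNNImputer so that it runs without scikitplot installed. The step names "impute" and "scale" are arbitrary choices for illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# <component>__<parameter> reaches into the named pipeline step.
pipe.set_params(impute__strategy="median", scale__with_mean=False)

# get_params(deep=True) exposes the same nested keys.
params = pipe.get_params(deep=True)
```

The same pattern should apply with an AnnoyKNNImputer step, for example a key such as impute__n_trees, assuming the step is named "impute".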