ANNImputer
- class scikitplot.impute._ann.ANNImputer(*, missing_values=nan, backend='annoy', index_access='external', index_store_path=None, index_base_dir=None, on_disk_build=False, n_trees=-1, search_k=-1, n_neighbors=5, weights='uniform', metric='angular', initial_strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False, n_jobs=-1, random_state=None)
Approximate K-nearest-neighbours (KNN) imputer with pluggable ANN backends.
ANNImputer performs vector-based imputation by querying an approximate nearest-neighbours (ANN) index instead of computing exact brute-force distances as in KNNImputer.
Two backends are currently supported:
backend='annoy' (default): uses the Spotify Annoy library and the in-tree Index wrapper.
backend='voyager': uses the optional voyager package (HNSW-based index).
All high-level imputation parameters (n_neighbors, weights, metric, index_access, index_store_path, index_base_dir) are shared across backends. Backend-specific details (such as the Annoy forest size) are handled internally.
This imputer identifies approximate nearest neighbors for samples containing missing values and imputes those values using statistics computed from the retrieved neighbor vectors.
- Parameters:
- missing_values int, float, str, np.nan or None, default=np.nan
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes that contain missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
- backend {‘annoy’, ‘voyager’}, default=’annoy’
Name of the approximate nearest-neighbour backend to use.
'annoy': use the modified Spotify Annoy library (in-tree wrapper).
'voyager': use the optional voyager package.
When backend='voyager', the voyager package must be installed; otherwise an ImportError is raised.
Parameters index_access, index_store_path and index_base_dir behave identically for both backends.
- index_access {‘public’, ‘private’, ‘external’}, default=’external’
Controls whether and how the fitted ANN index is exposed or stored.
'public': train_index_ returns the underlying ANN index instance (backwards-compatible behaviour).
'private': Any attempt to access train_index_ raises AttributeError. The index is still used internally during transform, but is not directly exposed to user code.
'external': The fitted ANN index is persisted to disk using the backend index’s save method (e.g. AnnoyIndex.save or voyager.Index.save), and only the file name (index_path_) and metadata (index_created_at_) are stored on the estimator. At runtime the index is reloaded from that file as needed. In this mode train_index_ is not available.
For production and privacy-sensitive workloads, it is strongly recommended to keep the default 'external' mode so that the underlying ANN index is not part of the public API.
- index_store_path str or path-like, default=None
Target file path used when index_access='external'. The fitted ANN index is saved to this location via the backend index’s save method, and only the file name and metadata are stored in the estimator.
If index_access='external' and this is None, fit will automatically generate an OS-friendly unique file name of the form "<unix-timestamp>-<uuid>.<ext>" in index_base_dir (or the current working directory) and save the ANN index there.
- index_base_dir str or path-like, default=None
Base directory used when automatically generating an index file name in index_access='external' mode or when building the index on disk (on_disk_build=True). If None, the current working directory is used.
This parameter is backend-agnostic and is intended to be reused unchanged by other ANN-based imputers (for example Voyager or HNSW variants) so that all of them honour the same index storage configuration.
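The automatic file naming described for index_store_path and index_base_dir can be sketched as follows. This is a minimal illustration, not the library's code: the helper name make_index_path, the "annoy" extension, and the hyphen-free UUID form are assumptions made for the example.

```python
import time
import uuid
from pathlib import Path


def make_index_path(index_base_dir=None, ext="annoy"):
    """Build "<unix-timestamp>-<uuid>.<ext>" under index_base_dir (or the CWD).

    Hypothetical helper: the extension and the UUID formatting are assumptions.
    """
    base = Path(index_base_dir) if index_base_dir is not None else Path.cwd()
    name = f"{int(time.time())}-{uuid.uuid4().hex}.{ext}"
    return base / name


# Explicit base directory versus the current-working-directory fallback.
print(make_index_path("indexes"))
print(make_index_path())
```

Putting the timestamp first keeps files sortable by creation time, while the UUID keeps names unique when several estimators are fitted within the same second.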
- on_disk_build bool, default=False
Only used when backend='annoy'. Ignored for other backends.
If True, the underlying Annoy index is built using AnnoyIndex.on_disk_build, which streams the index to a backing file during construction. This can significantly reduce peak RAM usage for very large datasets.
This only affects how the index is built. How the index is stored and accessed at runtime is still controlled by index_access ('public', 'private', 'external') and index_store_path.
- n_trees int, default=-1
Number of trees in the Annoy forest. Increasing the number of trees generally improves nearest-neighbor accuracy but increases build time and memory usage.
If set to -1, the value is passed as-is to the backend index implementation, which builds trees dynamically until the index holds approximately twice as many nodes as items (_n_nodes >= 2 * n_items). This can lead to stochastic results.
Guidelines:
Small datasets (<10k samples): 10-20 trees.
Medium datasets (10k-1M samples): 20-50 trees.
Large datasets (>1M samples): 50-100+ trees.
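The guidelines above can be written down as a small lookup. This is only a sketch of the rule of thumb: suggest_n_trees is a hypothetical helper, not part of the library, and the handling of the exact boundaries (10k and 1M samples) is an assumption.

```python
def suggest_n_trees(n_samples):
    """Return the (low, high) tree-count range from the guidelines above.

    Hypothetical helper; boundary handling at exactly 10k and 1M is assumed.
    """
    if n_samples < 10_000:
        return (10, 20)
    if n_samples <= 1_000_000:
        return (20, 50)
    # The guideline says "50-100+", so 100 is not a hard cap.
    return (50, 100)


print(suggest_n_trees(5_000))      # (10, 20)
print(suggest_n_trees(250_000))    # (20, 50)
print(suggest_n_trees(3_000_000))  # (50, 100)
```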
- search_k int, default=-1
Backend-specific search-depth parameter.
For Annoy, this is passed as search_k to AnnoyIndex.get_nns_by_vector. Larger values inspect more nodes during search and are therefore slower but more accurate. If -1, defaults to n_trees * n_neighbors.
For Voyager, this is passed as query_ef to Index.query, the depth of search to perform for the query. Up to query_ef candidates are searched to try to find the k nearest neighbors per query vector.
- n_neighbors int, default=5
Number of neighboring samples used for imputation. Higher values produce smoother imputations but may reduce locality.
- weights {‘uniform’, ‘distance’} or callable, default=’uniform’
Weighting strategy for neighbor contributions:
'uniform': all neighbors have equal weight.
'distance': inverse-distance weighting, where closer neighbors contribute more (w_ik = 1 / (1 + d(x_i, x_k))).
callable: custom function taking an array of distances and returning an array of weights.
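The three weighting options can be sketched in plain Python. The function name neighbor_weights is hypothetical and lists stand in for arrays; only the 'uniform' rule, the 1 / (1 + d) rule, and the callable contract come from the description above.

```python
def neighbor_weights(distances, weights="uniform"):
    """Map neighbor distances to weights (hypothetical helper for illustration)."""
    if callable(weights):
        # Custom function: takes the distances, returns one weight per neighbor.
        return list(weights(distances))
    if weights == "uniform":
        return [1.0] * len(distances)
    if weights == "distance":
        # Inverse-distance rule from the docs: w = 1 / (1 + d).
        return [1.0 / (1.0 + d) for d in distances]
    raise ValueError(f"unsupported weights: {weights!r}")


d = [0.0, 1.0, 3.0]
print(neighbor_weights(d, "uniform"))   # [1.0, 1.0, 1.0]
print(neighbor_weights(d, "distance"))  # [1.0, 0.5, 0.25]
print(neighbor_weights(d, lambda ds: [2.0 ** -x for x in ds]))  # [1.0, 0.5, 0.125]
```

The "+ 1" in the denominator keeps the weight finite when a neighbor lies at distance zero.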
- metric {“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”}, optional, default=’angular’
Distance metric used for nearest-neighbor search:
'angular': Distance based on cosine similarity (angle only, ignores magnitude). Best for normalized embeddings (e.g., text embeddings, image features).
'euclidean': L2 distance, defined as √(Σ(xᵢ - yᵢ)²). Standard geometric distance, sensitive to scale.
'manhattan': L1 (city-block) distance, defined as Σ|xᵢ - yᵢ|. More robust to outliers than L2, still scale-sensitive.
'hamming': Fraction or count of differing elements. Suitable for binary or categorical features (e.g., 0/1).
'dot': Negative inner product (-x·y). Sensitive to both direction and magnitude of vectors.
Aliases:
cosine <-> angular
euclidean <- l2, lstsq
manhattan <- l1, cityblock, taxicab
dot <-> innerproduct <- @, ., dotproduct, inner
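The alias table above amounts to a many-to-one mapping onto five canonical metric names. A minimal sketch of that normalization follows; canonical_metric and the two module-level dictionaries are hypothetical names, and exact (case-sensitive) matching is an assumption.

```python
# Canonical name -> accepted aliases, copied from the alias list above.
_ALIASES = {
    "angular": ("cosine",),
    "euclidean": ("l2", "lstsq"),
    "manhattan": ("l1", "cityblock", "taxicab"),
    "dot": ("innerproduct", "@", ".", "dotproduct", "inner"),
    "hamming": (),
}

# Flatten into alias -> canonical, including each canonical name itself.
_CANONICAL = {
    alias: name
    for name, aliases in _ALIASES.items()
    for alias in (name, *aliases)
}


def canonical_metric(metric):
    """Resolve a metric alias to its canonical name (hypothetical helper)."""
    try:
        return _CANONICAL[metric]
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None


print(canonical_metric("cosine"))   # angular
print(canonical_metric("taxicab"))  # manhattan
print(canonical_metric("@"))        # dot
```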
Note that backend='voyager' does not support all metrics: requesting an unsupported metric (such as "manhattan" or "hamming") with the voyager backend raises ValueError.
See also
cosine
euclidean
cityblock
dot
hamming
- initial_strategy {‘mean’, ‘median’, ‘most_frequent’, ‘constant’}, default=’mean’
Which strategy to use to initialize the missing values when building the ANN index. This is analogous to the strategy parameter in SimpleImputer:
'mean': use the column-wise mean (ignoring NaNs).
'median': use the column-wise median (ignoring NaNs).
'most_frequent': use the column-wise mode (most frequent value, ignoring NaNs; if a column has no observed values, it falls back to 0.0).
'constant': use fill_value for all features. If fill_value is None, a default of 0.0 is used for numeric data.
This strategy affects only the temporary fill vector used to build the ANN index and the global fallback used when no valid neighbor values are available. The main imputation logic is still k-nearest-neighbours based on the ANN index.
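A rough sketch of how such a temporary fill vector could be computed is shown here. It uses only the standard library and nested lists; temp_fill_vector is a hypothetical function, and the 0.0 fallback for an all-missing column is documented above only for 'most_frequent', so applying it to 'mean' and 'median' is an assumption.

```python
import math
import statistics

# Column statistic per strategy; 'constant' is handled separately.
_STATS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "most_frequent": statistics.mode,
}


def temp_fill_vector(X, initial_strategy="mean", fill_value=None):
    """Per-feature fill values computed while ignoring NaNs (hypothetical helper)."""
    n_features = len(X[0])
    if initial_strategy == "constant":
        return [0.0 if fill_value is None else fill_value] * n_features
    if initial_strategy not in _STATS:
        raise ValueError(f"unknown initial_strategy: {initial_strategy!r}")
    stat = _STATS[initial_strategy]
    vector = []
    for j in range(n_features):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        # Assumed fallback of 0.0 when a column has no observed values.
        vector.append(stat(observed) if observed else 0.0)
    return vector


nan = float("nan")
X = [[1.0, 2.0, nan], [3.0, 4.0, 3.0], [nan, 6.0, 5.0], [8.0, 8.0, 7.0]]
print(temp_fill_vector(X, "mean"))      # [4.0, 5.0, 5.0]
print(temp_fill_vector(X, "median"))    # [3.0, 5.0, 5.0]
print(temp_fill_vector(X, "constant"))  # [0.0, 0.0, 0.0]
```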
- fill_value str or numerical value, default=None
When initial_strategy="constant", fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
- copy bool, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
- add_indicator bool, default=False
If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
- keep_empty_features bool, default=False
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0.
- n_jobs int or None, default=-1
Parallelism level used in three places:
during Annoy index construction, passed to AnnoyIndex.build,
during Voyager queries, passed to voyager.Index.query,
during imputation, used as the number of worker threads in a joblib.Parallel loop.
A value of -1 uses all available CPU cores. Using threads for the imputation step avoids spawning new Python processes and keeps this estimator compatible with editable installs and other environments where the package cannot be safely re-imported in child processes.
- random_state int or None, default=None
Seed for the backend index construction (e.g. Annoy hyperplanes, Voyager graph initialization).
- Attributes:
- indicator_ MissingIndicator
Indicator used to add binary indicators for missing values. None if add_indicator is False.
- n_features_in_ int
Number of features seen during fit.
- feature_names_in_ ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
- temp_fill_vector_ ndarray of shape (n_features_in_,)
Per-feature statistics (e.g. mean or median) used to temporarily fill missing values when building the ANN index and as a fallback when neighbor information is not available.
- index_path_ str
File path of the persisted ANN index when index_access='external'. Only set after fit.
- index_created_at_ str
UTC ISO 8601 timestamp recording when the ANN index was persisted to index_path_ in index_access='external' mode.
- train_index_ object or property
Optionally expose the fitted ANN index (Annoy or Voyager).
Warning
index_access='private' or index_access='external' prevents access to the underlying ANN index through the public API (for example train_index_). This protects against accidental leaks and misuse, but it is not a hard security boundary: any Python code running in the same process can still inspect the estimator using introspection facilities.
If you need strong confidentiality for the training data or the ANN index, do not share the ANNImputer instance with untrusted code. Instead, run it inside a separate process or service and expose only a high-level API (e.g. an /impute endpoint) rather than the Python object itself (model-as-a-service pattern).
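Why a guarded attribute is not a hard security boundary can be seen with a toy class. Everything here is hypothetical: GuardedIndexHolder and its _index attribute are stand-ins invented for the illustration and do not describe how ANNImputer stores its index internally.

```python
class GuardedIndexHolder:
    """Toy stand-in for an estimator that hides its index from the public API."""

    def __init__(self, index):
        self._index = index  # hypothetical private attribute name

    @property
    def train_index_(self):
        # Public access is refused, as in the 'private' and 'external' modes.
        raise AttributeError("train_index_ is not available in this access mode")


holder = GuardedIndexHolder(index={"vectors": [[1.0, 2.0]]})

try:
    holder.train_index_
except AttributeError as exc:
    print("public access blocked:", exc)

# Introspection in the same process still reaches the underlying object.
leaked = vars(holder)["_index"]
print("reachable via introspection:", leaked)
```

The property stops accidental use, which is the stated goal; isolating the object in another process is what stops deliberate access.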
See also
sklearn.impute.KNNImputer
Multivariate imputer that estimates missing features using nearest samples. Exact KNN-based imputer using brute-force search.
sklearn_ann.kneighbors.annoy.AnnoyTransformer
Wrapper for using annoy.AnnoyIndex as sklearn’s KNeighborsTransformer.
Notes
For each sample \(x_i\) and feature \(j\), the imputed value is:
\[\hat{x}_{ij} = \frac{\sum_{k \in N_i} w_{ik} x_{kj}}{\sum_{k \in N_i} w_{ik}}\]
where \(N_i\) is the set of K nearest neighbors of \(x_i\), and \(w_{ik}\) is the neighbor weight, which for weights='distance' is:
\[w_{ik} = \frac{1}{1 + d(x_i, x_k)}\]
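The weighted average in the formula above can be worked through for a single cell. In this sketch, impute_cell is a hypothetical function; skipping neighbors whose own value for the feature is missing is an assumption, while the fallback value mirrors the documented role of temp_fill_vector_.

```python
import math


def impute_cell(neighbor_values, neighbor_weights, fallback):
    """Weighted mean of the neighbors' values for one feature.

    Hypothetical helper. Neighbors whose value is NaN are skipped (assumed);
    if none remain, the fallback fill value is returned.
    """
    numerator = 0.0
    denominator = 0.0
    for value, weight in zip(neighbor_values, neighbor_weights):
        if math.isnan(value):
            continue
        numerator += weight * value
        denominator += weight
    return numerator / denominator if denominator > 0 else fallback


values = [3.0, 5.0, float("nan")]   # x_kj for the three neighbors
distances = [1.0, 3.0, 0.5]         # d(x_i, x_k)
weights = [1.0 / (1.0 + d) for d in distances]

# Only the first two neighbors count: (0.5 * 3 + 0.25 * 5) / (0.5 + 0.25)
print(impute_cell(values, weights, fallback=4.0))

# Uniform weights reduce to the plain mean of the observed neighbor values.
print(impute_cell([3.0, 5.0], [1.0, 1.0], fallback=4.0))  # 4.0
```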
ANN provides approximate neighbor search, so imputations are not exact.
Annoy uses random projections to split the vector space at each node in the tree, selecting a random hyperplane defined by two sampled points.
In index_access='public' or 'private' mode the Annoy index is kept in memory after fit for efficient queries. In index_access='external' mode the index is stored on disk and loaded on demand at transform-time.
Index creation is separate from lookup. After calling build(), no additional vectors may be added.
Index files created by Annoy are memory-mapped, allowing multiple processes to share the same data without additional memory overhead.
Annoy is optimized for scenarios with many items in moderate to high dimensional spaces where fast approximate neighbor retrieval is more important than exact results.
Annoy supports specific metrics; 'euclidean' (p=2) and 'manhattan' (p=1) are special cases of the Minkowski distance.
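The Minkowski relationship mentioned in the last note can be checked numerically. This is a plain-Python sketch of the textbook definition, not code from Annoy or from this package.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p: (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, p=1))  # 7.0, the Manhattan (L1) distance
print(minkowski(x, y, p=2))  # 5.0, the Euclidean (L2) distance
```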
Examples
>>> import numpy as np
>>> from scikitplot.experimental import enable_aknn_imputer
>>> from scikitplot.impute import ANNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> # imputer = ANNImputer(backend="voyager", n_neighbors=5, metric="euclidean")
>>> imputer = ANNImputer(n_trees=5, n_neighbors=5)
>>> imputer.fit_transform(X)
array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [4. , 6. , 5. ],
       [8. , 8. , 7. ]])
- delete_external_index()
Delete the external index file referenced by index_path_.
This helper removes the file on disk if index_path_ is set. It does not modify index_path_ itself, so subsequent calls that rely on the file (for example _get_index_for_runtime in index_access='external' mode) will fail until the estimator is re-fitted.
Any OSError raised by os.remove will propagate to the caller.
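The behaviour described for delete_external_index (file removed, index_path_ left untouched, OSError propagating) can be imitated with a toy class. ExternalIndexRef is a hypothetical stand-in written for this illustration; only the attribute name index_path_ and the use of os.remove come from the description above.

```python
import os
import tempfile


class ExternalIndexRef:
    """Toy holder that mimics the documented delete_external_index behaviour."""

    def __init__(self, index_path):
        self.index_path_ = index_path

    def delete_external_index(self):
        # Remove the file if a path is set; index_path_ itself is not modified.
        if getattr(self, "index_path_", None):
            os.remove(self.index_path_)  # any OSError propagates to the caller


# Create a throwaway file that plays the role of a persisted index.
fd, path = tempfile.mkstemp(suffix=".annoy")
os.close(fd)

ref = ExternalIndexRef(path)
ref.delete_external_index()
print(os.path.exists(path))     # False
print(ref.index_path_ == path)  # True: the stale path is still recorded

try:
    ref.delete_external_index()  # the file is already gone
except FileNotFoundError as exc:  # FileNotFoundError is a subclass of OSError
    print("second call failed:", type(exc).__name__)
```

Because the path outlives the file, code that later reloads the index from index_path_ has to expect a missing file until the estimator is fitted again.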
- fit(X, y=None)
Fit the imputer on X and build the underlying ANN index.
This step:
validates the input data,
records which features are completely empty,
builds the backend-specific ANN index, and
fits the missing-value indicator (if enabled).
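The second step above, recording which features are completely empty, connects to the keep_empty_features parameter. A minimal sketch on nested lists follows; both function names are hypothetical, and dropping empty features when keep_empty_features=False is an assumption modelled on scikit-learn's imputers, since only the True case is documented here.

```python
import math


def empty_feature_mask(X):
    """True for every column that contains only missing values."""
    n_features = len(X[0])
    return [all(math.isnan(row[j]) for row in X) for j in range(n_features)]


def handle_empty_features(X, keep_empty_features=False):
    """Fill empty columns with 0 (documented) or drop them (assumed)."""
    mask = empty_feature_mask(X)
    result = []
    for row in X:
        if keep_empty_features:
            result.append([0.0 if empty else v for v, empty in zip(row, mask)])
        else:
            result.append([v for v, empty in zip(row, mask) if not empty])
    return result


nan = float("nan")
X = [[1.0, nan, 2.0], [3.0, nan, 4.0]]
print(empty_feature_mask(X))                                # [False, True, False]
print(handle_empty_features(X, keep_empty_features=True))   # [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]
print(handle_empty_features(X, keep_empty_features=False))  # [[1.0, 2.0], [3.0, 4.0]]
```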
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X array-like of shape (n_samples, n_features)
Input samples.
- y array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params dict
Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
- X_new ndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)
Return output feature names, including indicator features if used.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
- self estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params dict
Estimator parameters.
- Returns:
- self estimator instance
Estimator instance.
- property train_index_
Optionally expose the fitted ANN index (Annoy or Voyager).
This attribute is only available when index_access='public'. For other values, AttributeError is raised by OutsourcedIndexMixin._get_index.