ANNImputer

class scikitplot.impute._ann.ANNImputer(*, missing_values=nan, backend='annoy', index_access='external', index_store_path=None, index_base_dir=None, on_disk_build=False, n_trees=-1, search_k=-1, n_neighbors=5, weights='uniform', metric='angular', initial_strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False, n_jobs=-1, random_state=None)

Approximate K-nearest-neighbours (KNN) imputer with pluggable ANN backends.

ANNImputer performs vector-based imputation by querying an approximate nearest-neighbours (ANN) index instead of using exact brute-force distances as in KNNImputer.

Two backends are currently supported:

  • backend='annoy' (default): uses the Spotify Annoy library and the in-tree Index wrapper.

  • backend='voyager': uses the optional voyager package (HNSW-based index).

All high-level imputation parameters (n_neighbors, weights, metric, index_access, index_store_path, index_base_dir) are shared across backends. Backend-specific details (such as the Annoy forest size) are handled internally.

This imputer identifies approximate nearest neighbors for samples containing missing values and imputes those values using statistics computed from the retrieved neighbor vectors.

Parameters:
missing_valuesint, float, str, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

backend{‘annoy’, ‘voyager’}, default=’annoy’

Name of the approximate nearest-neighbour backend to use.

  • 'annoy': use the modified Spotify Annoy library (in-tree wrapper).

  • 'voyager': use the optional voyager package.

When backend='voyager' the voyager package must be installed. An ImportError is raised otherwise.

Parameters index_access, index_store_path and index_base_dir behave identically for both backends.

index_access{‘public’, ‘private’, ‘external’}, default=’external’

Controls whether and how the fitted ANN index is exposed or stored.

  • 'public': train_index_ returns the underlying ANN index instance (backwards compatible behaviour).

  • 'private': Any attempt to access train_index_ raises AttributeError. The index is still used internally during transform, but is not directly exposed to user code.

  • 'external': The fitted ANN index is persisted to disk using the backend index’s save method (e.g. AnnoyIndex.save or voyager.Index.save) and only the file name (index_path_) and metadata (index_created_at_) are stored on the estimator. At runtime the index is reloaded from that file as needed. In this mode train_index_ is not available.

For production and privacy-sensitive workloads, it is strongly recommended to keep the default 'external' mode so that the underlying ANN index is not part of the public API by default.

index_store_pathstr or path-like, default=None

Target file path used when index_access='external'. The fitted ANN index is saved to this location via the backend index save method, and only the file name and metadata are stored in the estimator.

If index_access='external' and this is None, fit will automatically generate an OS-friendly unique file name of the form "<unix-timestamp>-<uuid>.<ext>" in index_base_dir (or the current working directory) and save the ANN index there.
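The naming scheme described above can be sketched with the standard library. This is only an illustration: the helper name make_index_path, the "annoy" extension and the hex form of the UUID are assumptions, not the library's actual implementation.

```python
import time
import uuid
from pathlib import Path


def make_index_path(base_dir=None, ext="annoy"):
    """Hypothetical helper: build '<unix-timestamp>-<uuid>.<ext>' under base_dir."""
    # Fall back to the current working directory when no base directory is given.
    directory = Path(base_dir) if base_dir is not None else Path.cwd()
    # Whole seconds since the epoch plus a random UUID keep the name unique
    # and free of characters that are awkward on common file systems.
    name = f"{int(time.time())}-{uuid.uuid4().hex}.{ext}"
    return directory / name


path = make_index_path("indexes")
print(path.name)
```

Passing an explicit index_store_path avoids relying on a generated name when the index file has to be located later by other tooling.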

index_base_dirstr or path-like, default=None

Base directory used when automatically generating an index file name in index_access='external' mode or when building the index on disk (on_disk_build=True). If None, the current working directory is used.

This parameter is backend-agnostic and is intended to be reused unchanged by other ANN-based imputers (for example Voyager or HNSW variants) so that all of them honour the same index storage configuration.

on_disk_buildbool, default=False

Only used when backend='annoy'. Ignored for other backends.

If True, the underlying Annoy index is built using AnnoyIndex.on_disk_build, which streams the index to a backing file during construction. This can significantly reduce peak RAM usage for very large datasets.

This only affects how the index is built. How the index is stored and accessed at runtime is still controlled by index_access ('public', 'private', 'external') and index_store_path.

n_treesint, default=-1

Number of trees in the Annoy forest. Increasing the number of trees generally improves nearest-neighbor accuracy but increases build time and memory usage.

If set to -1, the value is passed as-is to the backend index implementation, which may build trees dynamically until the index holds approximately twice as many nodes as items (_n_nodes >= 2 * n_items). This can lead to stochastic results.

Guidelines:

  • Small datasets (<10k samples): 10-20 trees.

  • Medium datasets (10k-1M samples): 20-50 trees.

  • Large datasets (>1M samples): 50-100+ trees.

search_kint, default=-1

Backend-specific search-depth parameter.

For Annoy, this is passed as search_k to AnnoyIndex.get_nns_by_vector. Larger values inspect more nodes during search and are therefore slower but more accurate. If -1, defaults to n_trees * n_neighbors.

For Voyager, this is passed as query_ef to Index.query and sets the depth of search to perform for the query. Up to query_ef candidates are searched to try to find the k nearest neighbors per query vector.

n_neighborsint, default=5

Number of neighboring samples used for imputation. Higher values produce smoother imputations but may reduce locality.

weights{‘uniform’, ‘distance’} or callable, default=’uniform’

Weighting strategy for neighbor contributions:

  • 'uniform' : all neighbors have equal weight.

  • 'distance' : inverse-distance weighting, where closer neighbors contribute more (w_ik = 1 / (1 + d(x_i, x_k))).

  • callable : custom function taking an array of distances and returning an array of weights.
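The three weighting options can be sketched as follows. The helper neighbor_weights is hypothetical and only mirrors the formulas listed above; it is not the library's internal function.

```python
import numpy as np


def neighbor_weights(distances, weights="uniform"):
    """Illustrative helper that turns neighbor distances into weights."""
    d = np.asarray(distances, dtype=float)
    if weights == "uniform":
        # Every neighbor contributes equally.
        return np.ones_like(d)
    if weights == "distance":
        # Inverse-distance form quoted above: w = 1 / (1 + d).
        return 1.0 / (1.0 + d)
    if callable(weights):
        # A custom callable maps an array of distances to an array of weights.
        return np.asarray(weights(d), dtype=float)
    raise ValueError(f"unsupported weights: {weights!r}")


d = [0.0, 1.0, 3.0]
print(neighbor_weights(d, "uniform"))
print(neighbor_weights(d, "distance"))          # values 1, 0.5 and 0.25
print(neighbor_weights(d, lambda x: np.exp(-x)))
```

The 1 / (1 + d) form stays finite for a neighbor at distance zero, which a plain 1 / d weighting would not.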

metric{“angular”, “cosine”, “euclidean”, “l2”, “lstsq”, “manhattan”, “l1”, “cityblock”, “taxicab”, “dot”, “@”, “.”, “dotproduct”, “inner”, “innerproduct”, “hamming”}, optional, default=’angular’

Distance metric used for nearest-neighbor search:

  • 'angular' : Cosine similarity (angle only, ignores magnitude). Best for normalized embeddings (e.g., text embeddings, image features).

  • 'euclidean' : L2 distance, defined as √Σ(xᵢ - yᵢ)². Standard geometric distance, sensitive to scale.

  • 'manhattan' : L1 (City-block) distance, defined as Σ|xᵢ - yᵢ|. More robust to outliers than L2, still scale-sensitive.

  • 'hamming' : Fraction or count of differing elements. Suitable for binary or categorical features (e.g., 0/1).

  • 'dot' : Negative inner product (-x·y). Sensitive to both direction and magnitude of vectors.
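For reference, the distances listed above can be written out in plain Python. This is a sketch of the definitions, not the backend code; the angular function follows the form given in Annoy's README, sqrt(2 * (1 - cos)), and hamming is shown in its count variant.

```python
import math


def cosine_similarity(x, y):
    # Depends only on the angle between x and y, not on their lengths.
    dot_xy = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot_xy / (norm_x * norm_y)


def angular(x, y):
    # Annoy's README gives its angular distance as sqrt(2 * (1 - cos)).
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(x, y))))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    # Count of positions where the two vectors differ.
    return sum(a != b for a, b in zip(x, y))


def dot(x, y):
    # Negative inner product, so that smaller values mean "closer".
    return -sum(a * b for a, b in zip(x, y))


x, y = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]   # y is x scaled by 2
print(angular(x, y), euclidean(x, y), manhattan(x, y), dot(x, y))
```

Because y is a scaled copy of x, the angular distance is zero while the Euclidean and Manhattan distances are not, which is the scale sensitivity noted in the list.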

Aliases:

  • cosine <-> angular

  • euclidean <- l2, lstsq

  • manhattan <- l1, cityblock, taxicab

  • dot <-> innerproduct <- @, ., dotproduct, inner
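One way to picture the alias table is a lookup that maps every accepted spelling to a single canonical name. The dictionary and the function canonical_metric below are hypothetical, and the choice of which name in each group counts as canonical is an assumption made for this sketch.

```python
# Hypothetical alias table built from the list above.
_METRIC_ALIASES = {
    "angular": "angular", "cosine": "angular",
    "euclidean": "euclidean", "l2": "euclidean", "lstsq": "euclidean",
    "manhattan": "manhattan", "l1": "manhattan",
    "cityblock": "manhattan", "taxicab": "manhattan",
    "dot": "dot", "innerproduct": "dot", "@": "dot", ".": "dot",
    "dotproduct": "dot", "inner": "dot",
    "hamming": "hamming",
}


def canonical_metric(name):
    """Map a metric name or alias to its canonical spelling."""
    key = str(name).strip().lower()
    try:
        return _METRIC_ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown metric: {name!r}") from None


print(canonical_metric("cityblock"))   # manhattan
print(canonical_metric("@"))           # dot
```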

Note that the voyager backend does not support all metrics. Passing an unsupported metric (such as "manhattan" or "hamming") with backend='voyager' raises ValueError.

See also

cosine euclidean cityblock dot hamming

initial_strategy{‘mean’, ‘median’, ‘most_frequent’, ‘constant’}, default=’mean’

Which strategy to use to initialize the missing values when building the ANN index. This is analogous to the strategy parameter in SimpleImputer:

  • 'mean': use the column-wise mean (ignoring NaNs).

  • 'median': use the column-wise median (ignoring NaNs).

  • 'most_frequent': use the column-wise mode (most frequent value, ignoring NaNs; if a column has no observed values, it falls back to 0.0).

  • 'constant': use fill_value for all features. If fill_value is None, a default of 0.0 is used for numeric data.

This strategy affects only the temporary fill vector used to build the ANN index and the global fallback used when no valid neighbor values are available. The main imputation logic is still k-nearest-neighbours based on the ANN index.
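A minimal sketch of how such a temporary fill vector could be computed, using only the standard library. The function initial_fill_vector is hypothetical, and falling back to 0.0 for an all-missing column under 'mean' and 'median' is an assumption (the text above states it only for 'most_frequent').

```python
import math
import statistics


def initial_fill_vector(rows, strategy="mean", fill_value=None):
    """Illustrative column-wise fill statistics that skip NaN entries."""
    fill = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if strategy == "constant":
            fill.append(0.0 if fill_value is None else float(fill_value))
        elif not observed:
            # Assumed fallback for a column with no observed values.
            fill.append(0.0)
        elif strategy == "mean":
            fill.append(statistics.fmean(observed))
        elif strategy == "median":
            fill.append(float(statistics.median(observed)))
        elif strategy == "most_frequent":
            fill.append(float(statistics.mode(observed)))
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return fill


nan = float("nan")
X = [[1.0, 2.0, nan], [3.0, 4.0, 3.0], [nan, 6.0, 5.0], [8.0, 8.0, 7.0]]
print(initial_fill_vector(X, "mean"))     # [4.0, 5.0, 5.0]
print(initial_fill_vector(X, "median"))   # [3.0, 5.0, 5.0]
```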

fill_valuestr or numerical value, default=None

When initial_strategy="constant", fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

copybool, default=True

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.

add_indicatorbool, default=False

If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
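The fit-time behaviour of the indicator columns can be illustrated without the library. Both helpers below are hypothetical stand-ins for what MissingIndicator does when stacked onto the output; they only show which columns get an indicator.

```python
import math


def fit_indicator_columns(X_fit):
    """Indices of features that contain at least one NaN at fit time."""
    n_features = len(X_fit[0])
    return [j for j in range(n_features)
            if any(math.isnan(row[j]) for row in X_fit)]


def append_indicator(X_imputed, X_original, columns):
    """Stack one 0/1 column per fitted indicator feature onto the imputed rows."""
    return [
        list(imp) + [1.0 if math.isnan(orig[j]) else 0.0 for j in columns]
        for imp, orig in zip(X_imputed, X_original)
    ]


nan = float("nan")
columns = fit_indicator_columns([[1.0, nan], [2.0, 3.0]])   # only feature 1
X_test = [[nan, nan]]
X_test_imputed = [[1.5, 3.0]]   # made-up imputed values for illustration
print(append_indicator(X_test_imputed, X_test, columns))
```

Feature 0 is missing in the test row, but it had no missing values at fit time, so the output gains a single indicator column (for feature 1) rather than two.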

keep_empty_featuresbool, default=False

If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0.

n_jobsint or None, default=-1

Parallelism level used in three places:

  • during Annoy index construction, passed to AnnoyIndex.build,

  • during Voyager queries, passed to Index.query,

  • during imputation, used as the number of worker threads in a joblib.Parallel loop.

A value of -1 uses all available CPU cores. Using threads for the imputation step avoids spawning new Python processes and keeps this estimator compatible with editable installs and other environments where the package cannot be safely re-imported in child processes.

random_stateint or None, default=None

Seed for the backend index construction (e.g. Annoy hyperplanes, Voyager graph initialization).

Attributes:
indicator_MissingIndicator

Indicator used to add binary indicators for missing values. None if add_indicator is False.

n_features_in_int

Number of features seen during fit.

feature_names_in_ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

temp_fill_vector_ndarray of shape (n_features_in_,)

Per-feature statistics (e.g. mean or median) used to temporarily fill missing values when building the ANN index and as a fallback when neighbor information is not available.

index_path_str

File path of the persisted ANN index when index_access='external'. Only set after fit.

index_created_at_str

UTC ISO 8601 timestamp recording when the ANN index was persisted to index_path_ in index_access='external' mode.

train_index_object or property

Optionally expose the fitted ANN index (Annoy or Voyager).

Warning

index_access='private' or index_access='external' prevents access to the underlying ANN index through the public API (for example train_index_). This protects against accidental leaks and misuse, but it is not a hard security boundary: any Python code running in the same process can still inspect the estimator using introspection facilities.

If you need strong confidentiality for the training data or the ANN index, do not share the ANNImputer instance with untrusted code. Instead, run it inside a separate process or service and expose only a high-level API (e.g. an /impute endpoint) rather than the Python object itself (model-as-a-service pattern).
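The following sketch shows why hiding the index behind a property is not a security boundary. GuardedIndexHolder and its _index attribute are hypothetical; they stand in for any estimator that blocks access through its public API only.

```python
class GuardedIndexHolder:
    """Hypothetical stand-in for an estimator that hides its index."""

    def __init__(self, index, index_access="private"):
        self.index_access = index_access
        self._index = index  # hypothetical private attribute name

    @property
    def train_index_(self):
        # The public accessor refuses unless access is explicitly public.
        if self.index_access != "public":
            raise AttributeError("train_index_ is not available in this mode")
        return self._index


holder = GuardedIndexHolder(index=[0.1, 0.2, 0.3])
try:
    holder.train_index_
except AttributeError as exc:
    print("public API blocked:", exc)

# Introspection in the same process still reaches the underlying object.
print(vars(holder)["_index"])
```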

See also

sklearn.impute.KNNImputer

Multivariate imputer that estimates missing features using nearest samples. Exact KNN-based imputer using brute-force search.

sklearn_ann.kneighbors.annoy.AnnoyTransformer

Wrapper for using annoy.AnnoyIndex as sklearn’s KNeighborsTransformer.

Notes

For each sample \(x_i\) and feature \(j\), the imputed value is:

\[\hat{x}_{ij} = \frac{\sum_{k \in N_i} w_{ik} x_{kj}}{\sum_{k \in N_i} w_{ik}}\]

where \(N_i\) is the set of K nearest neighbors of \(x_i\), and \(w_{ik}\) is the neighbor weight:

\(w_{ik} = \frac{1}{1 + d(x_i, x_k)}\)
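The weighted average can be worked through on a small example. In this sketch an exact Euclidean search over the observed features stands in for the ANN query, which is an assumption made to keep the example self-contained; impute_row is a hypothetical helper and the training rows are assumed to be complete.

```python
import numpy as np


def impute_row(x, train, n_neighbors=2):
    """Fill the NaN entries of x with a weighted mean over its nearest rows."""
    x = np.asarray(x, dtype=float)
    train = np.asarray(train, dtype=float)
    missing = np.isnan(x)
    observed = ~missing
    # Exact distances on the observed features (stand-in for the ANN index).
    d = np.sqrt(((train[:, observed] - x[observed]) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:n_neighbors]
    # Neighbor weights w_ik = 1 / (1 + d(x_i, x_k)).
    w = 1.0 / (1.0 + d[nearest])
    out = x.copy()
    # Weighted mean of the neighbors' values in each missing feature.
    out[missing] = (w[:, None] * train[nearest][:, missing]).sum(axis=0) / w.sum()
    return out


train = [[1.0, 2.0, 10.0], [2.0, 2.0, 20.0], [9.0, 9.0, 90.0]]
print(impute_row([1.0, 2.0, np.nan], train))
```

Here the two nearest rows lie at distances 0 and 1, giving weights 1 and 0.5, so the missing entry becomes (1 * 10 + 0.5 * 20) / 1.5 = 13.33 (rounded).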

  • ANN provides approximate neighbor search, so imputations are not exact.

  • Annoy uses random projections to split the vector space at each node in the tree, selecting a random hyperplane defined by two sampled points.

  • In index_access='public' or 'private' mode the Annoy index is kept in memory after fit for efficient queries. In index_access='external' mode the index is stored on disk and loaded on demand at transform-time.

  • Index creation is separate from lookup. After calling build(), no additional vectors may be added.

  • Index files created by Annoy are memory-mapped, allowing multiple processes to share the same data without additional memory overhead.

  • Annoy is optimized for scenarios with many items in moderate to high dimensional spaces where fast approximate neighbor retrieval is more important than exact results.

  • Annoy supports specific metrics; 'euclidean' (p=2) and 'manhattan' (p=1) are special cases of the Minkowski distance.

Examples

>>> import numpy as np
>>> from scikitplot.experimental import enable_aknn_imputer
>>> from scikitplot.impute import ANNImputer
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> # imputer = ANNImputer(backend="voyager", n_neighbors=5, metric="euclidean")
>>> imputer = ANNImputer(n_trees=5, n_neighbors=5)
>>> imputer.fit_transform(X)
array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [4. , 6. , 5. ],
       [8. , 8. , 7. ]])
delete_external_index()

Delete the external index file referenced by index_path_.

This helper removes the file on disk if index_path_ is set. It does not modify index_path_ itself, so subsequent calls that rely on the file (for example _get_index_for_runtime in index_access='external' mode) will fail until the estimator is re-fitted.

Any OSError raised by os.remove will propagate to the caller.
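The behaviour described above can be reproduced with a small stand-in. ExternalIndexRef is hypothetical and only tracks a file path; it is a sketch of the documented semantics, not the estimator's code.

```python
import os
import tempfile


class ExternalIndexRef:
    """Hypothetical stand-in that only tracks an index file path."""

    def __init__(self, index_path):
        self.index_path_ = index_path

    def delete_external_index(self):
        # Remove the file but leave index_path_ untouched, as described above.
        if getattr(self, "index_path_", None):
            os.remove(self.index_path_)


fd, path = tempfile.mkstemp(suffix=".annoy")
os.close(fd)
ref = ExternalIndexRef(path)

ref.delete_external_index()
print(os.path.exists(path))       # False: the file is gone
print(ref.index_path_ == path)    # True: the attribute still points at it

try:
    ref.delete_external_index()   # the file no longer exists
except OSError as exc:
    print("propagated:", type(exc).__name__)
```

Since index_path_ keeps pointing at a file that no longer exists, re-fitting is the way to get back to a usable external index.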

fit(X, y=None)

Fit the imputer on X and build the underlying ANN index.

This step:

  • validates the input data,

  • records which features are completely empty,

  • builds the backend-specific ANN index, and

  • fits the missing-value indicator (if enabled).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
Xarray-like of shape (n_samples, n_features)

Input samples.

yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_paramsdict

Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:
X_newndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names_out(input_features=None)

Return output feature names, including indicator features if used.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:
selfestimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

property train_index_

Optionally expose the fitted ANN index (Annoy or Voyager).

This attribute is only available when index_access='public'. For other values, AttributeError is raised by OutsourcedIndexMixin._get_index.

transform(X)

Impute missing values in X using approximate nearest neighbors.