Impute#

This module provides tools for imputing missing values in datasets.

AnnoyKNNImputer#

This section documents AnnoyKNNImputer, an imputer based on approximate nearest neighbors.

TL;DR#

Purpose: Approximate nearest-neighbors-based imputation
Import path: from scikitplot.impute import AnnoyKNNImputer
Functionality: Replaces missing values using neighbors retrieved via Annoy
Parameters: n_neighbors, n_trees, metric, optional search_k, etc.

Overview#

AnnoyKNNImputer (from scikitplot.impute) is an approximate nearest-neighbors imputer that uses Annoy to fill missing values in datasets. For each sample with missing values, it queries that sample's nearest neighbors and computes the imputed values from them.

Motivation#

Unlike exact KNN imputation, using Annoy allows:

Faster neighbor retrieval in high-dimensional data
Memory-efficient indexing of large datasets
Sharing of prebuilt indexes across processes

Example: imputing a small NumPy array with missing values:

import numpy as np
# Imported for its side effect of enabling the experimental imputer
from scikitplot.experimental import enable_annoyknn_imputer  # noqa: F401
from scikitplot.impute import AnnoyKNNImputer

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

imputer = AnnoyKNNImputer(n_trees=5, n_neighbors=5)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# Output:
# [[1. 2. 5.]
#  [3. 4. 3.]
#  [4. 6. 5.]
#  [8. 8. 7.]]

Mechanism#

Builds an Annoy index from complete samples
Queries nearest neighbors for incomplete samples
Imputes missing values based on neighbor vectors
Items are identified internally by integer IDs; Annoy allocates memory for max(id)+1 items
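
The first three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: an exact brute-force search stands in for the Annoy index, the distance is Euclidean over the observed columns, and the imputed value is the unweighted mean of the neighbors. The function name impute_with_neighbors is hypothetical.

```python
import math


def impute_with_neighbors(rows, n_neighbors=2):
    """Fill NaNs using the mean of the nearest complete rows (brute force)."""
    # Step 1: the "index" here is just the list of rows with no missing values.
    complete = [r for r in rows if not any(math.isnan(v) for v in r)]
    if not complete:
        raise ValueError("need at least one complete row to impute from")

    result = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            result.append(list(row))
            continue
        observed = [j for j in range(len(row)) if j not in missing]

        # Step 2: rank complete rows by distance over the observed columns.
        def distance(candidate):
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=distance)[:n_neighbors]

        # Step 3: replace each missing entry with the neighbors' mean.
        filled = list(row)
        for j in missing:
            filled[j] = sum(nb[j] for nb in neighbors) / len(neighbors)
        result.append(filled)
    return result


nan = float("nan")
X = [[1.0, 2.0, nan],
     [3.0, 4.0, 3.0],
     [nan, 6.0, 5.0],
     [8.0, 8.0, 7.0]]
X_filled = impute_with_neighbors(X, n_neighbors=2)
```

Because this sketch makes its own choices about donors, distance, and averaging, its numbers are specific to the sketch and are not expected to reproduce the output of AnnoyKNNImputer.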

Notes#

Memory-efficient and fast for large datasets
Neighbors are approximate, so results may differ slightly from exact KNN imputation
Shares indexes across processes using mmap
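Accuracy and speed depend on the n_trees and search_k parameters; higher values generally improve accuracy at the cost of a larger index or slower queries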
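
The general idea behind mmap-based sharing can be illustrated with Python's standard library alone. The sketch below does not use Annoy or its file format: it assumes a hypothetical flat file of float64 vectors (vectors.bin) and a hypothetical helper read_vector, and shows how a reader can map the file read-only and fetch one vector by integer ID without loading the whole file. Several processes mapping the same file this way can share the underlying pages.

```python
import mmap
import os
import struct
import tempfile

# Write a small table of float64 vectors to disk (a stand-in for a saved index).
vectors = [(3.0, 4.0, 3.0), (8.0, 8.0, 7.0)]
dim = len(vectors[0])
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as fh:
    for vec in vectors:
        fh.write(struct.pack(f"<{dim}d", *vec))


def read_vector(path, item_id, dim):
    """Read one vector by integer ID through a read-only memory map."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = item_id * dim * 8  # 8 bytes per float64
            return struct.unpack_from(f"<{dim}d", mm, offset)


second = read_vector(path, 1, dim)
```

Each call maps the file rather than copying it into the process, which is why a prebuilt, file-backed index is cheap to open from many workers.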

Comparison#

Similar in usage to sklearn.impute.KNNImputer, but faster on large, high-dimensional datasets
Provides a trade-off between accuracy and speed via Annoy parameters