📖 Terminology Reference
Your complete guide to ML & Data Science concepts in scikit-plots
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC

Terminology

This reference organises every machine-learning and data-science term you will encounter when using scikit-plots, from the most elementary ideas (What is a True Positive?) to expert-level subtleties (Macro-averaged AUROC in imbalanced multiclass problems).

Each entry answers four questions concisely:

  • What is it? A plain-English definition.

  • Formula / Key Relationship: the exact mathematical statement.

  • When to use it: the right context.

  • scikit-plots / scikit-learn connection: the API call that produces or consumes this concept.

Use the level tabs below to start at the depth that suits you, then follow cross-references to go deeper.

Note

Terms are grouped by domain, not alphabetically, so that related concepts appear together. Use your browser’s Ctrl + F or the Sphinx search to jump to a specific term.


Discovery at a Glance

Core building blocks every practitioner must know. No formulas required, just intuition.

📋 Confusion Matrix

The 2×2 (or K×K) table that underpins every classification metric. Start here.

Domain 1: Confusion Matrix & Core Metrics

🎯 Precision & Recall

The fundamental trade-off: catching more positives vs. trusting your predictions.

Precision (Positive Predictive Value)

⚖️ F1 Score

The harmonic mean of precision and recall, a single number that balances both.

F1 Score

🏷️ Classification Types

Binary, multiclass, multi-label: which problem are you actually solving?

Binary · Multiclass · Multi-label Classification

📉 Data Imbalance

When one class dominates: oversampling, undersampling, and class weighting.

Domain 4: Class Imbalance & Sampling Strategies

📈 ROC Curve

The performance landscape across every classification threshold at once.

Domain 2: ROC Curve & AUROC

Intermediate concepts for practitioners building and evaluating real models.

📐 Averaging Strategies

Macro vs. Micro vs. Weighted: how single numbers are derived from per-class scores.

Domain 3: Averaging Strategies & Multiclass Metrics

🔢 Multiclass AUROC

Extending the ROC framework from two classes to K classes with OvR and OvO strategies.

Domain 3: Averaging Strategies & Multiclass Metrics

🧪 SMOTE & Sampling

Synthetic minority oversampling, NearMiss, cluster-based strategies: when and how.

Oversampling

📊 Statistical Tests

Bootstrap CIs, Mann-Whitney U, and other tools for comparing models rigorously.

Domain 5: Statistical Foundations

🎛️ Calibration

Does P̂ = 0.8 really mean 80 % likely? Reliability diagrams and calibration curves.

Domain 7: Calibration

⚡ Signal Processing

Subsampling, downsampling, aliasing, and low-pass filtering for time-series and DSP work.

Domain 8: Signal Processing & Time Series

Nuanced topics for senior practitioners, researchers, and contributors.

⚖️ Fairness Metrics

Demographic parity, equal opportunity, equalized odds, predictive parity: choosing the right fairness criterion.

Domain 6: Fairness & Bias Metrics

🔄 OvR vs. OvO

One-vs-Rest and One-vs-One decomposition strategies and their impact on AUROC computation.

One-vs-Rest (OvR) and One-vs-One (OvO)

📉 Gini Coefficient

The relationship between Gini index and AUROC, and when Gini is the preferred reporting metric.

Gini Coefficient (in ML context)

🧬 Bootstrap CIs

Constructing confidence intervals for any metric without parametric assumptions.

Bootstrap Confidence Intervals

🔬 Imbalance + Fairness

When class imbalance interacts with group fairness: the hidden pitfalls.

Domain 6: Fairness & Bias Metrics

📡 Aliasing & Nyquist

Why subsampling without a low-pass filter corrupts signals: the Nyquist-Shannon theorem.

Aliasing & the Nyquist-Shannon Theorem

Domain 1: Confusion Matrix & Core Metrics

The confusion matrix is the single most important data structure in classification evaluation. All threshold-based metrics derive from its four cells.

Confusion Matrix

What is it?

A square table that tallies the agreement and disagreement between a classifier’s predicted labels and the true labels on a held-out dataset. For a binary problem it has four cells:

| Cell | Full Name | Meaning |
| --- | --- | --- |
| TP | True Positive | Predicted positive, actually positive |
| TN | True Negative | Predicted negative, actually negative |
| FP | False Positive (Type I error) | Predicted positive, actually negative |
| FN | False Negative (Type II error) | Predicted negative, actually positive |

For a K-class problem the matrix is K×K: row i, column j counts samples with true label i predicted as label j. The diagonal contains correct predictions.

scikit-plots connection

from sklearn.metrics import confusion_matrix
import scikitplot as skplt

# Plot normalised confusion matrix
skplt.metrics.plot_confusion_matrix(
    y_true, y_pred, normalize=True
)

When to use it

Always. It is the foundation for every derived metric. Inspect the raw counts before trusting any single-number summary.

True Positive · True Negative · False Positive · False Negative

Definitions

These four quantities are the atoms of classification evaluation.

| Symbol | Intuition | Domain example (disease screening) |
| --- | --- | --- |
| TP | Correct positive detection | Test says “sick”, patient is sick ✅ |
| TN | Correct negative detection | Test says “healthy”, patient is healthy ✅ |
| FP | False alarm | Test says “sick”, patient is healthy ❌ |
| FN | Missed detection | Test says “healthy”, patient is sick ❌ |

The cost of FP and FN is domain-specific. In fraud detection, FN (missed fraud) is often far costlier than FP (flagging a legitimate transaction). Always decide which error is worse before selecting a threshold.
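As a minimal sketch of how the four cells are tallied, the following pure-Python function counts them from two label lists. The function name `confusion_counts` and the screening labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # correct positive detection
        elif pred != positive and truth != positive:
            tn += 1          # correct negative detection
        elif pred == positive:
            fp += 1          # false alarm
        else:
            fn += 1          # missed detection
    return tp, tn, fp, fn


# Hypothetical screening results: 1 = sick, 0 = healthy
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Every metric in this domain can be computed from these four integers alone.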

Accuracy

Formula

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

Intuition: the fraction of all predictions that are correct.

When NOT to use it

Accuracy is misleading for imbalanced datasets. A model that predicts the majority class for every sample achieves high accuracy while being completely useless. Prefer F1, balanced accuracy, AUROC, or class-specific metrics when class frequencies differ substantially.
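The failure mode is easy to reproduce. The sketch below assumes a hypothetical dataset with 950 negatives and 50 positives and a "model" that always predicts the majority class, then compares accuracy with balanced accuracy (the mean of per-class recall).

```python
# Hypothetical dataset: 950 negatives (0) and 50 positives (1)
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # always predict the majority class

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Balanced accuracy is the unweighted mean of per-class recall
recall_neg = sum(t == 0 and p == 0 for t, p in pairs) / y_true.count(0)
recall_pos = sum(t == 1 and p == 1 for t, p in pairs) / y_true.count(1)
balanced_accuracy = (recall_neg + recall_pos) / 2

print(f"accuracy={accuracy:.2f}  balanced accuracy={balanced_accuracy:.2f}")
```

Accuracy rewards the model for the 950 easy negatives, while balanced accuracy exposes that no positive was ever detected.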

scikit-learn

from sklearn.metrics import accuracy_score, balanced_accuracy_score

acc    = accuracy_score(y_true, y_pred)
b_acc  = balanced_accuracy_score(y_true, y_pred)  # better for imbalance

Precision (Positive Predictive Value)

Formula

\[\text{Precision} = \frac{TP}{TP + FP}\]

Intuition: “Of everything I labelled positive, what fraction really was positive?” High precision means few false alarms.

Trade-off: raising the classification threshold raises precision but typically lowers recall.

scikit-learn

from sklearn.metrics import precision_score

# binary
p = precision_score(y_true, y_pred)
# multiclass (macro)
p = precision_score(y_true, y_pred, average='macro')

Recall (Sensitivity · True Positive Rate)

Formula

\[\text{Recall} = \frac{TP}{TP + FN}\]

Aliases: Sensitivity, True Positive Rate (TPR), Hit Rate.

Intuition: “Of everything that really was positive, what fraction did I catch?” High recall means few missed detections.

Trade-off: lowering the threshold increases recall but typically reduces precision (more false alarms).

scikit-learn

from sklearn.metrics import recall_score

r = recall_score(y_true, y_pred)
r = recall_score(y_true, y_pred, average='macro')  # multiclass

Specificity (True Negative Rate)

Formula

\[\text{Specificity} = \frac{TN}{TN + FP} = 1 - \text{FPR}\]

Intuition: “Of everything that really was negative, what fraction did I correctly label as negative?” The x-axis of the ROC curve is its complement, FPR = 1 − Specificity.

Note

scikit-learn does not have a standalone specificity_score. Use recall_score(y_true, y_pred, pos_label=0) to compute it for the negative class in a binary problem.

F1 Score

Formula

\[F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}} {\text{Precision} + \text{Recall}}\]

The F1 is the harmonic mean of precision and recall. The harmonic mean penalises extreme imbalances between the two: a model with precision = 1.0 and recall = 0.01 gets F1 ≈ 0.02, whereas the arithmetic mean would be about 0.5.

When to use it: when both false positives and false negatives matter, and you want a single summary number.

When NOT to use it: when true negatives matter (e.g. spam filtering, where correctly delivering legitimate mail is also valuable). F1 ignores TN entirely; use the Matthews Correlation Coefficient (MCC) instead.

scikit-learn

from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='macro')  # multiclass

F-beta Score

Formula

\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}} {(\beta^2 \cdot \text{Precision}) + \text{Recall}}\]

The scalar \(\beta\) controls the relative weight of recall:

  • \(\beta < 1\) → precision-heavy (F0.5 emphasises precision)

  • \(\beta = 1\) → balanced F1

  • \(\beta > 1\) → recall-heavy (F2 emphasises recall)

Use case: information retrieval, and medical screening where missing a positive is far costlier than a false alarm.
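To make the effect of \(\beta\) concrete, here is a small sketch that evaluates the formula directly. The helper `fbeta` and the precision/recall values are hypothetical; the zero-denominator guard is a convention chosen here, not something the formula itself prescribes.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0.0 when the denominator is zero."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical model: precise but with modest recall
precision, recall = 0.9, 0.5

f_half = fbeta(precision, recall, beta=0.5)  # leans towards precision
f_one = fbeta(precision, recall, beta=1.0)   # plain F1
f_two = fbeta(precision, recall, beta=2.0)   # leans towards recall

print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because precision exceeds recall in this example, the score falls as \(\beta\) grows and recall gains weight.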

scikit-learn

from sklearn.metrics import fbeta_score

f2 = fbeta_score(y_true, y_pred, beta=2)

Domain 2: ROC Curve & AUROC

The Receiver Operating Characteristic curve and the Area Under it are the standard threshold-independent evaluation framework for classification models.

ROC Curve

What is it?

A plot of True Positive Rate (Recall) on the y-axis against False Positive Rate (1 − Specificity) on the x-axis, traced by sweeping the classification threshold from 1.0 down to 0.0.

  • Bottom-left corner (0, 0): threshold = 1.0, predict everything negative.

  • Top-right corner (1, 1): threshold = 0.0, predict everything positive.

  • Diagonal: random classifier (AUROC = 0.5).

  • Top-left corner (0, 1): perfect classifier (AUROC = 1.0).

Key insight: the shape of the curve exposes the trade-off across all thresholds simultaneously, so you can pick the right operating point for your application after training.
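The threshold sweep can be sketched in a few lines of plain Python. This is an illustrative implementation with hypothetical names and data, not the routine scikit-plots or scikit-learn uses internally; it treats a sample as predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per distinct threshold, from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0) and ends at (1, 1); each intermediate point is one candidate operating threshold.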

scikit-plots

import scikitplot as skplt

skplt.metrics.plot_roc(
    y_true,
    y_probas,       # shape (n_samples, n_classes)
    plot_macro=True,
    plot_micro=True,
)

AUROC (Area Under the ROC Curve)

Formula

The AUROC equals the probability that the model assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example:

\[\text{AUROC} = P\!\bigl(\hat{s}_{\text{pos}} > \hat{s}_{\text{neg}}\bigr)\]

This interpretation, due to Bamber (1975), makes AUROC a purely rank-based measure: it is invariant to strictly increasing score transformations.
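The probabilistic definition can be checked by brute force over all positive-negative pairs. The sketch below uses hypothetical data and gives tied pairs half credit, a common convention that the formula above leaves implicit.

```python
def auroc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = auroc_pairwise(y_true, y_score)
# Cubing is strictly increasing on these scores, so the ranking is unchanged
auc_cubed = auroc_pairwise(y_true, [s ** 3 for s in y_score])
print(auc, auc_cubed)
```

The quadratic loop is only practical for small samples, but it makes the rank-based nature of the metric explicit.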

Interpretation scale

| AUROC | Interpretation |
| --- | --- |
| 0.50 | Random guessing, no discriminative power |
| 0.70 – 0.80 | Acceptable discrimination |
| 0.80 – 0.90 | Good discrimination |
| 0.90 – 1.00 | Excellent / near-perfect discrimination |
| 1.00 | Perfect ranking (no real problem is this clean) |

Limitation: AUROC can be optimistic on highly imbalanced datasets because the FPR denominator contains the large pool of true negatives, so even many false positives move the curve only slightly. Consider the PR-AUC in that scenario.

scikit-learn

from sklearn.metrics import roc_auc_score

# binary
auc = roc_auc_score(y_true, y_score)
# multiclass (macro OvR)
auc = roc_auc_score(
    y_true, y_score_matrix,
    multi_class='ovr', average='macro'
)

Precision-Recall Curve & PR-AUC

What is it?

A plot of Precision on the y-axis against Recall on the x-axis as the threshold varies. The Average Precision (AP), or the area under this curve, summarises performance.

When to prefer PR-AUC over AUROC

In severely imbalanced datasets (e.g., rare-event detection, fraud, medical screening where positives are < 5 % of data), the PR curve exposes model weaknesses that the ROC curve can hide, because the ROC curve’s x-axis (FPR) is diluted by the enormous TN pool.
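A short worked calculation shows the dilution effect. The class counts and the operating point below are hypothetical numbers picked for illustration.

```python
# Hypothetical test set with roughly 1 % positives
n_pos, n_neg = 100, 10_000

# Hypothetical operating point that looks strong on the ROC curve
tpr, fpr = 0.90, 0.05

tp = tpr * n_pos            # 90 positives caught
fp = fpr * n_neg            # 500 negatives flagged by mistake
precision = tp / (tp + fp)  # what the PR curve reports at this point

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.3f}")
```

A 5 % false positive rate sounds small, yet it produces more than five false alarms for every true detection, which only the precision axis reveals.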

scikit-plots

skplt.metrics.plot_precision_recall(
    y_true, y_probas, plot_micro=True
)

Gini Coefficient (in ML context)

Relationship to AUROC

In machine learning (particularly credit scoring and finance), the Gini coefficient is defined as:

\[\text{Gini} = 2 \cdot \text{AUROC} - 1\]

It maps the AUROC from the range [0.5, 1.0] to the range [0.0, 1.0]:

| AUROC | Gini | Interpretation |
| --- | --- | --- |
| 0.50 | 0.00 | Random |
| 0.75 | 0.50 | Good |
| 1.00 | 1.00 | Perfect |

When you will see it: credit risk models, insurance, and any domain that adopted the Gini metric before AUROC became standard. The two metrics carry equivalent information.


Domain 3: Averaging Strategies & Multiclass Metrics

When K > 2, every binary metric must be extended to the multi-class case. The choice of averaging strategy changes the answer.

Binary · Multiclass · Multi-label Classification

Binary

Definition: exactly two mutually exclusive classes, positive (1) and negative (0).

Output: a single probability score per sample, or a binary label.

Example: spam vs. not-spam; disease positive vs. negative.

Multiclass

Definition: K > 2 mutually exclusive classes. Each sample belongs to exactly one class.

Output: a probability vector of length K; the argmax is the predicted class.

Example: MNIST digit recognition (0–9); flower species classification (Iris dataset).

Also called: single-label classification, multi-class classification.

Multi-label

Definition: each sample can belong to multiple classes simultaneously (classes are not mutually exclusive).

Output: a binary indicator vector of length K per sample.

Example: image tagging (a photo can be “dog”, “outdoor”, “daytime” simultaneously); document topic labelling.

Key difference: the per-class AUROC and F1 are computed independently for each label.

Macro Averaging

Definition

Compute the metric separately for each class, then take the unweighted (simple) mean across all K classes:

\[\text{Metric}_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} \text{Metric}_i\]

Properties

  • Each class contributes equally, regardless of how many samples it has.

  • Sensitive to performance on minority classes: poor discrimination on a rare class lowers the macro score noticeably.

  • Best choice when you care about per-class fairness.

Example (Macro F1)

from sklearn.metrics import f1_score

f1_macro = f1_score(y_true, y_pred, average='macro')

Micro Averaging

Definition

Aggregate the TP, TN, FP, FN counts across all classes into global totals, then compute the metric once from those totals:

\[\text{Precision}_{\text{micro}} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}\]

Properties

  • Dominated by majority classes because large classes contribute more counts.

  • Equivalent to accuracy for multiclass single-label problems.

  • Best choice when overall sample-level performance matters and class size differences are acceptable.

Example (Micro F1)

from sklearn.metrics import f1_score

f1_micro = f1_score(y_true, y_pred, average='micro')

Weighted Averaging

Definition

Compute the metric per class, then take a weighted mean where each class’s weight equals its proportion of samples (support):

\[\text{Metric}_{\text{weighted}} = \sum_{i=1}^{K} w_i \cdot \text{Metric}_i, \quad w_i = \frac{n_i}{\sum_j n_j}\]

Properties

  • Weights each class by its support, so the score tracks the class distribution while still being built from per-class metrics.

  • Reported next to the macro average in scikit-learn's classification_report.

  • Can still mask poor minority-class performance if the majority class is large enough.
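To see how the three averaging strategies diverge on the same predictions, the sketch below computes macro, micro, and weighted precision from per-class counts. The counts are hypothetical and chosen to be consistent with a single-label 3-class problem (total FP equals total FN).

```python
# Hypothetical per-class counts: class -> (TP, FP, FN)
counts = {"A": (90, 10, 10), "B": (40, 15, 10), "C": (2, 3, 8)}


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


per_class = {c: precision(tp, fp) for c, (tp, fp, fn) in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}  # true samples per class
total = sum(support.values())

# Macro: unweighted mean of per-class scores
macro = sum(per_class.values()) / len(per_class)

# Micro: pool the counts first, then compute the metric once
micro = sum(tp for tp, _, _ in counts.values()) / sum(
    tp + fp for tp, fp, _ in counts.values()
)

# Weighted: per-class scores weighted by class support
weighted = sum(per_class[c] * support[c] / total for c in counts)

print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
```

The rare class C has precision 0.4, which pulls the macro score down sharply while barely moving the weighted and micro scores.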

Example

from sklearn.metrics import f1_score

f1_weighted = f1_score(y_true, y_pred, average='weighted')

Macro AUROC (Macro-Averaged AUROC)

Definition

Extend binary AUROC to K classes using the One-vs-Rest (OvR) strategy, then average:

\[\text{AUROC}_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} \text{AUROC}(\text{class}_i \text{ vs. rest})\]

Example computation

For a 3-class problem (A, B, C):

| Binary AUROC | Value |
| --- | --- |
| AUROC(A vs. B+C) | 0.85 |
| AUROC(B vs. A+C) | 0.72 |
| AUROC(C vs. A+B) | 0.65 |

\[\text{AUROC}_{\text{macro}} = \frac{0.85 + 0.72 + 0.65}{3} = 0.74\]

If class C is rare and performs poorly, Macro AUROC reflects this because every class has equal weight.

scikit-learn

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(
    y_true, y_probas, multi_class='ovr', average='macro'
)

Micro AUROC

Definition

Flatten all one-vs-rest binary predictions into a single long vector of true labels and scores, then compute a single AUROC:

\[\text{AUROC}_{\text{micro}} = \text{AUROC}\!\left(\, \bigoplus_{i=1}^{K} y_i^{(\text{bin})},\; \bigoplus_{i=1}^{K} \hat{s}_i \right)\]

Properties

  • Heavily influenced by majority classes (more samples → more weight).

  • Provides an overall view of ranking quality across all decisions.

  • Can look good even when rare classes are poorly ranked.

Macro vs. Micro: when to use which

| Situation | Recommended |
| --- | --- |
| Classes should be treated equally | Macro AUROC |
| Overall sample-level ranking matters | Micro AUROC |
| Imbalanced and every class matters | Macro AUROC (reveals minority weaknesses) |
| Weighting by class size is acceptable | Weighted AUROC |

One-vs-Rest (OvR) and One-vs-One (OvO)

One-vs-Rest (OvR), also called One-vs-All (OvA)

For K classes, train K binary classifiers. Classifier i treats class i as the positive class and all others as the negative class.

| Property | OvR |
| --- | --- |
| Number of classifiers | K |
| Training set size per classifier | Full dataset (imbalanced: 1 positive class vs. K-1 negatives) |
| Prediction | Argmax of K confidence scores |
| AUROC computation | Average of K binary AUROCs |

One-vs-One (OvO)

Train one binary classifier for every pair of classes.

| Property | OvO |
| --- | --- |
| Number of classifiers | K(K−1)/2 |
| Training set size per classifier | Only the two relevant classes (balanced) |
| Prediction | Majority vote over all pairwise classifiers |
| AUROC computation | Average of all pairwise AUROCs (Hand & Till, 2001) |
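As a quick illustration of how the two decompositions scale, the sketch below enumerates the binary sub-problems each strategy creates for a hypothetical set of four class labels.

```python
from itertools import combinations

classes = ["A", "B", "C", "D"]  # hypothetical class labels, K = 4

# OvR: one task per class, that class against all the others
ovr_tasks = [(c, [other for other in classes if other != c]) for c in classes]

# OvO: one task per unordered pair of classes
ovo_tasks = list(combinations(classes, 2))

print(len(ovr_tasks), "OvR tasks")  # K
print(len(ovo_tasks), "OvO tasks")  # K(K-1)/2
```

OvO grows quadratically with K, but each of its tasks only sees the samples of two classes.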

scikit-learn

from sklearn.metrics import roc_auc_score

# OvR macro
auc_ovr = roc_auc_score(
    y_true, y_probas, multi_class='ovr', average='macro'
)
# OvO macro (Hand & Till)
auc_ovo = roc_auc_score(
    y_true, y_probas, multi_class='ovo', average='macro'
)

Domain 4: Class Imbalance & Sampling Strategies

Most real-world classification datasets are imbalanced. The severity ranges from mildly unequal class frequencies (1:2) to extreme imbalance (fraud: 1:10 000). The response strategies fall into three groups: re-weighting, oversampling, and undersampling.

Class Imbalance: Overview

Definition

A dataset is imbalanced when the class frequencies differ substantially, typically taken as a minority : majority ratio more extreme than 1:5.

Why it matters for metrics

A classifier that predicts the majority class for every sample achieves misleadingly high accuracy. Metrics dominated by the majority class (accuracy, micro- or weighted-averaged F1) can therefore be poor guides.

Summary of strategies

| Strategy | Mechanism | Best when |
| --- | --- | --- |
| Class weighting | Penalise minority-class errors more | Small to moderate imbalance |
| Random oversampling | Duplicate minority samples | Quick baseline |
| SMOTE | Synthesise minority samples | Feature-space interpolation is valid |
| Random undersampling | Remove majority samples | Very large majority class |
| NearMiss | Keep majority samples closest to minority | Hard-boundary learning |
| Cluster-based undersampling | Keep one majority representative per cluster | Structured majority class |

Class Weighting

Mechanism

Assign each sample a loss weight inversely proportional to its class frequency, so that minority-class errors are penalised more heavily during training:

\[w_i = \frac{n_{\text{total}}}{K \cdot n_i}\]

where \(n_i\) is the count of class i.
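The weight formula can be evaluated by hand. The sketch below applies it to a hypothetical 90:10 label vector using only the standard library.

```python
from collections import Counter

# Hypothetical training labels: 90 majority (0) and 10 minority (1)
y_train = [0] * 90 + [1] * 10

counts = Counter(y_train)
n_total = len(y_train)
n_classes = len(counts)

# w_i = n_total / (K * n_i)
weights = {label: n_total / (n_classes * n) for label, n in counts.items()}
print(weights)

# Each class contributes the same total weight after re-weighting
class_mass = {label: weights[label] * n for label, n in counts.items()}
print(class_mass)
```

With these weights a single minority error costs as much as nine majority errors, and both classes carry equal total weight in the loss.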

Advantages

  • No data is discarded or synthesised; the original samples are used as they are.

  • Straightforward: most scikit-learn estimators accept class_weight='balanced'.

Limitations: it only re-weights the training loss and adds no new information about the minority class, so a very small minority class can still be learned poorly.

scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

weights = compute_class_weight(
    'balanced', classes=np.unique(y_train), y=y_train
)
model = LogisticRegression(class_weight='balanced')

Oversampling

Mechanism

Increase the size of the minority class by generating additional samples, either by random duplication or by synthetic generation (SMOTE).

Random oversampling: duplicate existing minority samples with replacement until the desired ratio is reached. Risk: overfitting to duplicated points.

When to use: when the minority class is too small to learn meaningful boundaries. Apply oversampling only to the training set, never to validation or test sets.

SMOTE (Synthetic Minority Over-sampling Technique)

Definition

SMOTE generates synthetic minority-class samples by interpolating in feature space between an existing minority sample and one of its k nearest minority neighbours:

\[x_{\text{new}} = x_i + \lambda \cdot (x_{\tilde{\text{nn}}} - x_i), \quad \lambda \sim \mathcal{U}(0, 1)\]

Properties

  • Creates genuinely new points (not copies) → less overfitting than random oversampling.

  • Can create noisy samples in overlapping class regions.

  • Assumes that interpolation in feature space is meaningful (invalid for categorical features without encoding).
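The interpolation step can be sketched without any library. This is a simplified illustration of the formula above, not the imbalanced-learn implementation; the function name `smote_sample`, the toy points, and the choice of Euclidean distance are assumptions.

```python
import math
import random


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point by interpolating towards a near neighbour."""
    x_i = rng.choice(minority)
    # k nearest minority neighbours of x_i, excluding x_i itself
    others = [x for x in minority if x is not x_i]
    neighbours = sorted(others, key=lambda x: math.dist(x, x_i))[:k]
    x_nn = rng.choice(neighbours)
    lam = rng.random()  # lambda drawn uniformly from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]


rng = random.Random(42)
# Hypothetical minority-class feature vectors (2 features)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]

synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority samples, which is why the method assumes such interpolated positions are meaningful.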

Library

from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

Note

imbalanced-learn (imblearn) is the standard library for SMOTE and related techniques. It is not part of scikit-learn but shares the same API.

Undersampling Strategies

Undersampling reduces the majority class to match the minority.

Random undersampling

Mechanism: randomly remove majority-class samples.

Risk: discards potentially useful information from the majority class. Use stratified splits to preserve class proportions in validation.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

NearMiss

Mechanism: keep majority-class samples that are closest to minority-class samples (distance-based selection).

  • NearMiss-1: select majority samples whose average distance to the nearest minority neighbours is smallest.

  • NearMiss-2: select majority samples whose average distance to the farthest minority neighbours is smallest.

  • NearMiss-3: for each minority sample, keep its M nearest majority neighbours.

When to use: when you want the majority class to concentrate near the decision boundary (hard-margin learning).

from imblearn.under_sampling import NearMiss

nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X_train, y_train)

Cluster-based undersampling

Mechanism: cluster the majority class with k-means (or another algorithm), then retain one representative per cluster (typically the centroid or the sample closest to it).

Advantages

  • Preserves the structure of the majority class.

  • Less information loss than random removal.

  • Works well when the majority class has natural sub-groups.

Subsampling

Dual meaning: “subsampling” appears in two completely different contexts.

In machine learning: selecting a random subset of the dataset for training or evaluation. This is analogous to random undersampling, but is often applied for computational efficiency rather than class balancing.

from sklearn.utils import resample

# replace=False draws a true subset; the default (replace=True) would allow duplicates
X_sub, y_sub = resample(
    X, y, replace=False, n_samples=10_000, random_state=42
)

In signal processing: reducing the sampling rate of a discrete signal. Also called decimation or downsampling.

Critical rule: always apply a low-pass filter before subsampling to prevent aliasing (see Aliasing & the Nyquist-Shannon Theorem).

import scipy.signal as sps

# Decimate signal by factor 4 (includes anti-aliasing filter)
signal_down = sps.decimate(signal, q=4)

Domain 5: Statistical Foundations

Probability & Probability Distributions

Probability: a number in [0, 1] expressing the likelihood of an event. A model’s output score is a probability estimate, though not necessarily a calibrated one (see Domain 7: Calibration).

Key distributions in ML

| Distribution | Role in ML |
| --- | --- |
| Bernoulli | Binary label; output of a binary classifier |
| Categorical | Multiclass label; output of a softmax classifier |
| Gaussian (Normal) | Assumption in linear discriminant analysis and Gaussian processes |
| Beta | Prior / posterior for probabilities (Bayesian) |
| Dirichlet | Prior over class probability vectors (Bayesian multiclass) |

Bootstrap Confidence Intervals

What is it?

A non-parametric resampling method to estimate the sampling distribution, and hence confidence intervals, of any statistic.

Algorithm

  1. Draw B bootstrap samples of size n with replacement from the original dataset.

  2. Compute the statistic (e.g., AUROC, F1) on each bootstrap sample.

  3. The 95 % CI is the (2.5th, 97.5th) percentile of the B computed statistics.

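Why it matters for scikit-plots: every metric visualised by scikit-plots can be accompanied by a bootstrap CI to quantify uncertainty.

import numpy as np
from sklearn.metrics import roc_auc_score

# Index with arrays so that fancy indexing works on plain lists too
y_true, y_score = np.asarray(y_true), np.asarray(y_score)
rng = np.random.default_rng(42)
B = 1000
aucs = []
for _ in range(B):
    # One bootstrap replicate: resample indices with replacement
    idx = rng.integers(len(y_true), size=len(y_true))
    if np.unique(y_true[idx]).size < 2:
        continue  # AUROC is undefined when only one class is drawn
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")

Mann-Whitney U Test (Wilcoxon Rank-Sum Test)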

What is it?

A non-parametric test for whether two independent samples were drawn from the same distribution. It makes no assumptions about the underlying distribution (no normality required).

Relationship to AUROC: the Mann-Whitney U statistic, divided by the number of positive-negative pairs, equals the AUROC:

\[\text{AUROC} = \frac{U}{n_{\text{pos}} \cdot n_{\text{neg}}}\]

where \(U\) is the Mann-Whitney statistic. This confirms the AUROC’s rank-statistic interpretation.
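The identity can be verified from ranks alone. The sketch below computes U for the positive group with average ranks for ties, then divides by the number of pairs; the function name and the scores are hypothetical.

```python
def mann_whitney_u(pos_scores, neg_scores):
    """U statistic of the positive group, using average ranks for tied values."""
    combined = sorted(pos_scores + neg_scores)
    rank = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Positions i..j-1 hold equal values; their 1-based ranks average to (i+1+j)/2
        rank[combined[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_pos = sum(rank[s] for s in pos_scores)
    n_pos = len(pos_scores)
    return rank_sum_pos - n_pos * (n_pos + 1) / 2


# Hypothetical scores, including one tie across groups (0.35)
pos = [0.35, 0.8, 0.6]
neg = [0.1, 0.4, 0.35]

u = mann_whitney_u(pos, neg)
auroc = u / (len(pos) * len(neg))
print(f"U={u}  AUROC={auroc:.4f}")
```

Counting the nine pairs by hand gives 7 wins and one tie, that is 7.5 out of 9, which matches the rank-based result.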

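Use in model comparison: the call below tests a single model, namely whether its positive scores rank above its negative scores (AUROC > 0.5). To test whether model A’s AUROC is significantly greater than model B’s, use a permutation test or DeLong’s method.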

from scipy.stats import mannwhitneyu

stat, p_value = mannwhitneyu(
    scores_positive, scores_negative, alternative='greater'
)

Domain 6: Fairness & Bias Metrics

Fairness metrics quantify whether a model treats different demographic groups equitably. No single definition of fairness is universally correct; the appropriate criterion depends on the application and its societal context.

Demographic Parity (Statistical Parity)

Definition

The positive prediction rate is equal across demographic groups (here, groups A and B):

\[P(\hat{Y}=1 \mid A) = P(\hat{Y}=1 \mid B)\]

Interpretation: the selection rate (e.g., loan approval, job interview, parole) is independent of group membership.

Limitation: if the base rate of positives differs between groups (due to legitimate factors), forcing equal selection rates may require ignoring relevant features.

When to apply: allocation decisions where equal access is the primary concern (e.g., advertising, content recommendation).

Equal Opportunity

Definition

The True Positive Rate (Recall) is equal across groups:

\[P(\hat{Y}=1 \mid Y=1, A) = P(\hat{Y}=1 \mid Y=1, B)\]

Interpretation: among truly qualified/positive individuals, the model identifies them at the same rate regardless of group.

Use case: hiring, academic admission, and loan approval, where it is critical that deserving applicants are equally detected across groups.

Equalized Odds

Definition

Both the True Positive Rate and the False Positive Rate are equal across groups:

\[P(\hat{Y}=1 \mid Y=y, A) = P(\hat{Y}=1 \mid Y=y, B), \quad \text{for } y \in \{0, 1\}\]

Interpretation: a stronger requirement than equal opportunity, because the model must treat both positive and negative individuals equally across groups.

Trade-off: demographic parity, equalized odds, and predictive parity are mathematically incompatible in general (except in degenerate cases such as equal base rates or a perfect classifier). Equal opportunity is a relaxation of equalized odds that constrains only the TPR. You must choose which criterion fits the application.

Predictive Parity (Calibration Fairness)

Definition

The Positive Predictive Value (Precision) is equal across groups:

\[P(Y=1 \mid \hat{Y}=1, A) = P(Y=1 \mid \hat{Y}=1, B)\]

Interpretation: when the model predicts “positive” for a member of either group, the prediction is equally trustworthy.

Use case: risk scoring tools (recidivism, credit risk) where the model score is used as a probability estimate and must be equally reliable for all groups.

Note

Predictive parity and equalized odds cannot both be satisfied simultaneously unless base rates are equal across groups or the classifier is perfect (the Chouldechova impossibility result, 2017).
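The tension between the criteria shows up even on toy data. The sketch below computes the four group-level rates for two hypothetical groups whose base rates differ (50 % vs. 25 % positives); the helper `group_rates` and all labels are made up for illustration.

```python
def group_rates(y_true, y_pred):
    """Selection rate, TPR, FPR, and PPV for one group."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return {
        "selection_rate": (tp + fp) / len(pairs),  # demographic parity
        "tpr": tp / (tp + fn),                     # equal opportunity
        "fpr": fp / (fp + tn),                     # with TPR: equalized odds
        "ppv": tp / (tp + fp),                     # predictive parity
    }


# Hypothetical group A: 5 positives, 5 negatives
true_a = [1] * 5 + [0] * 5
pred_a = [1, 1, 1, 1, 0] + [1, 0, 0, 0, 0]

# Hypothetical group B: 5 positives, 15 negatives
true_b = [1] * 5 + [0] * 15
pred_b = [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 12

rates = {"A": group_rates(true_a, pred_a), "B": group_rates(true_b, pred_b)}
for group, values in rates.items():
    print(group, {name: round(v, 3) for name, v in values.items()})
```

Both groups share the same TPR and FPR, so equalized odds holds, yet the selection rate and the PPV differ because the base rates differ.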


Domain 7: Calibration

A model is calibrated if its output probabilities match empirical event frequencies. Calibration is independent of discrimination (AUROC): a model can rank perfectly but be poorly calibrated, or be well calibrated but with low AUROC.

Calibration & Reliability Diagrams

What is it?

A reliability diagram (calibration curve) plots the mean predicted probability (x-axis) against the observed event rate (y-axis) for bins of predictions. A perfectly calibrated model lies on the diagonal y = x.

  • Over-confident: predicted probabilities are too high (curve below the diagonal).

  • Under-confident: predicted probabilities are too low (curve above the diagonal).

Brier Score

A calibration-sensitive proper scoring rule:

\[\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2\]

Lower is better; 0 = perfect; 0.25 = the score of a constant prediction of 0.5, the uninformed baseline for a balanced binary problem.
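The Brier score is a one-line computation. The sketch below evaluates it on hypothetical labels for a fairly confident model and for a constant prediction of 0.5.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes and predicted probabilities
y_true = [1, 0, 1, 0]

confident = brier_score(y_true, [0.9, 0.1, 0.8, 0.2])
uninformed = brier_score(y_true, [0.5] * len(y_true))

print(f"confident={confident:.3f}  uninformed={uninformed:.3f}")
```

Any model worth deploying should score below the constant-0.5 baseline.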

scikit-plots

skplt.metrics.plot_calibration_curve(
    y_true,
    [y_prob_model_1, y_prob_model_2],
    clf_names=['Model 1', 'Model 2'],
    n_bins=10,
)

Post-hoc calibration

from sklearn.calibration import CalibratedClassifierCV

cal = CalibratedClassifierCV(base_clf, method='isotonic', cv=5)
cal.fit(X_train, y_train)
cal_probs = cal.predict_proba(X_test)

Domain 8: Signal Processing & Time Series

Relevant when scikit-plots is used to evaluate models applied to sequential or temporal data.

Time Series

Definition

A sequence of observations ordered in time, \(\{x_t\}_{t=1}^{T}\). Key properties:

  • Stationarity: statistical properties (mean, variance) do not change over time.

  • Autocorrelation: observations at time t are correlated with observations at \(t - k\) (lag k).

  • Seasonality: periodic patterns (daily, weekly, yearly).

Evaluation difference from i.i.d. data

Evaluating ML models on time series requires temporal cross-validation (walk-forward validation), not random K-fold, to avoid data leakage from the future into the training set.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]

Signal Processing

Definition

The analysis, synthesis, and transformation of signals, including audio, sensor data, EEG, accelerometers, and financial tick data.

Key concepts for ML practitioners

  • Sampling rate \(f_s\): how many samples per second.

  • Nyquist frequency \(f_N = f_s / 2\): the highest frequency representable without aliasing.

  • Frequency domain: the Fourier transform converts a time-domain signal to its frequency components.

Low-pass Filtering

Definition

A filter that attenuates frequencies above a cut-off frequency \(f_c\) and passes frequencies below it.

Why it is required before subsampling

If you reduce the sampling rate by factor d without filtering, frequency components above \(f_s / (2d)\) fold back into the representable range. This is aliasing, and it causes irreversible distortion.

scipy

import scipy.signal as sps

# Design a Butterworth low-pass filter
b, a = sps.butter(N=4, Wn=0.25, btype='low')
filtered = sps.filtfilt(b, a, signal)

# Decimate (apply anti-alias filter + downsample)
downsampled = sps.decimate(signal, q=4)

Aliasing & the Nyquist-Shannon Theorem

Nyquist-Shannon Sampling Theorem

A continuous signal that has no frequency component above \(f_{\max}\) can be perfectly reconstructed from its samples if the sampling rate satisfies:

\[f_s > 2 f_{\max}\]

Aliasing: when \(f_s < 2 f_{\max}\), high-frequency components “fold” into lower-frequency aliases, creating distortion that cannot be undone post-hoc.
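Folding can be demonstrated with two sine waves. In the sketch below, a hypothetical 9 Hz tone sampled at 10 Hz (Nyquist frequency 5 Hz) produces exactly the same samples as a sign-flipped 1 Hz tone.

```python
import math

fs = 10.0              # hypothetical sampling rate in Hz
f_high = 9.0           # tone above the Nyquist frequency fs / 2 = 5 Hz
f_alias = fs - f_high  # the 1 Hz alias it folds onto

n_samples = 20
high = [math.sin(2 * math.pi * f_high * n / fs) for n in range(n_samples)]
# sin(2*pi*0.9*n) = -sin(2*pi*0.1*n) for integer n, hence the sign flip
alias = [-math.sin(2 * math.pi * f_alias * n / fs) for n in range(n_samples)]

max_diff = max(abs(a - b) for a, b in zip(high, alias))
print(f"largest sample difference: {max_diff:.2e}")
```

Once sampled, the two tones are indistinguishable, which is why the low-pass filter has to be applied before the rate is reduced.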

Practical rule

Before any downsampling by factor d, apply a low-pass filter with cut-off \(f_c \leq f_s / (2d)\). The scipy.signal.decimate function does this automatically.


Quick Reference: Metric Selector

Use this table to choose the right metric for your problem:

| Situation | Avoid | Use Instead | If Multiclass | If Imbalanced |
| --- | --- | --- | --- | --- |
| Balanced binary classification | (none) | F1, AUROC | Macro F1, OvR AUROC | Macro F1 |
| Severely imbalanced binary | Accuracy | PR-AUC, F1 | Macro F1 | PR-AUC |
| All classes equally important | Micro avg. | Macro avg. | Macro AUROC | Macro AUROC |
| Overall sample-level performance | Macro avg. | Micro avg. | Micro F1 | Check macro too |
| Probability quality (not just ranking) | AUROC alone | AUROC + Brier Score | Calibration curve | Brier Score |
| Fairness audit required | Global accuracy | Group-level TPR/FPR | Equal Opportunity | Equalized Odds |


