Your complete guide to ML & Data Science concepts in scikit-plots
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC
Terminology#
This reference organises every machine-learning and data-science term you will encounter when using scikit-plots, from the most elementary ideas (What is a True Positive?) to expert-level subtleties (Macro-averaged AUROC in imbalanced multiclass problems).
Each entry answers four questions concisely:
What is it? A plain-English definition.
Formula / Key Relationship: the exact mathematical statement.
When to use it: the right context.
scikit-plots / scikit-learn connection: the API call that produces or consumes this concept.
Use the level tabs below to start at the depth that suits you, then follow cross-references to go deeper.
Note
Terms are grouped by domain, not alphabetically, so that related concepts appear together. Use your browser's Ctrl + F or the Sphinx search to jump to a specific term.
Discovery at a Glance#
Core building blocks every practitioner must know. No formulas required, just intuition.
The 2×2 (or K×K) table that underpins every classification metric. Start here.
The fundamental trade-off: catching more positives vs. trusting your predictions.
The harmonic mean of precision and recall: a single number that balances both.
Binary, multiclass, multi-label: which problem are you actually solving?
When one class dominates: oversampling, undersampling, and class weighting.
The performance landscape across every classification threshold at once.
Intermediate concepts for practitioners building and evaluating real models.
Macro vs. Micro vs. Weighted: how single numbers are derived from per-class scores.
Extending the ROC framework from two classes to K classes with OvR and OvO strategies.
Synthetic minority oversampling, NearMiss, cluster-based strategies: when and how.
Bootstrap CIs, Mann-Whitney U, and other tools for comparing models rigorously.
Does \(\hat{P} = 0.8\) really mean 80 % likely? Reliability diagrams and calibration curves.
Subsampling, downsampling, aliasing, and low-pass filtering for time-series and DSP work.
Nuanced topics for senior practitioners, researchers, and contributors.
Demographic parity, equal opportunity, equalized odds, predictive parity: choosing the right fairness criterion.
One-vs-Rest and One-vs-One decomposition strategies and their impact on AUROC computation.
The relationship between Gini index and AUROC, and when Gini is the preferred reporting metric.
Constructing confidence intervals for any metric without parametric assumptions.
When class imbalance interacts with group fairness: the hidden pitfalls.
Why subsampling without a low-pass filter corrupts signals: the Nyquist-Shannon theorem.
Domain 1: Confusion Matrix & Core Metrics#
The confusion matrix is the single most important data structure in classification evaluation. All threshold-based metrics derive from its four cells.
Confusion Matrix#
What is it?
A square table that tallies the agreement and disagreement between a classifier's predicted labels and the true labels on a held-out dataset. For a binary problem it has four cells:
| Cell | Full Name | Meaning |
|---|---|---|
| TP | True Positive | Predicted positive, actually positive |
| TN | True Negative | Predicted negative, actually negative |
| FP | False Positive (Type I error) | Predicted positive, actually negative |
| FN | False Negative (Type II error) | Predicted negative, actually positive |
For a K-class problem the matrix is K×K: row i, column j counts samples with true label i predicted as label j. The diagonal contains correct predictions.
scikit-plots connection
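from sklearn.metrics import confusion_matrix
import scikitplot as skplt

# Raw counts: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

# Plot the normalised confusion matrix
skplt.metrics.plot_confusion_matrix(
    y_true, y_pred, normalize=True
)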
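As a minimal sketch of how the four binary cells can be read out of the matrix, the example below uses made-up toy labels (not data from this page) and relies on scikit-learn ordering the labels as [0, 1], so that `ravel()` yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows = true label, columns = predicted label
tn, fp, fn, tp = cm.ravel()            # binary case with labels sorted as [0, 1]
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Unpacking in the wrong order is a common source of silently swapped metrics, so it is worth checking the counts against a hand tally on a small sample.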
When to use it
Always. It is the foundation for every derived metric. Inspect the raw counts before trusting any single-number summary.
True Positive · True Negative · False Positive · False Negative#
Definitions
These four quantities are the atoms of classification evaluation.
| Symbol | Intuition | Domain example (disease screening) |
|---|---|---|
| TP | Correct positive detection | Test says "sick", patient is sick |
| TN | Correct negative detection | Test says "healthy", patient is healthy |
| FP | False alarm | Test says "sick", patient is healthy |
| FN | Missed detection | Test says "healthy", patient is sick |
The cost of FP and FN is domain-specific: in fraud detection, FN (missed fraud) is often far costlier than FP (flagging a legitimate transaction). Always decide which error is worse before selecting a threshold.
Accuracy#
Formula
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
Intuition: the fraction of all predictions that are correct.
When NOT to use it
Accuracy is misleading for imbalanced datasets. A model that predicts the majority class for every sample achieves high accuracy while being completely useless. Prefer F1, balanced accuracy, AUROC, or class-specific metrics when class frequencies differ substantially.
scikit-learn
from sklearn.metrics import accuracy_score, balanced_accuracy_score
acc = accuracy_score(y_true, y_pred)
b_acc = balanced_accuracy_score(y_true, y_pred) # better for imbalance
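To make the warning concrete, here is a small sketch with an invented 95:5 dataset and a "model" that always predicts the majority class. The class ratio and labels are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that predicts the majority class for every sample
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)             # fraction of correct predictions
b_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(f"accuracy={acc:.2f}, balanced accuracy={b_acc:.2f}")
```

Accuracy rewards the classifier for the 95 easy negatives, while balanced accuracy averages a recall of 1.0 on the negative class with a recall of 0.0 on the positive class.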
Precision (Positive Predictive Value)#
Formula
\[\text{Precision} = \frac{TP}{TP + FP}\]
Intuition: "Of everything I labelled positive, what fraction really was positive?" High precision means few false alarms.
Trade-off: raising the classification threshold usually raises precision but lowers recall.
scikit-learn
from sklearn.metrics import precision_score
# binary
p = precision_score(y_true, y_pred)
# multiclass (macro)
p = precision_score(y_true, y_pred, average='macro')
Recall (Sensitivity · True Positive Rate)#
Formula
\[\text{Recall} = \frac{TP}{TP + FN}\]
Aliases: Sensitivity, True Positive Rate (TPR), Hit Rate.
Intuition: "Of everything that really was positive, what fraction did I catch?" High recall means few missed detections.
Trade-off: lowering the threshold increases recall but typically reduces precision (more false alarms).
scikit-learn
from sklearn.metrics import recall_score
r = recall_score(y_true, y_pred)
r = recall_score(y_true, y_pred, average='macro') # multiclass
Specificity (True Negative Rate)#
Formula
\[\text{Specificity} = \frac{TN}{TN + FP}\]
Intuition: "Of everything that really was negative, what fraction did I correctly label as negative?" It is the complement of the x-axis of the ROC curve: FPR = 1 - Specificity.
Note
scikit-learn does not have a standalone specificity_score.
Use recall_score(y_true, y_pred, pos_label=0) to compute it
for the negative class in a binary problem.
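The following sketch checks, on invented toy labels, that the `recall_score(..., pos_label=0)` approach agrees with computing TN / (TN + FP) directly from the confusion matrix. The variable names `spec_from_cm` and `spec_from_recall` are illustrative only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy binary labels invented for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
spec_from_cm = tn / (tn + fp)                                 # definition of specificity
spec_from_recall = recall_score(y_true, y_pred, pos_label=0)  # recall of the negative class
print(spec_from_cm, spec_from_recall)
```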
F1 Score#
Formula
\[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}\]
The F1 is the harmonic mean of precision and recall. The harmonic mean penalises extreme imbalances between the two: a model with precision = 1.0 and recall = 0.0 gets F1 = 0.0.
When to use it: when both false positives and false negatives matter, and you want a single summary number.
When NOT to use it: when true negatives matter (e.g. spam filtering, where correctly delivering legitimate mail is also valuable), because F1 ignores TN entirely. Use the Matthews Correlation Coefficient (MCC) instead.
scikit-learn
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='macro') # multiclass
F-beta Score#
Formula
\[F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
The scalar \(\beta\) controls the relative weight of recall:
\(\beta < 1\): precision-heavy (F0.5 emphasises precision)
\(\beta = 1\): balanced F1
\(\beta > 1\): recall-heavy (F2 emphasises recall)
Use case: information retrieval, and medical screening (with \(\beta > 1\)) where missing a positive is far costlier than a false alarm.
scikit-learn
from sklearn.metrics import fbeta_score
f2 = fbeta_score(y_true, y_pred, beta=2)
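As a sanity check on how \(\beta\) shifts the balance, this sketch evaluates the standard F-beta expression by hand and compares it with `fbeta_score`. The helper `f_beta` and the toy labels are assumptions made for this example.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels invented for illustration: precision = 2/3, recall = 1/2
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

def f_beta(p, r, beta):
    """Hypothetical helper: weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Map each beta to (hand-computed value, scikit-learn value)
results = {
    beta: (f_beta(p, r, beta), fbeta_score(y_true, y_pred, beta=beta))
    for beta in (0.5, 1.0, 2.0)
}
for beta, (manual, sk) in results.items():
    print(f"beta={beta}: manual={manual:.4f}, sklearn={sk:.4f}")
```

Because precision exceeds recall in this toy case, the score falls as \(\beta\) grows and recall receives more weight.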
Domain 2: ROC Curve & AUROC#
The Receiver Operating Characteristic curve and the Area Under it are the standard threshold-independent evaluation framework for classification models.
ROC Curve#
What is it?
A plot of True Positive Rate (Recall) on the y-axis against False Positive Rate (1 - Specificity) on the x-axis, traced by sweeping the classification threshold from 1.0 down to 0.0.
Bottom-left corner (0, 0): threshold = 1.0, predict everything negative.
Top-right corner (1, 1): threshold = 0.0, predict everything positive.
Diagonal: random classifier (AUROC = 0.5).
Top-left corner (0, 1): perfect classifier (AUROC = 1.0).
Key insight: the shape of the curve exposes the trade-off across all thresholds simultaneously, so you can pick the right operating point for your application after training.
scikit-plots
import scikitplot as skplt
skplt.metrics.plot_roc(
y_true,
y_probas, # shape (n_samples, n_classes)
plot_macro=True,
plot_micro=True,
)
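The corner points described above can be checked numerically. This is a small sketch using scikit-learn's `roc_curve` on invented scores; it only illustrates the sweep and is not part of the scikit-plots API.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores invented for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Thresholds are returned in decreasing order, so the curve runs from (0, 0) to (1, 1)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Each printed row is one candidate operating point; choosing among them is an application decision, not something the curve decides for you.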
AUROC (Area Under the ROC Curve)#
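Formula
The AUROC equals the probability that the model assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example:
\[\text{AUROC} = P\big(s(x^{+}) > s(x^{-})\big)\]
where \(s(x)\) is the model score, \(x^{+}\) a random positive example, and \(x^{-}\) a random negative example.
This interpretation, due to Bamber (1975), makes AUROC a purely rank-based measure: it is invariant to any strictly increasing transformation of the scores.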
Interpretation scale
| AUROC | Interpretation |
|---|---|
| 0.50 | Random guessing, no discriminative power |
| 0.70–0.80 | Acceptable discrimination |
| 0.80–0.90 | Good discrimination |
| 0.90–1.00 | Excellent / near-perfect discrimination |
| 1.00 | Perfect ranking (no real problem is this clean) |
Limitation: AUROC can look optimistic on highly imbalanced datasets, because the FPR is computed relative to the large negative class (FP + TN), so even many false positives barely move the curve. Consider the PR-AUC in that scenario.
scikit-learn
from sklearn.metrics import roc_auc_score
# binary
auc = roc_auc_score(y_true, y_score)
# multiclass (macro OvR)
auc = roc_auc_score(
y_true, y_score_matrix,
multi_class='ovr', average='macro'
)
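The pairwise-ranking interpretation can be verified directly. The sketch below draws synthetic scores (an assumption for illustration), counts the fraction of positive-negative pairs that are ranked correctly, and compares it with `roc_auc_score`. It also checks invariance under a strictly increasing transformation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 60 + [1] * 40)
# Synthetic scores: positives are shifted upwards by 1.5 on average
y_score = rng.normal(loc=y_true * 1.5, scale=1.0)

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]  # every positive-negative pair
# Fraction of correctly ordered pairs; ties count as one half
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = roc_auc_score(y_true, y_score)
auc_exp = roc_auc_score(y_true, np.exp(y_score))  # strictly increasing transform
print(pairwise, auc, auc_exp)
```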
Precision-Recall Curve & PR-AUC#
What is it?
A plot of Precision on the y-axis against Recall on the x-axis as the threshold varies. The area under this curve, usually estimated by the Average Precision (AP), summarises performance.
When to prefer PR-AUC over AUROC
In severely imbalanced datasets (e.g., rare-event detection, fraud, medical screening where positives are < 5 % of data), the PR curve exposes model weaknesses that the ROC curve can hide, because the ROC curve's x-axis (FPR) is diluted by the enormous TN pool.
scikit-plots
skplt.metrics.plot_precision_recall(
y_true, y_probas, plot_micro=True
)
Gini Coefficient (in ML context)#
Relationship to AUROC
In machine learning (particularly credit scoring and finance), the Gini coefficient is defined as:
\[\text{Gini} = 2 \cdot \text{AUROC} - 1\]
It maps the AUROC from the range [0.5, 1.0] to the range [0.0, 1.0]:
| AUROC | Gini | Interpretation |
|---|---|---|
| 0.50 | 0.00 | Random |
| 0.75 | 0.50 | Good |
| 1.00 | 1.00 | Perfect |
When you will see it: credit risk models, insurance, any domain that adopted the Gini metric before AUROC became standard. The two metrics carry equivalent information.
Domain 3: Averaging Strategies & Multiclass Metrics#
When K > 2, every binary metric must be extended to the multi-class case. The choice of averaging strategy changes the answer.
Binary · Multiclass · Multi-label Classification#
Binary
Definition: exactly two mutually exclusive classes, positive (1) and negative (0).
Output: a single probability score per sample, or a binary label.
Example: spam vs. not-spam; disease positive vs. negative.
Multiclass
Definition: K > 2 mutually exclusive classes. Each sample belongs to exactly one class.
Output: a probability vector of length K; the argmax is the predicted class.
Example: MNIST digit recognition (0–9); flower species classification (Iris dataset).
Also called: single-label classification, multi-class classification.
Multi-label
Definition: each sample can belong to multiple classes simultaneously (classes are not mutually exclusive).
Output: a binary indicator vector of length K per sample.
Example: image tagging (a photo can be "dog", "outdoor", "daytime" simultaneously); document topic labelling.
Key difference: the per-class AUROC and F1 are computed independently for each label.
Macro Averaging#
Definition
Compute the metric separately for each class, then take the unweighted (simple) mean across all K classes:
\[M_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} M_k\]
where \(M_k\) is the metric computed for class k.
Properties
Each class contributes equally, regardless of how many samples it has.
Sensitive to performance on minority classes: poor discrimination on a rare class lowers the macro score noticeably.
Best choice when you care about per-class fairness.
Example (Macro F1)
from sklearn.metrics import f1_score
f1_macro = f1_score(y_true, y_pred, average='macro')
Micro Averaging#
Definition
Aggregate the TP, TN, FP, FN counts across all classes into global totals, then compute the metric once from those totals. For precision, for example:
\[\text{Precision}_{\text{micro}} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FP_k)}\]
Properties
Dominated by majority classes because large classes contribute more counts.
For multiclass single-label problems, micro-averaged precision, recall, and F1 are all equal to accuracy.
Best choice when overall sample-level performance matters and class size differences are acceptable.
Example (Micro F1)
from sklearn.metrics import f1_score
f1_micro = f1_score(y_true, y_pred, average='micro')
Weighted Averaging#
Definition
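Compute the metric per class, then take a weighted mean where each class's weight equals its proportion of samples (support):
\[M_{\text{weighted}} = \sum_{k=1}^{K} \frac{n_k}{n} M_k\]
where \(n_k\) is the number of true samples in class k and \(n\) is the total number of samples.
Properties
Reflects the class distribution while still being built from per-class scores.
Reported alongside the macro average in scikit-learn's classification_report.
Can still mask poor minority-class performance if the majority class is large enough.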
Example
from sklearn.metrics import f1_score
f1_weighted = f1_score(y_true, y_pred, average='weighted')
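To see how the three averaging strategies diverge, the following sketch recomputes each one from per-class F1 scores on an invented 3-class example where class 0 dominates. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy 3-class labels: class 0 is the majority (6 of 10 samples)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 0])

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                       # samples per true class

macro = per_class.mean()                            # unweighted mean
weighted = np.average(per_class, weights=support)   # support-weighted mean
micro = f1_score(y_true, y_pred, average='micro')   # from pooled counts

print(f"per-class={np.round(per_class, 3)}")
print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

The macro score is pulled down by the two small classes, while the weighted and micro scores stay closer to the majority-class performance.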
Macro AUROC (Macro-Averaged AUROC)#
Definition
Extend binary AUROC to K classes using the One-vs-Rest (OvR) strategy, then average:
\[\text{AUROC}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{AUROC}(k \text{ vs. rest})\]
Example computation
For a 3-class problem (A, B, C):
| Binary AUROC | Value |
|---|---|
| AUROC(A vs. B+C) | 0.85 |
| AUROC(B vs. A+C) | 0.72 |
| AUROC(C vs. A+B) | 0.65 |
The Macro AUROC is (0.85 + 0.72 + 0.65) / 3 = 0.74. If class C is rare and performs poorly, Macro AUROC reflects this because every class has equal weight.
scikit-learn
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(
y_true, y_probas, multi_class='ovr', average='macro'
)
Micro AUROC#
Definition
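Flatten all one-vs-rest binary predictions into a single long vector of true labels and scores, then compute a single AUROC:
\[\text{AUROC}_{\text{micro}} = \text{AUROC}\big(\operatorname{vec}(Y),\ \operatorname{vec}(S)\big)\]
where \(Y\) is the n × K one-hot label matrix and \(S\) is the n × K score matrix.
Properties
Heavily influenced by majority classes (more samples means more weight).
Provides an overall view of ranking quality across all decisions.
Can look good even when rare classes are poorly ranked.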
Macro vs. Micro β when to use which
| Situation | Recommended |
|---|---|
| Classes should be treated equally | Macro AUROC |
| Overall sample-level ranking matters | Micro AUROC |
| Imbalanced and every class matters | Macro AUROC (reveals minority weaknesses) |
| Weighting by class size is acceptable | Weighted AUROC |
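Both averages can be reproduced by hand. This is a sketch on a synthetic 3-class dataset with a logistic regression (both assumptions chosen for illustration, and evaluated on the training data only to keep the example short): the macro value averages K one-vs-rest AUROCs, the micro value flattens the one-hot labels and score matrix first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Synthetic data and model, for illustration only
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
clf = LogisticRegression(max_iter=1000).fit(X, y)
y_probas = clf.predict_proba(X)            # shape (n_samples, 3)

y_bin = label_binarize(y, classes=[0, 1, 2])  # one-hot true labels

# Macro: one binary AUROC per class (class k vs. rest), then a plain mean
per_class_auc = [roc_auc_score(y_bin[:, k], y_probas[:, k]) for k in range(3)]
macro_by_hand = float(np.mean(per_class_auc))
macro_sklearn = roc_auc_score(y, y_probas, multi_class='ovr', average='macro')

# Micro: flatten labels and scores into long vectors, then one AUROC
micro_by_hand = roc_auc_score(y_bin.ravel(), y_probas.ravel())

print(f"per-class={np.round(per_class_auc, 3)}")
print(f"macro={macro_by_hand:.3f} micro={micro_by_hand:.3f}")
```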
One-vs-Rest (OvR) and One-vs-One (OvO)#
One-vs-Rest (OvR), also called One-vs-All (OvA)
For K classes, train K binary classifiers. Classifier i treats class i as the positive class and all others as the negative class.
| Property | OvR |
|---|---|
| Number of classifiers | K |
| Training set size per classifier | Full dataset (imbalanced: 1 positive class vs. K-1 negative classes) |
| Prediction | Argmax of K confidence scores |
| AUROC computation | Average of K binary AUROCs |
One-vs-One (OvO)
Train one binary classifier for every pair of classes.
| Property | OvO |
|---|---|
| Number of classifiers | K(K-1)/2 |
| Training set size per classifier | Only the two relevant classes (balanced) |
| Prediction | Majority vote over all pairwise classifiers |
| AUROC computation | Average of all pairwise AUROCs (Hand & Till, 2001) |
scikit-learn
from sklearn.metrics import roc_auc_score
# OvR macro
auc_ovr = roc_auc_score(
y_true, y_probas, multi_class='ovr', average='macro'
)
# OvO macro (Hand & Till)
auc_ovo = roc_auc_score(
y_true, y_probas, multi_class='ovo', average='macro'
)
Domain 4: Class Imbalance & Sampling Strategies#
Most real-world classification datasets are imbalanced. The severity ranges from mildly unequal class frequencies (1:2) to extreme imbalance (fraud: 1:10 000). The response strategies fall into three groups: re-weighting, oversampling, and undersampling.
Class Imbalance: Overview#
Definition
A dataset is imbalanced when the class frequencies differ substantially, typically taken as a minority : majority ratio more extreme than 1:5.
Why it matters for metrics
A classifier that predicts the majority class for every sample achieves misleadingly high accuracy. Metrics dominated by the majority class (accuracy, micro-averaged F1) can therefore be poor guides.
Summary of strategies
| Strategy | Mechanism | Best when |
|---|---|---|
| Class weighting | Penalise minority-class errors more | Small to moderate imbalance |
| Random oversampling | Duplicate minority samples | Quick baseline |
| SMOTE | Synthesise minority samples | Feature-space interpolation is valid |
| Random undersampling | Remove majority samples | Very large majority class |
| NearMiss | Keep majority samples closest to minority | Hard-boundary learning |
| Cluster-based undersampling | Keep one majority representative per cluster | Structured majority class |
Class Weighting#
Mechanism
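Assign each sample a loss weight inversely proportional to its class frequency, so that minority-class errors are penalised more heavily during training. With scikit-learn's 'balanced' heuristic the weight of class i is:
\[w_i = \frac{n}{K \cdot n_i}\]
where \(n\) is the total number of samples, \(K\) is the number of classes, and \(n_i\) is the count of class i.
Advantages
No data is discarded or synthesised: the original samples are used as they are.
Straightforward: many scikit-learn estimators accept
class_weight='balanced'.
Limitations: adds no new information about the minority class; it only rescales the training loss, which shifts the decision boundary and can distort the predicted probabilities.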
scikit-learn
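import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Inspect the weights implied by 'balanced': n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Passing this dict is equivalent to class_weight='balanced'
model = LogisticRegression(class_weight=class_weight)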
Oversampling#
Mechanism
Increase the size of the minority class by generating additional samples, either by random duplication or synthetic generation (SMOTE).
Random oversampling: duplicate existing minority samples with replacement until the desired ratio is reached. Risk: overfitting to duplicated points.
When to use: when the minority class is too small to learn meaningful boundaries. Always apply oversampling only to the training set, never to validation or test sets.
SMOTE (Synthetic Minority Over-sampling Technique)#
Definition
SMOTE generates synthetic minority-class samples by interpolating in feature space between an existing minority sample and one of its k nearest minority neighbours:
\[x_{\text{new}} = x_i + \lambda \, (x_{nn} - x_i), \qquad \lambda \sim U(0, 1)\]
where \(x_i\) is a minority sample and \(x_{nn}\) is one of its k nearest minority neighbours, chosen at random.
Properties
Creates genuinely new points (not copies), which tends to overfit less than random oversampling.
Can create noisy samples in overlapping class regions.
Assumes that interpolation in feature space is meaningful, which does not hold for raw categorical features.
Library
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
Note
imbalanced-learn (imblearn) is the standard library for
SMOTE and related techniques. It is not part of scikit-learn but
follows the same estimator API conventions.
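The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a hand-rolled illustration for a single synthetic point, not the imbalanced-learn implementation; the minority samples `X_min` are random data invented for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))  # hypothetical minority-class samples

k = 5
# Ask for k + 1 neighbours because each point is its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neigh_idx = nn.kneighbors(X_min)

i = 0                                   # the minority sample to start from
j = int(rng.choice(neigh_idx[i, 1:]))   # a random one of its k minority neighbours
lam = rng.uniform(0.0, 1.0)             # interpolation factor in [0, 1)

# The synthetic point lies on the segment between X_min[i] and X_min[j]
x_new = X_min[i] + lam * (X_min[j] - X_min[i])
print(x_new)
```

Repeating this for many randomly chosen `i` until the desired class ratio is reached gives the basic oversampling loop.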
Undersampling Strategies#
Undersampling shrinks the majority class, typically until it matches the minority class in size.
Random undersampling
Mechanism: randomly remove majority-class samples.
Risk: discards potentially useful information from the majority class. Use stratified splits to preserve class proportions in validation.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
NearMiss
Mechanism: keep majority-class samples that are closest to minority-class samples (distance-based selection).
NearMiss-1: select majority samples whose average distance to the nearest minority neighbours is smallest.
NearMiss-2: select majority samples whose average distance to the farthest minority neighbours is smallest.
NearMiss-3: for each minority sample, keep its M nearest majority neighbours.
When to use: when you want the retained majority samples to concentrate near the decision boundary (hard-boundary learning).
from imblearn.under_sampling import NearMiss
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X_train, y_train)
Cluster-based undersampling
Mechanism: cluster the majority class with k-means (or another algorithm), then retain one representative per cluster (typically the centroid or the sample closest to it).
Advantages
Preserves the structure of the majority class.
Less information loss than random removal.
Works well when the majority class has natural sub-groups.
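A minimal sketch of the cluster-based idea, assuming k-means with one cluster per minority sample and keeping the real majority sample nearest to each centroid. The data and the choice of cluster count are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(500, 2))           # hypothetical majority class
X_min = rng.normal(loc=3.0, size=(25, 2))   # hypothetical minority class

# One cluster per minority sample, so the reduced majority matches its size
km = KMeans(n_clusters=len(X_min), n_init=10, random_state=0).fit(X_maj)

# For each centroid, find the index of the closest real majority sample
closest_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_maj)
X_maj_reduced = X_maj[closest_idx]

print(X_maj.shape, "->", X_maj_reduced.shape)
```

Keeping real samples rather than the centroids themselves avoids introducing points that never occurred in the data.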
Subsampling#
Dual meaning: "subsampling" appears in two completely different contexts:
Machine learning: selecting a random subset of the dataset for training or evaluation. This is analogous to random undersampling, but often applied for computational efficiency rather than class balancing.
from sklearn.utils import resample
# replace=False draws a true subset; the default (replace=True) is a bootstrap sample
X_sub, y_sub = resample(
    X, y, n_samples=10_000, replace=False, random_state=42
)
Signal processing: reducing the sampling rate of a discrete signal. Also called decimation or downsampling.
Critical rule: always apply a low-pass filter before subsampling to prevent aliasing (see Aliasing & the Nyquist-Shannon Theorem).
import scipy.signal as sps
# Decimate signal by factor 4 (includes anti-aliasing filter)
signal_down = sps.decimate(signal, q=4)
Domain 5: Statistical Foundations#
Probability & Probability Distributions#
Probability: a number in [0, 1] expressing the likelihood of an event. A model's output score is often read as a probability estimate, but it is not necessarily a calibrated probability (see Domain 7: Calibration).
Key distributions in ML
| Distribution | Role in ML |
|---|---|
| Bernoulli | Binary label; output of a binary classifier |
| Categorical | Multiclass label; output of a softmax classifier |
| Gaussian (Normal) | Assumption in linear discriminant analysis, GPs |
| Beta | Prior / posterior for probabilities (Bayesian) |
| Dirichlet | Prior over class probability vectors (Bayesian multiclass) |
Bootstrap Confidence Intervals#
What is it?
A non-parametric resampling method to estimate the sampling distribution, and hence confidence intervals, of any statistic.
Algorithm
1. Draw B bootstrap samples of size n with replacement from the original dataset.
2. Compute the statistic (e.g., AUROC, F1) on each bootstrap sample.
3. The 95 % CI is the (2.5th, 97.5th) percentile of the B computed statistics.
Why it matters for scikit-plots: every metric visualised by scikit-plots can be accompanied by a bootstrap CI to quantify uncertainty.
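import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
B = 1000
y_true, y_score = np.asarray(y_true), np.asarray(y_score)
n = len(y_true)

aucs = []
for _ in range(B):
    idx = rng.integers(n, size=n)  # resample indices with replacement
    if np.unique(y_true[idx]).size < 2:
        continue  # AUROC is undefined when a resample contains one class only
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")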
Mann-Whitney U Test (Wilcoxon Rank-Sum Test)#
What is it?
A non-parametric test for whether two independent samples were drawn from the same distribution. It makes no assumptions about the underlying distribution (no normality required).
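Relationship to AUROC: the Mann-Whitney U statistic equals the AUROC up to a normalising constant:
\[\text{AUROC} = \frac{U}{n_{+} \, n_{-}}\]
where \(U\) is the Mann-Whitney statistic computed for the positive-class scores, and \(n_{+}\) and \(n_{-}\) are the numbers of positive and negative samples. This confirms the AUROC's rank-statistic interpretation.
Use in model comparison: the test itself checks whether one model scores positives higher than negatives. To test whether model A's AUROC is significantly greater than model B's on the same data, use a permutation test or DeLong's method.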
from scipy.stats import mannwhitneyu
stat, p_value = mannwhitneyu(
scores_positive, scores_negative, alternative='greater'
)
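The link between U and the AUROC can be checked numerically. In this sketch U is computed from rank sums via `scipy.stats.rankdata` rather than read off `mannwhitneyu`, so that the convention (U for the positive sample) is explicit; the scores are synthetic and invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
scores_positive = rng.normal(1.0, 1.0, size=40)  # synthetic positive-class scores
scores_negative = rng.normal(0.0, 1.0, size=60)  # synthetic negative-class scores

n_pos, n_neg = len(scores_positive), len(scores_negative)
all_scores = np.concatenate([scores_positive, scores_negative])
ranks = rankdata(all_scores)  # average ranks, 1-based

# U for the positive sample: its rank sum minus the smallest possible rank sum
U = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
auc_from_u = U / (n_pos * n_neg)

y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
auc = roc_auc_score(y_true, all_scores)
print(auc_from_u, auc)
```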
Domain 6: Fairness & Bias Metrics#
Fairness metrics quantify whether a model treats different demographic groups equitably. No single definition of fairness is universally correct: the appropriate criterion depends on the application and its societal context.
Demographic Parity (Statistical Parity)#
Definition
The positive prediction rate is equal across demographic groups, here A and B:
\[P(\hat{Y} = 1 \mid G = A) = P(\hat{Y} = 1 \mid G = B)\]
Interpretation: the selection rate (e.g., loan approval, job interview, parole) is independent of group membership.
Limitation: if the base rate of the positive class differs between groups (due to legitimate factors), forcing equal selection rates may require ignoring relevant features.
When to apply: allocation decisions where equal access is the primary concern (e.g., advertising, content recommendation).
Equal Opportunity#
Definition
The True Positive Rate (Recall) is equal across groups:
\[P(\hat{Y} = 1 \mid Y = 1, G = A) = P(\hat{Y} = 1 \mid Y = 1, G = B)\]
Interpretation: among truly qualified/positive individuals, the model identifies them at the same rate regardless of group.
Use case: hiring, academic admission, loan approval, where it is critical that deserving applicants are equally detected across groups.
Equalized Odds#
Definition
Both the True Positive Rate and the False Positive Rate are equal across groups:
\[P(\hat{Y} = 1 \mid Y = y, G = A) = P(\hat{Y} = 1 \mid Y = y, G = B), \qquad y \in \{0, 1\}\]
Interpretation: a stronger requirement than equal opportunity, which it implies: the model must treat both positive and negative individuals equally across groups.
Trade-off: equalized odds and demographic parity cannot both hold in general when base rates differ between groups (except in degenerate cases). You must choose which criterion fits the application.
Predictive Parity (Calibration Fairness)#
Definition
The Positive Predictive Value (Precision) is equal across groups:
\[P(Y = 1 \mid \hat{Y} = 1, G = A) = P(Y = 1 \mid \hat{Y} = 1, G = B)\]
Interpretation: when the model predicts "positive" for a member of either group, the prediction is equally trustworthy.
Use case: risk scoring tools (recidivism, credit risk) where the model score is used as a probability estimate and must be equally reliable for all groups.
Note
Predictive parity and equalized odds cannot both be satisfied simultaneously unless base rates are equal across groups (the Chouldechova impossibility result, 2017).
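The four criteria in this domain all compare simple per-group rates. As a hedged sketch, the helper below (`group_rates`, a hypothetical function written for this example) computes the selection rate, TPR, FPR, and PPV for two invented groups, so that the gaps relevant to each criterion can be read off.

```python
import numpy as np

# Toy data invented for illustration: two groups of six individuals each
y_true = np.array([1, 1, 0, 0, 1, 0,  1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0,  1, 0, 0, 0, 1, 0])
group = np.array(["A"] * 6 + ["B"] * 6)

def group_rates(y_true, y_pred, mask):
    """Hypothetical helper: confusion-matrix rates restricted to one group."""
    yt, yp = y_true[mask], y_pred[mask]
    tp = np.sum((yt == 1) & (yp == 1))
    fp = np.sum((yt == 0) & (yp == 1))
    fn = np.sum((yt == 1) & (yp == 0))
    tn = np.sum((yt == 0) & (yp == 0))
    return {
        "selection_rate": (tp + fp) / len(yt),  # demographic parity
        "tpr": tp / (tp + fn),                  # equal opportunity
        "fpr": fp / (fp + tn),                  # equalized odds (with TPR)
        "ppv": tp / (tp + fp),                  # predictive parity
    }

rates = {g: group_rates(y_true, y_pred, group == g) for g in ("A", "B")}
for name in ("selection_rate", "tpr", "fpr", "ppv"):
    gap = rates["A"][name] - rates["B"][name]
    print(f"{name}: A={rates['A'][name]:.3f} B={rates['B'][name]:.3f} gap={gap:+.3f}")
```

A gap of zero in a given row means the corresponding criterion holds on this sample; in practice the gaps should be reported with uncertainty estimates, since group sizes are often small.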
Domain 7: Calibration#
A model is calibrated if its output probabilities match empirical event frequencies. Calibration is independent of discrimination (AUROC): a model can rank perfectly but be poorly calibrated, or be well-calibrated but with low AUROC.
Calibration & Reliability Diagrams#
What is it?
A reliability diagram (calibration curve) plots the mean predicted probability (x-axis) against the observed event rate (y-axis) for bins of predictions. A perfectly calibrated model lies on the diagonal y = x.
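Over-confident: predicted probabilities are too high (curve below the diagonal).
Under-confident: predicted probabilities are too low (curve above the diagonal).
Brier Score
A calibration-sensitive proper scoring rule:
\[\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2\]
where \(\hat{p}_i\) is the predicted probability of the positive class and \(y_i \in \{0, 1\}\) is the true label.
Lower is better; 0 = perfect; 0.25 = the score of always predicting 0.5 on a binary problem.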
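The Brier score is just a mean squared error on probabilities, which this sketch confirms on invented predictions by comparing a hand computation with scikit-learn's `brier_score_loss`. It also reproduces the 0.25 reference value for a constant prediction of 0.5.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy labels and predicted probabilities invented for illustration
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.9, 0.6, 0.3, 0.8, 0.4])

brier_manual = np.mean((y_prob - y_true) ** 2)  # mean squared error on probabilities
brier = brier_score_loss(y_true, y_prob)

# Reference point: always predicting 0.5 gives (0.5)^2 = 0.25 for every sample
brier_uninformative = brier_score_loss(y_true, np.full(len(y_true), 0.5))
print(brier_manual, brier, brier_uninformative)
```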
scikit-plots
skplt.metrics.plot_calibration_curve(
y_true,
[y_prob_model_1, y_prob_model_2],
clf_names=['Model 1', 'Model 2'],
n_bins=10,
)
Post-hoc calibration
from sklearn.calibration import CalibratedClassifierCV
cal = CalibratedClassifierCV(base_clf, method='isotonic', cv=5)
cal.fit(X_train, y_train)
cal_probs = cal.predict_proba(X_test)
Domain 8: Signal Processing & Time Series#
Relevant when scikit-plots is used to evaluate models applied to sequential or temporal data.
Time Series#
Definition
A sequence of observations ordered in time, \(\{x_t\}_{t=1}^{T}\). Key properties:
Stationarity: statistical properties (mean, variance) do not change over time.
Autocorrelation: observations at time t are correlated with observations at \(t - k\) (lag k).
Seasonality: periodic patterns (daily, weekly, yearly).
Evaluation difference from i.i.d. data
Evaluating ML models on time series requires temporal cross-validation (walk-forward validation), not random K-fold, to avoid data leakage from the future into the training set.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Each test fold lies strictly after its training fold in time
    X_tr, X_te = X[train_idx], X[test_idx]
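To see why walk-forward validation avoids leakage, this sketch prints the index ranges of each fold for a toy series of 20 time steps (an invented example): the training window always ends before the test window starts, and it grows from fold to fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # toy series: 20 consecutive time steps

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for k, (train_idx, test_idx) in enumerate(folds):
    print(
        f"fold {k}: train = 0..{train_idx.max()}, "
        f"test = {test_idx.min()}..{test_idx.max()}"
    )
```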
Signal Processing#
Definition
The analysis, synthesis, and transformation of signals, including audio, sensor data, EEG, accelerometer readings, and financial tick data.
Key concepts for ML practitioners
Sampling rate \(f_s\): how many samples per second.
Nyquist frequency \(f_N = f_s / 2\): the highest frequency representable without aliasing.
Frequency domain: the Fourier transform converts a time-domain signal to its frequency components.
Low-pass Filtering#
Definition
A filter that attenuates frequencies above a cut-off frequency \(f_c\) and passes frequencies below it.
Why it is required before subsampling
If you reduce the sampling rate by factor d without filtering, frequency components above \(f_s / (2d)\) fold back into the representable range. This is aliasing, and it causes irreversible distortion.
SciPy
import scipy.signal as sps

# Design a 4th-order Butterworth low-pass filter; Wn is a fraction of the Nyquist frequency
b, a = sps.butter(N=4, Wn=0.25, btype='low')
# Zero-phase filtering (forward and backward pass)
filtered = sps.filtfilt(b, a, signal)

# Alternatively, decimate the raw signal: applies its own anti-aliasing filter, then downsamples
downsampled = sps.decimate(signal, q=4)
Aliasing & the Nyquist-Shannon Theorem#
Nyquist-Shannon Sampling Theorem
A continuous signal that has no frequency component above \(f_{\max}\) can be perfectly reconstructed from its samples if the sampling rate satisfies:
\[f_s > 2 f_{\max}\]
Aliasing: when \(f_s < 2 f_{\max}\), high-frequency components "fold" into lower-frequency aliases, creating distortion that cannot be undone post-hoc.
Practical rule
Before any downsampling by factor d, apply a low-pass filter
with cut-off \(f_c \leq f_s / (2d)\). The scipy.signal.decimate
function does this automatically.
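Aliasing is easy to demonstrate numerically. In this sketch (the frequencies are chosen purely for illustration) a 9 Hz sine sampled at 10 Hz, well below the required 18 Hz, produces exactly the same samples as a sign-flipped 1 Hz sine, so the two signals cannot be told apart after sampling.

```python
import numpy as np

fs = 10.0                      # sampling rate in Hz (Nyquist frequency: 5 Hz)
t = np.arange(50) / fs         # 50 sample instants

f_high = 9.0                   # above the Nyquist frequency
f_alias = abs(f_high - fs)     # folds down to 1 Hz

x_high = np.sin(2 * np.pi * f_high * t)
x_alias = -np.sin(2 * np.pi * f_alias * t)  # sign flip comes from the fold

print(np.allclose(x_high, x_alias, atol=1e-9))
```

Once the samples coincide, no post-processing can recover which frequency was originally present, which is why the low-pass filter has to come before the rate reduction.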
Quick Reference: Metric Selector#
Use this table to choose the right metric for your problem:
| Situation | Avoid | Use Instead | If Multiclass | If Imbalanced |
|---|---|---|---|---|
| Balanced binary classification | - | F1, AUROC | Macro F1, OvR AUROC | Macro F1 |
| Severely imbalanced binary | Accuracy | PR-AUC, F1 | Macro F1 | PR-AUC |
| All classes equally important | Micro avg. | Macro avg. | Macro AUROC | Macro AUROC |
| Overall sample-level performance | Macro avg. | Micro avg. | Micro F1 | Check macro too |
| Probability quality (not just ranking) | AUROC alone | AUROC + Brier Score | Calibration curve | Brier Score |
| Fairness audit required | Global accuracy | Group-level TPR/FPR | Equal Opportunity | Equalized Odds |
Sources#
The following sources were consulted in preparing this page. All links were verified as of the documentation build date.
Core API & Framework Documentation
scikit-learn, sklearn.metrics module reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
imbalanced-learn, SMOTE and sampling API reference: https://imbalanced-learn.org/stable/references/index.html
SciPy, scipy.signal for digital signal processing: https://docs.scipy.org/doc/scipy/reference/signal.html
Authoritative Papers & Textbooks
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4), 387–415. https://doi.org/10.1016/0022-2496(75)90001-2
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186. https://doi.org/10.1023/A:1010920819831
Chawla, N. V. et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Chouldechova, A. (2017). Fair prediction with disparate impact. Big Data, 5(2), 153–163. https://doi.org/10.1089/big.2016.0047
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the IRE, 37(1), 10–21. https://doi.org/10.1109/jrproc.1949.232969
Learning Resources
Google Machine Learning Crash Course, Classification: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
insightful-data-lab.com, Terminology category (source of context and domain framing for this page): https://insightful-data-lab.com/category/00terminology/
scikit-plots documentation, Metrics API: https://scikit-plots.github.io/dev/apis/index.html