📖 Terminology Reference
Your complete guide to ML & Data Science concepts in scikit-plots
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC

Terminology#

This reference organises every machine-learning and data-science term you will encounter when using scikit-plots — from the most elementary ideas (What is a True Positive?) to expert-level subtleties (Macro-averaged AUROC in imbalanced multiclass problems).

Each entry answers four questions concisely:

What is it? — a plain-English definition.
Formula / Key Relationship — the exact mathematical statement.
When to use it — the right context.
scikit-plots / scikit-learn connection — the API call that produces or consumes this concept.

Use the level tabs below to start at the depth that suits you, then follow cross-references to go deeper.

Note

Terms are grouped by domain, not alphabetically, so that related concepts appear together. Use your browser’s Ctrl + F or the Sphinx search to jump to a specific term.

Discovery at a Glance#

🟢 Start Here — Foundations

Core building blocks every practitioner must know. No formulas required — just intuition.

📋 Confusion Matrix

The 2×2 (or K×K) table that underpins every classification metric. Start here.

Domain 1 — Confusion Matrix & Core Metrics

🎯 Precision & Recall

The fundamental trade-off: catching more positives vs. trusting your predictions.

Precision (Positive Predictive Value)

⚖️ F1 Score

The harmonic mean of precision and recall — a single number that balances both.

F1 Score

🏷️ Classification Types

Binary, multiclass, multi-label — which problem are you actually solving?

Binary · Multiclass · Multi-label Classification

📉 Data Imbalance

When one class dominates — oversampling, undersampling, and class weighting.

Domain 4 — Class Imbalance & Sampling Strategies

📈 ROC Curve

The performance landscape across every classification threshold at once.

Domain 2 — ROC Curve & AUROC

🔵 Go Deeper — Metrics

Intermediate concepts for practitioners building and evaluating real models.

📐 Averaging Strategies

Macro vs. Micro vs. Weighted — how single numbers are derived from per-class scores.

Domain 3 — Averaging Strategies & Multiclass Metrics

🔢 Multiclass AUROC

Extending the ROC framework from two classes to K classes with OvR and OvO strategies.

Domain 3 — Averaging Strategies & Multiclass Metrics

🧪 SMOTE & Sampling

Synthetic minority oversampling, NearMiss, cluster-based strategies — when and how.

Oversampling

📊 Statistical Tests

Bootstrap CIs, Mann-Whitney U, and other tools for comparing models rigorously.

Domain 5 — Statistical Foundations

🎛️ Calibration

Does P̂ = 0.8 really mean 80 % likely? Reliability diagrams and calibration curves.

Domain 7 — Calibration

⚡ Signal Processing

Subsampling, downsampling, aliasing, low-pass filtering — for time-series and DSP work.

Domain 8 — Signal Processing & Time Series

🔴 Expert — Advanced Concepts

Nuanced topics for senior practitioners, researchers, and contributors.

⚖️ Fairness Metrics

Demographic parity, equal opportunity, equalized odds, predictive parity — choosing the right fairness criterion.

Domain 6 — Fairness & Bias Metrics

🔄 OvR vs. OvO

One-vs-Rest and One-vs-One decomposition strategies and their impact on AUROC computation.

One-vs-Rest (OvR) and One-vs-One (OvO)

📉 Gini Coefficient

The relationship between Gini index and AUROC — and when Gini is the preferred reporting metric.

Gini Coefficient (in ML context)

🧬 Bootstrap CIs

Constructing confidence intervals for any metric without parametric assumptions.

Bootstrap Confidence Intervals

🔬 Imbalance + Fairness

When class imbalance interacts with group fairness — the hidden pitfalls.

Domain 6 — Fairness & Bias Metrics

📡 Aliasing & Nyquist

Why subsampling without a low-pass filter corrupts signals — the Nyquist-Shannon theorem.

Aliasing & the Nyquist-Shannon Theorem

Domain 1 — Confusion Matrix & Core Metrics#

The confusion matrix is the single most important data structure in classification evaluation. All threshold-based metrics derive from its four cells.

Confusion Matrix#

What is it?

A square table that tallies the agreement and disagreement between a classifier’s predicted labels and the true labels on a held-out dataset. For a binary problem it has four cells:

Cell	Full Name	Meaning
TP	True Positive	Predicted positive, actually positive
TN	True Negative	Predicted negative, actually negative
FP	False Positive (Type I error)	Predicted positive, actually negative
FN	False Negative (Type II error)	Predicted negative, actually positive

For a K-class problem the matrix is K×K: row i, column j counts samples with true label i predicted as label j. The diagonal contains correct predictions.
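To make the K×K layout concrete, here is a minimal sketch using made-up labels for a 3-class problem (the label values are illustrative only). It shows that rows index the true label, columns index the predicted label, and the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a 3-class problem (classes 0, 1, 2)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# Rows index the true label, columns the predicted label
cm = confusion_matrix(y_true, y_pred)
print(cm)

# The diagonal counts correct predictions, so trace / total = accuracy
n_correct = int(np.trace(cm))
accuracy = n_correct / cm.sum()
print(n_correct, accuracy)
```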

scikit-plots connection

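from sklearn.metrics import confusion_matrix
import scikitplot as skplt

# y_true, y_pred: 1-D arrays of true and predicted labels
# Raw counts: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

# Plot normalised confusion matrix
skplt.metrics.plot_confusion_matrix(
    y_true, y_pred, normalize=True
)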

When to use it

Always — it is the foundation for every derived metric. Inspect the raw counts before trusting any single-number summary.

True Positive · True Negative · False Positive · False Negative#

Definitions

These four quantities are the atoms of classification evaluation.

Symbol	Intuition	Domain example (disease screening)
TP	Correct positive detection	Test says “sick”, patient is sick ✅
TN	Correct negative detection	Test says “healthy”, patient is healthy ✅
FP	False alarm	Test says “sick”, patient is healthy ❌
FN	Missed detection	Test says “healthy”, patient is sick ❌

The cost of FP and FN is domain-specific — in fraud detection, FN (missed fraud) is often far costlier than FP (flagging a legitimate transaction). Always decide which error is worse before selecting a threshold.
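The four cells can be read straight off a binary confusion matrix. The sketch below uses invented screening labels, and the two cost values are purely hypothetical placeholders to show how an asymmetric error cost might be tallied.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical screening results: 1 = sick, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For labels ordered [0, 1], ravel() yields the cells as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

# Hypothetical costs: a missed case is taken to be 10x worse than a false alarm
COST_FP, COST_FN = 1, 10
total_cost = fp * COST_FP + fn * COST_FN
print(total_cost)
```

Changing the classification threshold shifts counts between FP and FN, so a cost tally like this is one way to compare candidate thresholds.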

Domain 2 — ROC Curve & AUROC#

The Receiver Operating Characteristic (ROC) curve and the area under it (AUROC) together form the standard threshold-independent evaluation framework for classification models.

AUROC (Area Under the ROC Curve)#

Formula

The AUROC equals the probability that the model assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example:

\[\text{AUROC} = P\!\bigl(\hat{s}_{\text{pos}} > \hat{s}_{\text{neg}}\bigr)\]

This interpretation, due to Bamber (1975), makes AUROC a purely rank-based measure: it is invariant to any strictly increasing transformation of the scores. When a positive and a negative score tie, the pair counts as one half.
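The ranking interpretation can be checked directly on a toy example. This sketch, with made-up scores, compares every positive score against every negative score and confirms that the resulting fraction matches `roc_auc_score`, and that rescaling the scores with an increasing function leaves the value unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score;
# ties (none in this data) count as one half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, y_score)
# A strictly increasing transform keeps the ranking, hence the AUROC
sk_auc_exp = roc_auc_score(y_true, np.exp(5 * y_score))
print(pairwise_auc, sk_auc, sk_auc_exp)
```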

Interpretation scale

AUROC	Interpretation
0.50	Random guessing — no discriminative power
0.70 – 0.80	Acceptable discrimination
0.80 – 0.90	Good discrimination
0.90 – 1.00	Excellent / near-perfect discrimination
1.00	Perfect ranking (no real problem is this clean)

Limitation — AUROC can look optimistic on highly imbalanced datasets: the false positive rate FP / (FP + TN) is dominated by the large TN count, so even a sizeable number of false positives barely moves the curve. Consider PR-AUC in that scenario.
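A constructed example may help to see the gap. In this sketch the data is artificial: 1000 negatives, 10 positives, and scores chosen so that exactly 50 negatives outrank every positive. `average_precision_score` is used here as one common estimate of PR-AUC, which is an assumption about what "PR-AUC" should mean in practice.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Constructed example: 1000 negatives with scores 0.000 ... 0.999
neg_scores = np.arange(1000) / 1000
# 10 positives that all score 0.9495, so exactly 50 negatives outrank them
pos_scores = np.full(10, 0.9495)

y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])
y_score = np.concatenate([neg_scores, pos_scores])

# 950 of the 1000 negatives rank below every positive
auroc = roc_auc_score(y_true, y_score)
# Precision is only 10 / 60 at the threshold where the positives are recovered
pr_auc = average_precision_score(y_true, y_score)
print(f"AUROC={auroc:.3f}  PR-AUC={pr_auc:.3f}")
```

The AUROC looks strong even though five out of six flagged samples at that threshold would be false alarms, which the precision-based summary exposes.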

scikit-learn

from sklearn.metrics import roc_auc_score

# binary
auc = roc_auc_score(y_true, y_score)
# multiclass (macro OvR)
auc = roc_auc_score(
    y_true, y_score_matrix,
    multi_class='ovr', average='macro'
)

Gini Coefficient (in ML context)#

Relationship to AUROC

In machine learning (particularly credit scoring and finance), the Gini coefficient is defined as:

\[\text{Gini} = 2 \cdot \text{AUROC} - 1\]

It maps AUROC values in the range [0.5, 1.0] to the range [0.0, 1.0]; a model that ranks worse than random (AUROC < 0.5) receives a negative Gini:

AUROC	Gini	Interpretation
0.50	0.00	Random
0.75	0.50	Good
1.00	1.00	Perfect

When you will see it — credit risk models, insurance, any domain that adopted the Gini metric before AUROC became standard. The two metrics carry equivalent information.
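Because the relationship is a simple linear map, converting between the two is a one-liner in each direction. The helper names below are hypothetical and not part of scikit-learn or scikit-plots.

```python
def gini_from_auroc(auroc: float) -> float:
    """Convert an AUROC value to the ML Gini coefficient (2 * AUROC - 1)."""
    return 2.0 * auroc - 1.0


def auroc_from_gini(gini: float) -> float:
    """Inverse mapping: recover the AUROC from a Gini coefficient."""
    return (gini + 1.0) / 2.0


# Reproduce the rows of the table above
for auroc in (0.50, 0.75, 1.00):
    print(auroc, gini_from_auroc(auroc))
```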

Domain 3 — Averaging Strategies & Multiclass Metrics#

When K > 2, every binary metric must be extended to the multiclass case. The choice of averaging strategy changes the answer.

Macro AUROC (Macro-Averaged AUROC)#

Definition

Extend binary AUROC to K classes using the One-vs-Rest (OvR) strategy, then average:

\[\text{AUROC}_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} \text{AUROC}(\text{class}_i \text{ vs. rest})\]

Example computation

For a 3-class problem (A, B, C):

Binary AUROC	Value
AUROC(A vs. B+C)	0.85
AUROC(B vs. A+C)	0.72
AUROC(C vs. A+B)	0.65

\[\text{AUROC}_{\text{macro}} = \frac{0.85 + 0.72 + 0.65}{3} = 0.74\]

If class C is rare and performs poorly, Macro AUROC still reflects this, because every class has equal weight regardless of its size.

scikit-learn

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(
    y_true, y_probas, multi_class='ovr', average='macro'
)
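To see what the macro OvR call computes, the following sketch builds synthetic 3-class scores (random data, not a real model), computes one binary AUROC per class by hand, and checks that their plain mean agrees with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 3

# Synthetic labels and softmax scores that favour the true class
y_true = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(size=(n_samples, n_classes))
logits[np.arange(n_samples), y_true] += 1.5
y_probas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One binary AUROC per class: class i vs. rest
y_bin = label_binarize(y_true, classes=list(range(n_classes)))
per_class = [roc_auc_score(y_bin[:, i], y_probas[:, i]) for i in range(n_classes)]
manual_macro = float(np.mean(per_class))

sk_macro = roc_auc_score(y_true, y_probas, multi_class="ovr", average="macro")
print(per_class, manual_macro, sk_macro)
```

Printing the per-class values alongside the average is a cheap way to spot a weak class that the single number hides.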

Micro AUROC#

Definition

Flatten all one-vs-rest binary predictions into a single long vector of true labels and scores, then compute a single AUROC:

\[\text{AUROC}_{\text{micro}} = \text{AUROC}\!\left(\, \bigoplus_{i=1}^{K} y_i^{(\text{bin})},\; \bigoplus_{i=1}^{K} \hat{s}_i \right)\]

Properties

Heavily influenced by majority classes (more samples → more weight).
Provides an overall view of ranking quality across all decisions.
Can look good even when rare classes are poorly ranked.
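The flattening step can be written out explicitly. This is a sketch on synthetic scores (random data, not a real model): the one-vs-rest indicator matrix and the score matrix are each raveled into one long vector, and a single binary AUROC is computed on the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(1)
n_samples, n_classes = 300, 3

# Synthetic labels and softmax scores that favour the true class
y_true = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(size=(n_samples, n_classes))
logits[np.arange(n_samples), y_true] += 1.5
y_probas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-vs-rest indicator matrix, shape (n_samples, n_classes)
y_bin = label_binarize(y_true, classes=list(range(n_classes)))

# Flatten labels and scores into long vectors, then compute one AUROC
micro_auc = roc_auc_score(y_bin.ravel(), y_probas.ravel())
print(micro_auc)
```

Each sample contributes K (label, score) pairs to the flattened vectors, which is why classes with more samples carry more weight.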

Macro vs. Micro — when to use which

Situation	Recommended
Classes should be treated equally	Macro AUROC
Overall sample-level ranking matters	Micro AUROC
Imbalanced and every class matters	Macro AUROC (reveals minority weaknesses)
Weighting classes by their size is acceptable	Weighted AUROC

One-vs-Rest (OvR) and One-vs-One (OvO)#

One-vs-Rest (OvR) — also called One-vs-All (OvA)

For K classes, train K binary classifiers. Classifier i treats class i as the positive class and all others as the negative class.

Property	OvR
Number of classifiers	K
Training set size per classifier	Full dataset (imbalanced: 1 positive class vs. K−1 negative classes)
Prediction	Argmax of K confidence scores
AUROC computation	Average of K binary AUROCs

One-vs-One (OvO)

Train one binary classifier for every pair of classes.

Property	OvO
Number of classifiers	K(K−1)/2
Training set size per classifier	Only samples from the two relevant classes (usually closer to balanced)
Prediction	Majority vote over all pairwise classifiers
AUROC computation	Average of all pairwise AUROCs (Hand & Till, 2001)

scikit-learn

from sklearn.metrics import roc_auc_score

# OvR macro
auc_ovr = roc_auc_score(
    y_true, y_probas, multi_class='ovr', average='macro'
)
# OvO macro (Hand & Till)
auc_ovo = roc_auc_score(
    y_true, y_probas, multi_class='ovo', average='macro'
)
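The difference in the number of binary sub-problems is easy to enumerate. The two helper functions below are hypothetical illustrations, not scikit-learn APIs; they only list which classes each binary task would compare.

```python
from itertools import combinations


def ovr_tasks(classes):
    """One binary task per class: (positive class, list of the remaining classes)."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def ovo_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = ["A", "B", "C", "D"]
print(len(ovr_tasks(classes)))  # K tasks
print(len(ovo_tasks(classes)))  # K(K-1)/2 tasks
print(ovo_tasks(classes))
```

OvO grows quadratically with K, so for problems with many classes the pairwise average involves far more binary AUROCs than OvR.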

Domain 4 — Class Imbalance & Sampling Strategies#

Most real-world classification datasets are imbalanced. The severity ranges from mildly unequal class frequencies (1:2) to extreme imbalance (fraud: 1:10 000). The response strategies fall into three groups: re-weighting, oversampling, and undersampling.

Class Imbalance — Overview#

Definition

A dataset is imbalanced when the class frequencies differ substantially. There is no universal cut-off; a common rule of thumb is a minority : majority ratio more extreme than 1:5.

Why it matters for metrics

A classifier that predicts the majority class for every sample achieves misleadingly high accuracy: with a 1:99 split it scores 99 % while detecting no minority samples. Accuracy and other sample-weighted summaries (such as Micro F1) can therefore be poor guides.
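The effect is easy to reproduce. In this sketch the labels are invented (a 95:5 split) and the "model" simply predicts the majority class every time; accuracy looks strong while the class-aware metrics do not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical 95:5 dataset; the "model" always predicts the majority class 0
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)               # 95 of 100 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
minority_recall = recall_score(y_true, y_pred)     # no positive is ever found
print(acc, bal_acc, minority_recall)
```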

Summary of strategies

Strategy	Mechanism	Best when
Class weighting	Penalise minority-class errors more	Small to moderate imbalance
Random oversampling	Duplicate minority samples	Quick baseline
SMOTE	Synthesise minority samples	Feature-space interpolation is valid
Random undersampling	Remove majority samples	Very large majority class
NearMiss	Keep majority samples closest to minority	Hard-boundary learning
Cluster-based undersampling	Keep one majority representative per cluster	Structured majority class
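Two of the simpler strategies in the table can be sketched with scikit-learn utilities alone. The dataset below is invented (90:10 split, one dummy feature); SMOTE, NearMiss and cluster-based undersampling live in imbalanced-learn and are not shown here.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 90:10 dataset with a single dummy feature
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100, dtype=float).reshape(-1, 1)

# Class weighting: "balanced" uses n_samples / (n_classes * class_count),
# so the minority class receives the larger weight
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# Random oversampling: draw minority rows with replacement until classes match
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=90, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))
```

Any resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class distribution.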

Domain 5 — Statistical Foundations#

Probability & Probability Distributions#

Probability — a number in [0, 1] expressing the likelihood of an event. A model’s output score is often read as a probability estimate, but it is not necessarily a calibrated probability (see Domain 7 — Calibration).

Key distributions in ML

Distribution	Role in ML
Bernoulli	Binary label; output of a binary classifier
Categorical	Multiclass label; output of a softmax classifier
Gaussian (Normal)	Assumption in linear discriminant analysis, GPs
Beta	Prior / posterior for probabilities (Bayesian)
Dirichlet	Prior over class probability vectors (Bayesian multiclass)

Domain 6 — Fairness & Bias Metrics#

Fairness metrics quantify whether a model treats different demographic groups equitably. No single definition of fairness is universally correct — the appropriate criterion depends on the application and its societal context.
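As a rough sketch of what a group-level audit computes, the hypothetical helper below returns the selection rate, TPR and FPR per group for binary predictions. The data and group names are invented, and the function assumes every group contains both positive and negative samples.

```python
import numpy as np


def group_rates(y_true, y_pred, group):
    """Return {group: (selection_rate, TPR, FPR)} for binary 0/1 labels.

    Assumes each group has at least one positive and one negative sample.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        selection = yp.mean()       # P(pred = 1 | group)
        tpr = yp[yt == 1].mean()    # P(pred = 1 | y = 1, group)
        fpr = yp[yt == 0].mean()    # P(pred = 1 | y = 0, group)
        rates[str(g)] = (float(selection), float(tpr), float(fpr))
    return rates


# Invented audit data for two groups, "a" and "b"
y_true = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
group = ["a"] * 6 + ["b"] * 6

rates = group_rates(y_true, y_pred, group)
parity_gap = abs(rates["a"][0] - rates["b"][0])       # demographic parity difference
opportunity_gap = abs(rates["a"][1] - rates["b"][1])  # equal opportunity (TPR) difference
print(rates, parity_gap, opportunity_gap)
```

Equalized odds would additionally compare the FPR entries across groups.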

Domain 7 — Calibration#

A model is calibrated if its output probabilities match empirical event frequencies. Calibration is independent of discrimination (AUROC): a model can rank perfectly but be poorly calibrated, or be well-calibrated but with low AUROC.
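The independence of calibration and discrimination can be illustrated with simulated data. In this sketch the "true" probabilities are drawn at random, and a second set of scores is obtained by cubing them: the ranking (and so the AUROC) is unchanged, while the Brier score gets worse. The cubic distortion is an arbitrary choice made only for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Simulated "true" probabilities and labels drawn from them
p = rng.uniform(0.0, 1.0, size=5000)
y = (rng.uniform(size=p.size) < p).astype(int)

calibrated = p        # matches the event frequencies by construction
distorted = p ** 3    # same ranking, but probabilities are systematically too low

auc_cal, auc_dis = roc_auc_score(y, calibrated), roc_auc_score(y, distorted)
brier_cal, brier_dis = brier_score_loss(y, calibrated), brier_score_loss(y, distorted)
print(auc_cal, auc_dis)
print(brier_cal, brier_dis)

# Reliability-diagram data: observed frequency vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y, distorted, n_bins=10)
```

Plotting `frac_pos` against `mean_pred` gives the reliability diagram; for the distorted scores the points sit above the diagonal because the predicted probabilities understate the observed frequency.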

Domain 8 — Signal Processing & Time Series#

Relevant when scikit-plots is used to evaluate models applied to sequential or temporal data.
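For the aliasing topic listed under this domain, a small numerical sketch may be useful. It assumes a synthetic 90 Hz tone sampled at 200 Hz and a downsampling factor of 4 (all values chosen for illustration), and compares naive subsampling with `scipy.signal.decimate`, which applies a low-pass filter before downsampling.

```python
import numpy as np
from scipy.signal import decimate

fs = 200.0                          # original sampling rate in Hz
t = np.arange(400) / fs             # 2 seconds of samples
x = np.sin(2 * np.pi * 90.0 * t)    # 90 Hz tone, below the original Nyquist limit of 100 Hz

q = 4                               # downsampling factor: new rate 50 Hz, new Nyquist 25 Hz
naive = x[::q]                      # subsampling with no low-pass filter
filtered = decimate(x, q)           # low-pass filter first, then downsample


def rms(signal):
    """Root-mean-square amplitude of a 1-D signal."""
    return float(np.sqrt(np.mean(signal ** 2)))


# Locate the dominant frequency of the naively subsampled signal
spectrum = np.abs(np.fft.rfft(naive))
freqs = np.fft.rfftfreq(naive.size, d=q / fs)
alias_freq = float(freqs[np.argmax(spectrum)])

print(alias_freq, rms(naive), rms(filtered))
```

The 90 Hz tone cannot be represented at a 50 Hz sampling rate, so without filtering it reappears as a spurious 10 Hz component; filtering first removes most of its energy instead.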

Quick Reference — Metric Selector#

Use this table to choose the right metric for your problem:

Situation	Avoid	Use Instead	If Multiclass	If Imbalanced
Balanced binary classification	—	F1, AUROC	Macro F1, OvR AUROC	Macro F1
Severely imbalanced binary	Accuracy	PR-AUC, F1	Macro F1	PR-AUC
All classes equally important	Micro avg.	Macro avg.	Macro AUROC	Macro AUROC
Overall sample-level performance	Macro avg.	Micro avg.	Micro F1	Check macro too
Probability quality (not just ranking)	AUROC alone	AUROC + Brier Score	Calibration curve	Brier Score
Fairness audit required	Global accuracy	Group-level TPR/FPR	Equal Opportunity	Equalized Odds

Sources#

The following sources were consulted in preparing this page. All links were verified as of the documentation build date.

Core API & Framework Documentation

scikit-learn — sklearn.metrics module reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
imbalanced-learn — SMOTE and sampling API reference: https://imbalanced-learn.org/stable/references/index.html
SciPy — scipy.signal for digital signal processing: https://docs.scipy.org/doc/scipy/reference/signal.html

Authoritative Papers & Textbooks

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4), 387–415. https://doi.org/10.1016/0022-2496(75)90001-2
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186. https://doi.org/10.1023/A:1010920819831
Chawla, N. V. et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Chouldechova, A. (2017). Fair prediction with disparate impact. Big Data, 5(2), 153–163. https://doi.org/10.1089/big.2016.0047
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the IRE, 37(1), 10–21. https://doi.org/10.1109/jrproc.1949.232969

Learning Resources

Google Machine Learning Crash Course — Classification: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
insightful-data-lab.com — Terminology category (source of context and domain framing for this page): https://insightful-data-lab.com/category/00terminology/
scikit-plots documentation — Metrics API: https://scikit-plots.github.io/dev/apis/index.html

Tags: purpose: reference · domain: statistics · model-type: classification · level: beginner · level: intermediate · level: advanced