📊 Data Preparation & Analysis
Building, scoring and trusting predictive models
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC

Data Preparation & Analysis#

This hub covers the applied predictive-modelling workflow: framing a prediction problem, fitting a model, and (most importantly for scikit-plots) evaluating it with the right chart for the right question. It is the practical companion to the Terminology reference: terminology defines the metrics, while this page shows the workflow that produces and reads them.

It is written for three readers at once:

  • newcomers who want the intuition behind model evaluation;

  • practitioners choosing between ROC, lift, gains and threshold tuning;

  • reviewers who need diagnostics (residuals, outliers) before shipping.

Note

Detail is collapsed by default. Open the dropdown for a term, follow the See also cross-links to wander related ideas, and use Ctrl + F or the Sphinx search to jump straight to a topic. Every code snippet uses a real scikitplot / scikit-learn call.


Discovery at a Glance#

What a predictive model is, and what “good” means.

🧭 What is a Prediction Model?

Inputs → learned mapping → scored output, and the train / validate / test discipline that keeps it honest.

What is a Prediction Model?

✔️ Assessing Model Quality

Discrimination vs. calibration vs. business value — three different questions, three different checks.

Assessing the Quality of a Prediction Model

🔀 Binary vs. Nominal Targets

How the target’s shape (two classes vs. unordered many) changes which metrics apply.

Binary vs. Nominal (Multiclass) Targets

The everyday toolkit for scoring classifiers.

📈 ROC & AUC

Ranking quality across every threshold at once — and how to plot it in scikit-plots.

ROC Curve & AUC

🎚️ Threshold Optimization

Turning scores into decisions: choosing the cut-off that matches your cost trade-off.

Threshold Optimization

📊 Gains, Lift & Deciles

“If I contact the top 20 %, how much better than random?” — the campaign manager’s metric.

Cumulative Gains, Lift & Deciles

Interpretable models and what to check before trusting them.

🌳 Decision Trees & CART

Piecewise, rule-based models that capture interactions and explain themselves.

Decision Trees & CART (Interactions, Piecewise Structure)

🔬 Residuals & Outliers

Studentized residuals to find the points your model cannot explain.

Residual Diagnostics & Outliers

🧩 Explaining Clusters

Using a tree to turn an opaque clustering into human-readable rules.

Explaining Clustering Results with a Tree

Part 1 — Prediction Models & What “Good” Means#

Before any chart, fix the question: what are we predicting, and how will we know the model is any good?

What is a Prediction Model?#

What is it?

A prediction (or supervised) model learns a mapping from input features \(X\) to a target \(y\) from labelled examples, so that it can score new, unseen inputs. Classification predicts a category; regression predicts a number.

The honesty discipline

Performance must be measured on data the model never saw during fitting. The standard split is train → validation (for tuning) → test (for the final, untouched estimate):

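from sklearn.model_selection import train_test_split

# X, y: feature matrix and labels, assumed to be loaded already
# First split: 60 % train, 40 % held out for validation + test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
# Second split: halve the held-out part into 20 % validation, 20 % test
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)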

When to use it — any task where past labelled outcomes can guide future decisions (churn, fraud, response, risk).

Assessing the Quality of a Prediction Model#

Three independent questions

A model can be strong on one axis and weak on another, so check all three:

  • Discrimination — does it rank positives above negatives? (ROC-AUC, gains, lift).

  • Calibration — do predicted probabilities match observed frequencies? (reliability curve, Brier score).

  • Business value — does acting on it beat the baseline at your operating point? (lift at the contacted fraction, profit curve).
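
The three checks above can be computed side by side. The following is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are installed; the dataset, the logistic-regression model and the 20 % contacted fraction are illustrative assumptions rather than part of this page's workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative data only)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]          # P(y = 1) for each test row

# 1. Discrimination: are positives ranked above negatives?
auc = roc_auc_score(y_te, p)

# 2. Calibration: mean squared gap between probability and outcome (lower is better)
brier = brier_score_loss(y_te, p)

# 3. Business value: response rate in the top 20 % relative to the overall rate
k = int(0.2 * len(p))                        # assumed contact budget
top = np.argsort(-p)[:k]                     # indices of the k highest scores
lift_20 = y_te[top].mean() / y_te.mean()

print(f"AUC={auc:.3f}  Brier={brier:.3f}  lift@20%={lift_20:.2f}")
```

A model can score well on one of these numbers and poorly on another, which is why the three are reported together.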

scikit-plots connection

import scikitplot as skplt

# One call renders confusion matrix + ROC + PR for a quick read
skplt.metrics.plot_classifier_eval(y_true, y_pred, y_probas)

Binary vs. Nominal (Multiclass) Targets#

Binary — exactly two outcomes (positive / negative). The full confusion-matrix vocabulary (TP/FP/FN/TN) and threshold tuning apply directly.

Nominal / multiclass — three or more unordered categories. Each metric must be averaged across classes (macro / micro / weighted), and ROC-AUC is computed One-vs-Rest or One-vs-One.

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))   # per-class + averages
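
To make the averaging choices concrete, here is a small sketch, assuming scikit-learn is installed. It uses the bundled iris data and a logistic regression purely as placeholders, and computes the macro, micro and weighted F1 plus One-vs-Rest and One-vs-One ROC-AUC.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Three unordered classes (placeholder dataset)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)            # shape (n_samples, n_classes)

# macro: unweighted mean over classes; weighted: mean weighted by class support;
# micro: pooled over all decisions (equals accuracy for single-label problems)
macro_f1 = f1_score(y_te, y_pred, average="macro")
weighted_f1 = f1_score(y_te, y_pred, average="weighted")
micro_f1 = f1_score(y_te, y_pred, average="micro")

# Multiclass ROC-AUC needs the full probability matrix and an explicit scheme
auc_ovr = roc_auc_score(y_te, y_proba, multi_class="ovr")
auc_ovo = roc_auc_score(y_te, y_proba, multi_class="ovo")

print(f"F1 macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f}")
print(f"AUC OvR={auc_ovr:.3f} OvO={auc_ovo:.3f}")
```

Macro averaging treats every class equally, so it is the one that exposes weak performance on rare classes.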

Part 2 — Classification Evaluation#

The core loop: score → rank → choose a threshold → read the trade-off.

ROC Curve & AUC#

What is it?

The ROC curve plots True Positive Rate against False Positive Rate as the decision threshold sweeps from 1 to 0. The AUC (area under it) summarises ranking quality in a single number:

\[\text{AUC} = P\big(\hat{s}(x^{+}) > \hat{s}(x^{-})\big)\]

i.e. the probability a random positive is scored above a random negative. 0.5 = random, 1.0 = perfect.
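
The probabilistic reading can be checked directly by comparing every positive with every negative. The sketch below assumes only NumPy; `pairwise_auc` is a hypothetical helper name, and counting ties as one half is a convention chosen here so the result agrees with the usual trapezoidal AUC.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```

The all-pairs comparison is quadratic in the sample size, so it is only meant for illustration on small inputs.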

scikit-plots connection

import scikitplot as skplt
skplt.metrics.plot_roc(y_true, y_probas)

When to use it — ranking/threshold-free comparison. For imbalanced problems, read it alongside Precision–Recall / lift, which are more sensitive to the minority class.

Threshold Optimization#

The problem

A classifier outputs a score; a decision needs a cut-off. The default 0.5 is rarely optimal — the right threshold depends on the relative cost of false positives vs. false negatives.

A cost-aware choice

Pick the threshold \(t\) that minimises expected cost:

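\[t^{*} = \arg\min_{t}\; C_{FP}\,\text{FP}(t) + C_{FN}\,\text{FN}(t)\]

When explicit costs are not available, a common proxy is the threshold that maximises F1, which weights precision and recall equally:

import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thr = precision_recall_curve(y_true, y_score)
f1 = 2 * prec * rec / (prec + rec + 1e-12)   # small constant avoids 0/0
# prec and rec have one more entry than thr, so drop the last point
best_t = thr[np.nanargmax(f1[:-1])]   # threshold maximising F1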
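
The cost-based objective itself can be minimised by brute force over the observed scores. This is a minimal sketch assuming only NumPy; `min_cost_threshold` and the cost values are hypothetical, and a case is predicted positive when its score is at or above the threshold.

```python
import numpy as np


def min_cost_threshold(y_true, y_score, c_fp=1.0, c_fn=5.0):
    """Return (threshold, cost) minimising c_fp * FP(t) + c_fn * FN(t)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Candidates: every observed score, plus +inf for "predict nothing positive"
    candidates = np.append(np.unique(y_score), np.inf)
    costs = []
    for t in candidates:
        y_hat = y_score >= t
        fp = np.sum(y_hat & (y_true == 0))    # negatives flagged as positive
        fn = np.sum(~y_hat & (y_true == 1))   # positives that were missed
        costs.append(c_fp * fp + c_fn * fn)
    best = int(np.argmin(costs))              # first minimum if several tie
    return float(candidates[best]), float(costs[best])


y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

# Missing a positive is 5x as costly as a false alarm: the cut-off moves down
t_star, cost = min_cost_threshold(y_true, y_score, c_fp=1.0, c_fn=5.0)
print(t_star, cost)  # 0.35 1.0
```

Swapping the two costs on the same data moves the chosen threshold up to 0.7, which shows how strongly the cut-off depends on the cost ratio rather than on the model alone.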

When to use it — whenever a model’s output drives an action with asymmetric consequences (medical screening, fraud holds, mailing cost).


Part 3 — Gains, Lift & Decile Analysis#

The “how much better than random, at the fraction I can afford to act on” family — a particular strength of scikit-plots’ decile plots.

Cumulative Gains, Lift & Deciles#

What is it?

Rank all cases by predicted score, descending, and bin into deciles (top 10 %, next 10 %, …).

  • Cumulative gains — the share of all true positives captured by the top k % of ranked cases.

  • Lift — gains divided by the baseline (random) rate:

\[\text{Lift}(k) = \frac{\text{response rate in top } k\%} {\text{overall response rate}}\]

A lift of 3 at the top decile means that group responds at 3× the overall rate, which is exactly the question behind targeted campaigns.
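
The decile table behind these curves can be built by hand. The sketch below assumes only NumPy; `decile_table` is a hypothetical helper, not the scikit-plots implementation, and it reports cumulative gains and cumulative lift as defined above.

```python
import numpy as np


def decile_table(y_true, y_score, n_bins=10):
    """Rows of (bin, cumulative gain, cumulative lift), best-scored bin first."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    bins = np.array_split(y_true[order], n_bins)   # near-equal bins, top scores first
    total_pos = y_true.sum()
    base_rate = y_true.mean()
    rows, cum_pos, cum_n = [], 0, 0
    for i, b in enumerate(bins, start=1):
        cum_pos += b.sum()
        cum_n += len(b)
        gain = cum_pos / total_pos                 # share of all positives captured so far
        lift = (cum_pos / cum_n) / base_rate       # response rate so far vs. overall rate
        rows.append((i, float(gain), float(lift)))
    return rows


# 20 cases with 4 positives, most of them near the top of the ranking
y_score = np.linspace(1.0, 0.0, 20)
y_true = np.zeros(20, dtype=int)
y_true[[0, 1, 2, 5]] = 1

rows = decile_table(y_true, y_score)
for decile, gain, lift in rows[:3]:
    print(f"decile {decile}: gain={gain:.2f} lift={lift:.2f}")
```

In this toy ranking the top decile holds half of all positives, so its lift is 5; by the last decile the cumulative lift falls back to 1 by construction.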

scikit-plots connection

import scikitplot as skplt

skplt.metrics.plot_cumulative_gain(y_true, y_probas)
skplt.metrics.plot_lift_curve(y_true, y_probas)
skplt.metrics.plot_ks_statistic(y_true, y_probas)   # max separation
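
The "max separation" read off the KS plot is the largest gap between the cumulative score distributions of the two classes. The sketch below shows that computation with NumPy only; `ks_statistic` is a hypothetical helper written for illustration and is not claimed to match the scikit-plots internals.

```python
import numpy as np


def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of negatives and positives."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    # Empirical CDF at t: share of the class with score <= t
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / len(neg)
    gaps = np.abs(cdf_neg - cdf_pos)
    i = int(np.argmax(gaps))
    return float(gaps[i]), float(thresholds[i])


y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

ks, at_score = ks_statistic(y_true, y_score)
print(ks, at_score)  # 0.75 0.3
```

Here 75 % of the negatives but none of the positives score at or below 0.3, so that is where the two classes are furthest apart.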

When to use it — ranked-action problems with a budget: direct mail, retention offers, lead scoring, collections.


Part 4 — Decision Trees & Diagnostics#

Interpretable models, and the residual checks that reveal where any model breaks down.

Decision Trees & CART (Interactions, Piecewise Structure)#

What is it?

A CART (Classification And Regression Tree) recursively splits the feature space into axis-aligned regions, predicting a constant in each leaf. It is therefore a piecewise-constant model that captures interactions automatically: a split on one feature changes which splits matter below it.

Splits are chosen to reduce impurity — Gini for classification:

\[G = \sum_{c} p_c\,(1 - p_c)\]
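
As a worked instance of the formula, the sketch below computes the Gini impurity of a node and the size-weighted impurity of a candidate split, using only the standard library. The function names are hypothetical and the labels are toy data.

```python
from collections import Counter


def gini_impurity(labels):
    """G = sum over classes of p_c * (1 - p_c) for the labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def split_impurity(left, right):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


parent = ["a"] * 5 + ["b"] * 5
print(gini_impurity(parent))                    # 0.5, the maximum for two classes
print(split_impurity(["a"] * 5, ["b"] * 5))     # 0.0, a perfectly pure split
print(split_impurity(list("aaaab"), list("abbbb")))  # about 0.32, a partial improvement
```

A tree-growing procedure of this kind compares candidate splits by how far they reduce the parent's impurity and keeps the largest reduction.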

scikit-learn

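from sklearn.tree import DecisionTreeClassifier, plot_tree

# Shallow tree with large leaves to limit overfitting
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50)
tree.fit(X_tr, y_tr)
# cols: list of feature names, one per column of X_tr
plot_tree(tree, filled=True, feature_names=cols)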

When to use it — when an interpretable, rule-based model and explicit interactions matter more than squeezing out the last point of accuracy. Control depth / leaf size to avoid overfitting.

Explaining Clustering Results with a Tree#

The idea

Clustering (e.g. k-means) produces group labels but no explanation. Treat those cluster labels as a target and fit a shallow decision tree on the original features — the tree’s splits become a human-readable description of what makes each cluster different.

from sklearn.tree import DecisionTreeClassifier

explainer = DecisionTreeClassifier(max_depth=3)
explainer.fit(X, cluster_labels)        # labels from KMeans, etc.
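
An end-to-end version of the same idea, as a sketch assuming scikit-learn is installed: the blob data, the three cluster centres and the feature names `spend` and `tenure` are all made up for illustration. It also reports how faithfully the tree reproduces the cluster labels, which is worth checking before presenting the rules.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Three well-separated synthetic groups in two features (illustrative only)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 0], [0, 6]], cluster_std=0.8, random_state=0
)
feature_names = ["spend", "tenure"]          # hypothetical column names

# Step 1: cluster without labels
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit a shallow tree that predicts the cluster from the original features
explainer = DecisionTreeClassifier(max_depth=3, random_state=0)
explainer.fit(X, cluster_labels)

# Fidelity: share of points whose cluster the tree reproduces
fidelity = explainer.score(X, cluster_labels)
print(f"fidelity: {fidelity:.2f}")

# Step 3: print the splits as nested if/else rules
print(export_text(explainer, feature_names=feature_names))
```

If the fidelity is low, the rules describe the tree rather than the clustering, and a deeper tree or different features may be needed.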

When to use it — segmentation deliverables where stakeholders need “Cluster 2 = high spend, low tenure” rather than centroid coordinates.

Residual Diagnostics & Outliers#

What is it?

A residual is the gap between observed and predicted value, \(e_i = y_i - \hat{y}_i\). Studentized residuals rescale each residual by its estimated standard deviation so they are comparable; points with \(|e_i^{\text{stud}}| > 3\) are candidate outliers the model cannot explain.
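
For a linear least-squares fit, one common form is the internally studentized residual, \(r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})\), where \(h_{ii}\) is the leverage of point \(i\). The sketch below assumes that form and an ordinary least-squares model with intercept, using only NumPy; `studentized_residuals` is a hypothetical helper and the data are synthetic with one planted outlier.

```python
import numpy as np


def studentized_residuals(X, y):
    """Internally studentized residuals of an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Leverage h_ii: diagonal of the hat matrix A (A^T A)^-1 A^T
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    n, p = A.shape
    sigma2 = resid @ resid / (n - p)                 # residual variance estimate
    return resid / np.sqrt(sigma2 * (1.0 - h))


rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)
y[25] += 10.0                                        # planted outlier

r = studentized_residuals(x.reshape(-1, 1), y)
flagged = np.flatnonzero(np.abs(r) > 3)
print(flagged)                                       # indices with |r_i| > 3
```

The cut-off of 3 is a screening rule, so flagged points are candidates for inspection rather than automatic deletions.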

scikit-plots connection

import scikitplot as skplt
skplt.api.metrics.plot_residuals_distribution(y_true, y_pred)

When to use it — after fitting any regression (or probability) model, to check for structure, heteroscedasticity, and influential outliers before trusting predictions.


Map to scikit-plots Examples#

Worked, runnable galleries for the workflow above (verified links):

Classifier evaluation

Confusion matrix, ROC and PR in one figure.

https://scikit-plots.github.io/dev/auto_examples/classification/plot_classifier_eval_script.html

Cumulative gains

Share of positives captured by the top deciles.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_cumulative_gain_script.html

Lift curve

Improvement over random at each decile.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_lift_script.html

KS statistic

Maximum class separation along the ranked score.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_ks_statistic_script.html

modelplotpy

Business-facing gains / lift / response reports.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_modelplotpy_script.html

Residuals distribution

Residual and Q–Q diagnostics for fitted models.

https://scikit-plots.github.io/dev/auto_examples/regression/plot_residuals_distribution_script.html

Sources#

Verified during preparation of this page; links were resolvable at the documentation build date.

Source context (framing only, re-expressed in our own words)

Official documentation (API calls used above)

scikit-plots (this project)

Standard references

  • James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning (free): https://www.statlearning.com/

  • Breiman, Friedman, Olshen & Stone, Classification and Regression Trees (CART), 1984.

Tags: purpose: reference; domain: statistics; model-type: classification; model-workflow: model evaluation; level: beginner; level: intermediate; level: advanced