📊 Data Preparation & Analysis
Building, scoring and trusting predictive models
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC

Data Preparation & Analysis

This hub covers the applied predictive-modelling workflow: framing a prediction problem, fitting a model, and, most importantly for scikit-plots, evaluating it with the right chart for the right question. It is the practical companion to the Terminology reference: the terminology pages define the metrics, and this page shows the workflow that produces and reads them.

It is written for three readers at once:

newcomers who want the intuition behind model evaluation;
practitioners choosing between ROC, lift, gains and threshold tuning;
reviewers who need diagnostics (residuals, outliers) before shipping.

Note

Detail is collapsed by default. Open the dropdown for a term, follow the See also cross-links to explore related ideas, and use Ctrl + F or the Sphinx search to jump straight to a topic. Every code snippet uses a real scikitplot / scikit-learn call.

Discovery at a Glance

🟢 Start Here — Foundations

What a predictive model is, and what “good” means.

🧭 What is a Prediction Model?

Inputs → learned mapping → scored output, and the train / validate / test discipline that keeps it honest.

What is a Prediction Model?

✔️ Assessing Model Quality

Discrimination vs. calibration vs. business value — three different questions, three different checks.

Assessing the Quality of a Prediction Model

🔀 Binary vs. Nominal Targets

How the target’s shape (two classes vs. unordered many) changes which metrics apply.

Binary vs. Nominal (Multiclass) Targets

🔵 Core — Evaluation

The everyday toolkit for scoring classifiers.

📈 ROC & AUC

Ranking quality across every threshold at once — and how to plot it in scikit-plots.

ROC Curve & AUC

🎚️ Threshold Optimization

Turning scores into decisions: choosing the cut-off that matches your cost trade-off.

Threshold Optimization

📊 Gains, Lift & Deciles

“If I contact the top 20 %, how much better than random?” — the campaign manager’s metric.

Cumulative Gains, Lift & Deciles

🔴 Advanced — Models & Diagnostics

Interpretable models and what to check before trusting them.

🌳 Decision Trees & CART

Piecewise, rule-based models that capture interactions and explain themselves.

Decision Trees & CART (Interactions, Piecewise Structure)

🔬 Residuals & Outliers

Studentized residuals to find the points your model cannot explain.

Residual Diagnostics & Outliers

🧩 Explaining Clusters

Using a tree to turn an opaque clustering into human-readable rules.

Explaining Clustering Results with a Tree

Part 1 — Prediction Models & What “Good” Means

Before any chart, fix the question: what are we predicting, and how will we know the model is any good?
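As a minimal sketch of that question in code, the snippet below holds out validation and test sets and then asks two of the quality questions separately: discrimination (ROC AUC) and probability accuracy (Brier score). The synthetic dataset, the roughly 60 / 20 / 20 split and the logistic-regression model are illustrative assumptions, not a recommendation from this page.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (assumption: stands in for real data).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Set the test data aside first, then carve a validation set from the rest.
# The test set is not touched again until the final check.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]  # P(class 1) per row

# Discrimination: are positives ranked above negatives?
val_auc = roc_auc_score(y_val, val_scores)

# Probability accuracy: mean squared error of the predicted probabilities
# (lower is better; sensitive to miscalibration).
val_brier = brier_score_loss(y_val, val_scores)

print(f"validation AUC = {val_auc:.3f}, Brier score = {val_brier:.3f}")
```

A model can score well on one of these and poorly on the other, which is why the two are reported side by side rather than folded into a single number.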

Part 2 — Classification Evaluation

The core loop: score → rank → choose a threshold → read the trade-off.
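The four steps of that loop can be sketched with plain scikit-learn calls. The threshold rule used here, Youden's J (maximise TPR minus FPR), is only one possible choice and assumes false positives and false negatives cost the same; the data and model are again illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for a real scoring problem).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 1. Score: predicted probability of the positive class.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# 2. Rank: summarise ranking quality over all thresholds.
auc = roc_auc_score(y_val, scores)
fpr, tpr, thresholds = roc_curve(y_val, scores)

# 3. Choose a threshold: Youden's J = TPR - FPR (equal-cost assumption).
best = int(np.argmax(tpr - fpr))
threshold = thresholds[best]

# 4. Read the trade-off at that threshold.
y_pred = (scores >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

print(f"AUC = {auc:.3f}, threshold = {threshold:.3f}")
print(f"TPR = {tp / (tp + fn):.3f}, FPR = {fp / (fp + tn):.3f}")
```

When the two error types have different costs, step 3 would instead minimise an expected-cost function over the same `thresholds` array; the threshold should be chosen on validation data, not on the final test set.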

Part 3 — Gains, Lift & Decile Analysis

The “how much better than random, at the fraction I can afford to act on” family — a particular strength of scikit-plots’ decile plots.
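To make the arithmetic behind these charts concrete, here is a hand-rolled sketch of a decile table computed with NumPy. It is not the scikit-plots implementation; the data, the model and the choice of ten equal-size groups are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for a campaign-response dataset).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# Rank cases from highest to lowest score, then cut into 10 groups.
order = np.argsort(-scores)
deciles = np.array_split(y_val[order], 10)

positives = np.array([d.sum() for d in deciles])  # positives per decile
sizes = np.array([len(d) for d in deciles])       # cases per decile

# Cumulative gain: share of all positives captured so far.
cum_gain = np.cumsum(positives) / y_val.sum()
# Share of the population acted on so far.
cum_frac = np.cumsum(sizes) / len(y_val)
# Cumulative lift: how many times better than random targeting.
cum_lift = cum_gain / cum_frac

for i in range(10):
    print(f"decile {i + 1:2d}: gain = {cum_gain[i]:.2f}, lift = {cum_lift[i]:.2f}")
```

Reading the second row answers the "top 20 %" question directly: the gain is the fraction of all positives reached, and the lift is that fraction divided by 0.2. By construction both curves end at gain 1.0 and lift 1.0 once everyone is contacted.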

Part 4 — Decision Trees & Diagnostics

Interpretable models, and the residual checks that reveal where any model breaks down.
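Both halves of this part can be sketched briefly. The first half fits a shallow tree and prints its rules with scikit-learn's `export_text`; the second computes internally studentized residuals for a linear fit by hand with NumPy. The iris data, the injected outlier at index 10 and the cut-off of 3 are illustrative assumptions, not values prescribed by this page.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# --- Interpretable model: a depth-2 tree printed as if/else rules ---
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# --- Residual check: internally studentized residuals ---
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)
y[10] += 15.0  # injected outlier (assumption, so there is something to find)

reg = LinearRegression().fit(x.reshape(-1, 1), y)
resid = y - reg.predict(x.reshape(-1, 1))

# Leverage h_ii = diagonal of the hat matrix, via QR of the design matrix.
design = np.column_stack([np.ones_like(x), x])  # intercept + slope
q, _ = np.linalg.qr(design)
leverage = (q ** 2).sum(axis=1)

# r_i = e_i / (s * sqrt(1 - h_ii)), with s^2 = RSS / (n - p).
n, p = design.shape
sigma = np.sqrt((resid ** 2).sum() / (n - p))
studentized = resid / (sigma * np.sqrt(1.0 - leverage))

# Flag points the model cannot explain (|r| > 3 is a common rule of thumb).
flagged = np.flatnonzero(np.abs(studentized) > 3)
print("flagged indices:", flagged)
```

Dividing by `sqrt(1 - leverage)` matters because high-leverage points pull the fit toward themselves and so show deceptively small raw residuals; studentizing puts all points on a comparable scale before flagging.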

Map to scikit-plots Examples

Worked, runnable galleries for the workflow above (verified links):

Classifier evaluation

Confusion matrix, ROC and PR in one figure.

https://scikit-plots.github.io/dev/auto_examples/classification/plot_classifier_eval_script.html

ROC curve

Per-class and averaged ROC with AUC.

https://scikit-plots.github.io/dev/auto_examples/classification/plot_roc_script.html

Precision–Recall

The imbalance-aware companion to ROC.

https://scikit-plots.github.io/dev/auto_examples/classification/plot_precision_recall_script.html

Cumulative gains

Share of positives captured by the top deciles.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_cumulative_gain_script.html

Lift curve

Improvement over random at each decile.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_lift_script.html

KS statistic

Maximum class separation along the ranked score.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_ks_statistic_script.html

modelplotpy

Business-facing gains / lift / response reports.

https://scikit-plots.github.io/dev/auto_examples/decile/plot_modelplotpy_script.html

Residuals distribution

Residual and Q–Q diagnostics for fitted models.

https://scikit-plots.github.io/dev/auto_examples/regression/plot_residuals_distribution_script.html

Sources

Verified during preparation of this page; links were resolvable at the documentation build date.

Source context (framing only, re-expressed in our own words)

Data Preparation and Analysis category (56 posts): https://insightful-data-lab.com/category/data-preparation-and-analysis/

Official documentation (API calls used above)

scikit-learn — model evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
scikit-learn — decision trees: https://scikit-learn.org/stable/modules/tree.html
scikit-learn — train/test splitting: https://scikit-learn.org/stable/modules/cross_validation.html

scikit-plots (this project)

Example gallery: https://scikit-plots.github.io/dev/auto_examples/index.html
API reference: https://scikit-plots.github.io/dev/apis/index.html
Terminology reference: terminology-index

Standard references

James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning (free): https://www.statlearning.com/
Breiman, Friedman, Olshen & Stone, Classification and Regression Trees (CART), 1984.

Tags: purpose: reference, domain: statistics, model-type: classification, model-workflow: model evaluation, level: beginner, level: intermediate, level: advanced