Introduction to modelplotpy#
To install the latest version (with pip):
pip install --upgrade scikit-plots
This exercise uses the ModelPlotPy class and belongs to the modelplotpy and modelplotpy financial sections.
A tutorial exercise example: Predictive models from sklearn on the Bank Marketing Data Set
This example is based on a publicly available dataset, the Bank Marketing Data Set, one of the most popular datasets on the UCI Machine Learning Repository.
The data set comes from a Portuguese bank and deals with a frequently posed marketing question: whether a customer did or did not acquire a term deposit, a financial product. Four datasets are available; we use bank-additional-full.csv, which contains 41,188 customers and 21 columns.
To illustrate how to use modelplotpy, let’s say that we work for this bank and our marketing colleagues have asked us to help to select the customers that are most likely to respond to a term deposit offer. For that purpose, we will develop a predictive model and create the plots to discuss the results with our marketing colleagues. Since we want to show you how to build the plots, not how to build a perfect model, we’ll use six of these columns in our example.
Here’s a short description of the data we use:
y: has the client subscribed to a term deposit?
duration: last contact duration, in seconds (numeric)
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign
previous: number of contacts performed before this campaign and for this client (numeric)
euribor3m: euribor 3 month rate
Let’s load the data and have a quick look at it:
# Authors: The scikit-plots developers
# SPDX-License-Identifier: BSD-3-Clause
Loading the dataset#
import io
import os
import zipfile

import warnings

warnings.filterwarnings("ignore")

import numpy as np

np.random.seed(0)  # reproducibility

import requests
import pandas as pd

# You can change the path; by default the data is written to the working directory
path = os.getcwd()

# r = requests.get(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
# )
# The source at uci.edu is not always available,
# so we download a copy from our own repository instead.
r = requests.get("https://modelplot.github.io/img/bank-additional.zip")
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(path)

# Load csv data
bank = pd.read_csv(
    os.path.join(path, "bank-additional", "bank-additional-full.csv"), sep=";"
)

# select the 6 columns (copy so the rename below edits an independent frame)
bank = bank[["y", "duration", "campaign", "pdays", "previous", "euribor3m"]].copy()

# rename target class value 'yes' for better interpretation
# (.loc avoids pandas chained assignment, which may silently fail to update)
bank.loc[bank.y == "yes", "y"] = "term deposit"

# dimensions of the data
print(bank.shape)

# show the first rows of the dataset
print(bank.head())
(41188, 6)
    y  duration  campaign  pdays  previous  euribor3m
0  no       261         1    999         0      4.857
1  no       149         1    999         0      4.857
2  no       226         1    999         0      4.857
3  no       151         1    999         0      4.857
4  no       307         1    999         0      4.857
Train models on the bank dataset#
On this data, we apply some predictive modeling techniques from the sklearn module. This well-known library implements many predictive modeling techniques, such as logistic regression, random forest and many, many others. Let’s train a few models to evaluate with our plots.
# to create predictive models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define target vector y
y = bank.y
# define feature matrix X
X = bank.drop("y", axis=1)

# Create the necessary datasets to build models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018
)

# Instantiate and fit a few classification models
clf_rf = RandomForestClassifier().fit(X_train, y_train)
# Note: `multi_class` is deprecated in recent scikit-learn releases;
# drop the argument if your installed version warns about it or rejects it.
clf_mult = LogisticRegression(multi_class="multinomial", solver="newton-cg").fit(
    X_train, y_train
)
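The plots that follow are all built on the predicted probability for the target class. As a minimal sketch of where that probability comes from, the block below fits a logistic regression on made-up data (the features, labels and threshold are assumptions for illustration, not the bank data) and picks the `predict_proba` column that belongs to the `term deposit` class by looking it up in `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): 200 customers, 2 numeric features, string labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(
    X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.0, "term deposit", "no"
)

clf = LogisticRegression().fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in clf.classes_,
# so look up the column of the target class instead of hard-coding it.
target_col = list(clf.classes_).index("term deposit")
proba = clf.predict_proba(X_toy)[:, target_col]

print(clf.classes_)
print(proba[:5])
```

Looking the column up by label keeps the code correct even if the class labels are renamed, as we did with `yes` above.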
Creating the ModelPlotPy object and plotting scope#
For now, we focus on explaining to our marketing colleagues how well our predictive model can help them select customers for their term deposit campaign.
import scikitplot.modelplotpy as mp

# from scikitplot import modelplotpy as mp

obj = mp.ModelPlotPy(
    feature_data=[X_train, X_test],
    label_data=[y_train, y_test],
    dataset_labels=["train_data", "test_data"],
    models=[clf_rf, clf_mult],
    model_labels=["random_forest", "multinomial_logit"],
    ntiles=10,
)

# transform data generated with prepare_scores_and_deciles into aggregated data for chosen plotting scope
ps = obj.plotting_scope(
    select_model_label=["random_forest"], select_dataset_label=["test_data"]
)
Default scope value no_comparison selected, single evaluation line will be plotted.
The label with smallest class is term deposit
Target class term deposit, dataset test_data and model random_forest.
What just happened? First, a ModelPlotPy object is instantiated; then its plotting_scope method specifies the scope of the plots you want to show. In general, there are 3 methods (functions) that can be applied to the ModelPlotPy object, but you don’t have to call each of them yourself since they are chained to each other.
These functions are:
prepare_scores_and_deciles: scores the customers in the train dataset and test dataset with their probability to acquire a term deposit
aggregate_over_deciles: aggregates all scores to deciles and calculates the information to show
plotting_scope: allows you to specify the scope of the analysis.
In the second line of code, we specified the scope of the analysis. We’ve not specified the “scope” parameter, therefore the default - no comparison - is chosen. As the output notes, you can use modelplotpy to evaluate your model(s) from several perspectives:
Interpret just one model (the default)
Compare the model performance across different datasets
Compare the performance across different models
Compare the performance across different target classes
Here, we will keep it simple and evaluate, from a business perspective, how well a selected model will perform in a selected dataset for one target class. We did specify values for some parameters, to focus on the random forest model on the test data. The default value for the target class is term deposit; since we want to focus on customers that do take term deposits, this default is perfect.
Let’s introduce the Gains, Lift and (cumulative) Response plots.#
Although each plot sheds light on the business value of your model from a different angle, they all use the same data:
Predicted probability for the target class
Equally sized groups based on this predicted probability
Actual number of observed target class observations in these groups
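To make these three ingredients concrete, here is a minimal sketch in plain Python that ranks customers by score, cuts them into equally sized groups and counts the actual target-class cases per group. The scores and outcomes are randomly generated toy data, and `decile_table` is a hypothetical helper written for this illustration; it is not part of modelplotpy.

```python
import random

random.seed(0)  # reproducibility of the toy data

# Toy data (assumption): 1,000 customers with a made-up score and outcome.
n_customers = 1000
scores = [random.random() for _ in range(n_customers)]
# Customers with a higher score are made more likely to be actual responders.
actuals = [1 if random.random() < 0.3 * s else 0 for s in scores]


def decile_table(scores, actuals, ntiles=10):
    """Rank by score (best first), cut into equal groups, count positives."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    size = len(ranked) // ntiles  # assumes len(ranked) is divisible by ntiles
    table = []
    for i in range(ntiles):
        group = ranked[i * size:(i + 1) * size]
        table.append(
            {
                "ntile": i + 1,
                "n": len(group),
                "positives": sum(actual for _, actual in group),
            }
        )
    return table


table = decile_table(scores, actuals)
for row in table[:3]:
    print(row)
```

Every plot in this section can be derived from a table of this shape: one row per decile with the group size and the number of observed target-class cases.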
1. Cumulative gains plot#
The cumulative gains plot - often named ‘gains plot’ - helps you answer the question:
When we apply the model and select the best X deciles, what % of the actual target class observations can we expect to target?
# plot the cumulative gains plot and annotate the plot at decile = 3
ax = mp.plot_cumgains(ps, highlight_ntile=3, save_fig=True)

When we select 30% with the highest probability according to model random_forest, this selection holds 93% of all term deposit cases in dataset test_data.
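As a rough sketch of the calculation behind a gains curve, the block below computes cumulative gains from per-decile counts. The counts are hypothetical and chosen for easy arithmetic; they are not the bank data, and this is not necessarily how modelplotpy computes the curve internally.

```python
from itertools import accumulate

# Hypothetical counts (not from the bank data): actual target-class cases
# found in each of 10 equally sized deciles, ordered best decile first.
positives_per_decile = [60, 40, 30, 20, 15, 12, 10, 6, 4, 3]

total_positives = sum(positives_per_decile)  # 200
cumulative_positives = list(accumulate(positives_per_decile))

# Cumulative gains: share of all target-class cases captured up to each decile.
cumulative_gains = [c / total_positives for c in cumulative_positives]

print(f"Gains at decile 3: {cumulative_gains[2]:.0%}")  # 130 / 200 = 65%
```

By construction the curve ends at 100% in the last decile, because selecting everyone captures every target-class case.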
2. Cumulative lift plot#
The cumulative lift plot, often referred to as lift plot or index plot, helps you answer the question:
When we apply the model and select the best X deciles, how many times better is that than using no model at all?
# plot the cumulative lift plot and annotate the plot at decile = 3
ax = mp.plot_cumlift(ps, highlight_ntile=3, save_fig=True)

When we select 30% with the highest probability according to model random_forest in dataset test_data, this selection for target class term deposit is 3.13 times than selecting without a model.
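Cumulative lift can be read as the response rate within the selection divided by the response rate of the whole dataset. The sketch below illustrates that ratio on hypothetical per-decile counts (not the bank data); `cumulative_lift` is a helper defined here for illustration only.

```python
# Hypothetical counts (not from the bank data), best decile first.
positives_per_decile = [60, 40, 30, 20, 15, 12, 10, 6, 4, 3]
customers_per_decile = 100  # assumption: 1,000 customers in 10 equal groups

n_deciles = len(positives_per_decile)
# Overall response rate, i.e. what a random selection would yield on average.
base_rate = sum(positives_per_decile) / (customers_per_decile * n_deciles)


def cumulative_lift(upto_decile):
    """Response rate in deciles 1..upto_decile divided by the overall rate."""
    selected_positives = sum(positives_per_decile[:upto_decile])
    selected_customers = customers_per_decile * upto_decile
    return (selected_positives / selected_customers) / base_rate


print(f"Lift at decile 3: {cumulative_lift(3):.2f}")    # (130 / 300) / 0.20 = 2.17
print(f"Lift at decile 10: {cumulative_lift(10):.2f}")  # 1.00 by construction
```

A lift of 1 means the selection is no better than picking customers at random, which is why every lift curve ends at 1 in the last decile.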
3. Response plot#
One of the easiest to explain evaluation plots is the response plot. It simply plots the percentage of target class observations per decile. It can be used to answer the following business question:
When we apply the model and select decile X, what is the expected % of target class observations in that decile?
# plot the response plot and annotate the plot at decile = 3
ax = mp.plot_response(ps, highlight_ntile=3, save_fig=True)

When we select decile 3 from model random_forest in dataset test_data the percentage of term deposit cases in the selection is 13%.
4. Cumulative response plot#
Finally, one of the most used plots: the cumulative response plot. It answers the question burning on every business rep’s lips:
When we apply the model and select up until decile X, what is the expected % of target class observations in the selection?
# plot the cumulative response plot and annotate the plot at decile = 3
ax = mp.plot_cumresponse(ps, highlight_ntile=3, save_fig=True)

When we select deciles 1 until 3 according to model random_forest in dataset test_data the percentage of term deposit cases in the selection is 35%.
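The difference between the response and the cumulative response is easy to miss: the first looks at one decile in isolation, the second at all deciles selected so far. A small sketch on hypothetical counts (not the bank data) shows both side by side.

```python
# Hypothetical counts (not from the bank data), best decile first.
positives_per_decile = [60, 40, 30, 20, 15, 12, 10, 6, 4, 3]
customers_per_decile = 100  # assumption: 10 equally sized groups

# Response: share of target-class cases within each single decile.
response = [p / customers_per_decile for p in positives_per_decile]

# Cumulative response: share of target-class cases in deciles 1..i together.
cumulative_response = []
running_positives = 0
for i, positives in enumerate(positives_per_decile, start=1):
    running_positives += positives
    cumulative_response.append(running_positives / (customers_per_decile * i))

print(f"Response in decile 3: {response[2]:.0%}")                    # 30 / 100
print(f"Cumulative response, deciles 1-3: {cumulative_response[2]:.1%}")  # 130 / 300
```

This mirrors the bank outputs above, where decile 3 on its own holds 13% term deposit cases while deciles 1 until 3 together hold 35%.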
All four plots together#
With the function call plot_all we get all four plots on one grid. We can easily save it to a file to include it in a presentation or share it with colleagues.
# plot all four evaluation plots and save to file
ax = mp.plot_all(
    ps,
    save_fig=True,
    overwrite=False,
    add_timestamp=True,
    verbose=True,
)

[INFO] Saving path to: /home/circleci/repo/galleries/examples/modelplotpy/result_images/plot_all_20250627_090921Z.png
[INFO] Plot saved to: /home/circleci/repo/galleries/examples/modelplotpy/result_images/plot_all_20250627_090921Z.png
Financial Implications#
To plot the financial implications of implementing a predictive model, modelplotpy provides three additional plots: the Costs & revenues plot, the Profit plot and the ROI plot.
For financial plots, three extra parameters need to be provided: fixed_costs, variable_costs_per_unit and profit_per_unit.
1. Return on investment plot#
The Return on Investment plot plots the cumulative revenues as a percentage of investments up until that decile when the model is used for campaign selection. It can be used to answer the following business question:
When we apply the model and select up until decile X, what is the expected % return on investment of the campaign?
# Return on Investment (ROI) plot
ax = mp.plot_roi(
    ps,
    fixed_costs=1000,
    variable_costs_per_unit=10,
    profit_per_unit=50,
    highlight_ntile=3,
    save_fig=True,
)

When we select decile 1 until 3 from model random_forest in dataset test_data the percentage of term deposit cases in the expected expected return on investment is 70%.
2. Costs & Revenues plot#
The costs & revenues plot plots both the cumulative revenues and the cumulative costs (investments) up until that decile when the model is used for campaign selection. It can be used to answer the following business question:
When we apply the model and select up until decile X, what are the expected revenues and investments of the campaign?
# Costs & Revenues plot, highlighted at decile 3
# (the commented-out line would highlight at max roi instead)
ax = mp.plot_costsrevs(
    ps,
    fixed_costs=1000,
    variable_costs_per_unit=10,
    profit_per_unit=50,
    highlight_ntile=3,
    # highlight_ntile = "max_roi",
    save_fig=True,
)

When we select decile 1 until 3 from model random_forest in dataset test_data the percentage of term deposit cases in the revenue is 64950.
3. Profit plot#
The profit plot visualizes the cumulative profit up until that decile when the model is used for campaign selection. It can be used to answer the following business question:
When we apply the model and select up until decile X, what is the expected profit of the campaign?
# Profit plot, highlighted at a custom ntile instead of at max profit
ax = mp.plot_profit(
    ps,
    fixed_costs=1000,
    variable_costs_per_unit=10,
    profit_per_unit=50,
    highlight_ntile=3,
    save_fig=True,
)

When we select decile 1 until 3 from model random_forest in dataset test_data the percentage of term deposit cases in the expected profit is 26880.
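The three figures reported in this section (a revenue of 64950, a profit of 26880 and a return on investment of 70% for deciles 1 until 3) can be tied together with some simple arithmetic. The definitions in the comments below are an assumed reconstruction that happens to reproduce the reported numbers; they have not been checked against the library source.

```python
# Parameters taken from the plot calls in this section.
fixed_costs = 1000
variable_costs_per_unit = 10
profit_per_unit = 50

# Revenue and profit as reported for deciles 1 until 3 in the outputs above.
reported_revenue = 64950
reported_profit = 26880

# Assumed definitions (a reconstruction, not confirmed from the library):
#   revenue     = profit_per_unit * responders
#   investments = fixed_costs + variable_costs_per_unit * customers_selected
#   profit      = revenue - investments
#   roi         = profit / investments
responders = reported_revenue // profit_per_unit
investments = reported_revenue - reported_profit
customers_selected = (investments - fixed_costs) // variable_costs_per_unit
roi = reported_profit / investments

print(f"responders: {responders}")                  # 1299
print(f"customers selected: {customers_selected}")  # 3707
print(f"investments: {investments}")                # 38070
print(f"ROI: {roi:.1%}")                            # 70.6%
```

Under these assumptions, 1299 responders among 3707 selected customers is a response rate of about 35%, which agrees with the cumulative response reported earlier for deciles 1 until 3. Note that the reported 70% corresponds to profit divided by investments.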
Get more out of modelplotpy: using different scopes#
As we discussed earlier, modelplotpy also enables you to make interesting comparisons using the scope parameter: comparisons between different models, between different datasets and (in case of a multiclass target) between different target classes. Curious? Please have a look at the package documentation or read our other posts on modelplot.
1. compare_models#
To give one example, we could check whether random forest was indeed the best choice to select the top-30% customers for a term deposit offer:
ps2 = obj.plotting_scope(scope="compare_models", select_dataset_label=["test_data"])

# plot the cumulative response plot and annotate the plot at decile = 3
ax = mp.plot_cumresponse(ps2, highlight_ntile=3, save_fig=True)

compare models
The label with smallest class is ['term deposit']
When we select deciles 1 until 3 according to model multinomial_logit in dataset test_data the percentage of term deposit cases in the selection is 33%.
When we select deciles 1 until 3 according to model random_forest in dataset test_data the percentage of term deposit cases in the selection is 35%.
import scikitplot as sp

sp.remove_path()
Seems like the algorithm used will not make a big difference in this case. Hopefully you agree by now that using these plots really can make a difference in explaining the business value of your predictive models!
In case you experience issues when using modelplotpy, please let us know via the issues section on GitHub. For any other feedback or suggestions, please contact pb.marcus or jurriaan.nagelkerke.
Happy modelplotting!