Visualizing Machine Learning Model Performance: Binary Classification

Using Python 3 and the help of a few packages, we will visualize the different performance aspects of a machine learning model built for binary classification. These tools make it easier to compare model performance and help you choose the right one for the task.

The GitHub Repo

You can find all the code used at my model_selection repo on GitHub. There are directions explaining how to use the tools there, but I'll run you through things in more detail here.

Introduction: Why I Made This

At my most recent internship, I was tasked with building a set of tools to help the team better select a model for a binary classification problem. The naive approach is to mindlessly tune the model's hyperparameters and run it against the holdout set to see whether accuracy has improved. But accuracy isn't always the most important metric for this kind of problem; there are other metrics we need to look at in order to make a more informed decision. Later on, I'll give some examples of what to look out for in particular kinds of problems, because you may come across cases where the model with the best accuracy is not actually the best model for the task. While accuracy is a decent indicator of a good model, it is not always sufficient. It's also very important to take a look at the confusion matrix: the false positives, false negatives, true positives, and true negatives. These values will help you understand the kinds of decisions the model is making and give you a better idea of how to improve things.

Setting things up

In this first chunk of code below, we import the helper modules (included in the repo), some visualization libraries, and the machine learning package I'll be using. I'll be using XGBoost, which I have a lot of good things to say about, but anything that can do binary classification will work, even logistic regression. In fact, that's a great place to start testing. More on that later.

import model_utils
import analysis

import csv
import math

import xgboost as xgb
import numpy as np

import matplotlib.pyplot as plt
from matplotlib_venn import venn2
plt.style.use('default')

# dataset location
dataset = "./dataset.csv"

# splitting and normalizing data can also be done with a library like Pandas
features, labels = model_utils.preprocess(dataset, normalize=True)
train_x, train_y, test_x, test_y = model_utils.split_data(features, labels, 85, 15)
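The split above comes from the repo's model_utils module. For a rough idea of what an 85/15 split does under the hood, here's a minimal NumPy sketch (split_85_15 is a hypothetical stand-in, not the repo's implementation):

```python
import numpy as np

def split_85_15(features, labels, seed=0):
    # Shuffle row indices, then cut at the 85% mark.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = round(len(features) * 0.85)
    train, test = idx[:cut], idx[cut:]
    return features[train], labels[train], features[test], labels[test]

X = np.arange(200).reshape(100, 2)   # 100 toy rows, 2 features
y = np.arange(100) % 2               # alternating 0/1 labels
train_x, train_y, test_x, test_y = split_85_15(X, y)
print(train_x.shape, test_x.shape)   # (85, 2) (15, 2)
```

Shuffling before splitting matters: if the CSV happens to be sorted by label, a straight cut would put almost all of one class in the training set.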

Visualizing the Confusion Matrix

We are ready to start testing our model. This next block will train the XGBoost model on the dataset. After some rounds of gradient boosting (the documentation gives a great introduction if you're interested), we'll be ready to look at the visualizations and understand how the model is performing.

# sample parameters that may need to be tuned
param = {'max_depth': 9, 'silent': 1, 'objective': 'binary:logistic'}

num_round = 20

# read in data
dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x, label=test_y)

best = xgb.train(param, dtrain, num_round)
# make predictions (binary:logistic outputs probabilities)
estimated = (best.predict(dtest)).reshape(len(test_x), 1)

# now we can call the method to show performance
analysis.performance(estimated, test_y, verbose=True, visualize=True)


Accuracy:  0.7338888888888889
ZeroR:  0.5305555555555556
Recall:  0.749738219895288
Precision:  0.7489539748953975
Total entries:  1800
True Positive:  716     (40%)
True Negative:  605     (34%)
False Positive:  240    (13%)
False Negative:  239    (13%)
False Positive Rate:  28.40%
False Negative Rate:  25.03%

Making Sense of Things

From the analysis.performance() function call, we get a list of metrics and a nice visualization to look at. Before we can understand the visual, we first need to understand what goes into it. The Venn-diagram-looking figure is a visualization of the confusion matrix. I highly recommend reading through this guide for a thorough understanding of the confusion matrix, but I'll give you the highlights here:

  • true positives (TP): Model predicted yes/1/True and the classification is correct.
  • true negatives (TN): Model predicted no/0/False and the classification is correct.
  • false positives (FP): Model predicted yes, but the actual classification is no.
  • false negatives (FN): Model predicted no, but the actual classification is yes.
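These four counts can be tallied directly from 0/1 prediction and label arrays. Here's a minimal sketch (confusion_counts is a hypothetical helper, not the repo's code):

```python
import numpy as np

def confusion_counts(predicted, actual):
    # Treat the arrays as booleans and count each quadrant.
    p = np.asarray(predicted).astype(bool)
    a = np.asarray(actual).astype(bool)
    tp = int(np.sum(p & a))      # predicted yes, actually yes
    tn = int(np.sum(~p & ~a))    # predicted no, actually no
    fp = int(np.sum(p & ~a))     # predicted yes, actually no
    fn = int(np.sum(~p & a))     # predicted no, actually yes
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```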

From these values, we can calculate a few more metrics that will help us understand how well our model is doing.

  • Accuracy ((TP + TN) / total): Overall, how often is the model correct?
  • Recall (TP / actual yes): When it's actually yes, how often does the model predict yes?
  • Precision (TP / predicted yes): When the model predicts yes, how often is it correct?
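Plugging the confusion-matrix counts from the run above into these formulas reproduces the printed metrics:

```python
# Counts taken from the analysis.performance() output above.
tp, tn, fp, fn = 716, 605, 240, 239
total = tp + tn + fp + fn        # 1800 entries

accuracy = (tp + tn) / total     # (716 + 605) / 1800
recall = tp / (tp + fn)          # TP / actual yes
precision = tp / (tp + fp)       # TP / predicted yes

print(f"{accuracy:.4f} {recall:.4f} {precision:.4f}")  # 0.7339 0.7497 0.7490
```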

We have these three metrics to look at now. We're accustomed to working with accuracy, but recall and precision are often forgotten. This article does an excellent job at comparing the two, and is the inspiration for the visualizer you see here. As with most concepts in data science and machine learning, there is often a trade-off in the metrics we choose to maximize: increasing recall typically decreases precision, and vice versa, while improving both will improve accuracy. Here's a plain-English example to help you understand what these numbers mean in a practical sense.

In a simple classification problem, if we label every example as yes/1/True, then our recall is 1.0: we correctly classified every positive case. Great, but that doesn't mean much on its own, because we've hurt our precision in the process. Every no case was also labeled yes, so we still have a lot of incorrectly labeled cases. The article linked above walks through the different combinations: what the classifications look like with high recall and low precision, low recall and high precision, and so on. Ideally, we want to maximize both recall and precision, and as a result of these have a maximum accuracy.
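The "label everything yes" scenario can be checked in a few lines (the toy labels below are made up for illustration):

```python
actual = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # 4 positives out of 10
predicted = [1] * len(actual)            # classify everything as yes

tp = sum(p and a for p, a in zip(predicted, actual))
fp = sum(p and not a for p, a in zip(predicted, actual))
fn = sum(not p and a for p, a in zip(predicted, actual))
tn = sum(not p and not a for p, a in zip(predicted, actual))

recall = tp / (tp + fn)              # 1.0 -- every positive is caught...
precision = tp / (tp + fp)           # 0.4 -- ...but most yes calls are wrong
accuracy = (tp + tn) / len(actual)   # 0.4 -- no better than the base rate
```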

Why These Metrics Matter

One of the fundamental truths of machine learning, and data science in general, is that the right approach depends heavily on the problem and solution space. Even within binary classification, tasks like credit card fraud detection and "pass / fail" quality testing call for different solutions. Everything depends on the dataset and the specifics of the problem. One task may want to favor recall, another precision, so understanding these metrics is incredibly important.
