Often, students working on binary (0-1) classification problems would tell me that a particular model or approach “doesn’t work”.  When I asked to see the results of the model (say, on a holdout set), they would show me a confusion matrix where the model predicted 0 for every data point.  Only when I asked what threshold they used, or what the ROC curve (or better yet, the Precision-Recall curve) looked like, would they start to realize that they had missed something important.

One very important aspect of binary classification that is (IMHO) not sufficiently stressed is that there are actually three different problems:

1. Hard Classification – making a firm 0/1 call for each data point in the test set.
2. Ranking Classification – “scoring” each data point, where a higher score means more likely to be a ‘1’ (and thereby ranking the entire test set from most likely to least likely to be a ‘1’).
3. Probability Prediction – assigning to each point a (well-calibrated) probability that it is a ‘1’.
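To make the distinction concrete, here is a minimal sketch in plain Python of what the three kinds of output look like for the same test set. The probabilities are made up purely for illustration:

```python
# Illustrative predicted probabilities for five test points (made up).
probs = [0.92, 0.15, 0.64, 0.38, 0.71]

# Problem 3, probability prediction: the probabilities themselves.

# Problem 2, ranking classification: indices ordered from the point
# most likely to be a '1' down to the least likely.
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

# Problem 1, hard classification: a 0/1 call for each point,
# here obtained by thresholding at 0.5.
hard = [1 if p >= 0.5 else 0 for p in probs]

print(ranking)  # [0, 4, 2, 3, 1]
print(hard)     # [1, 0, 1, 0, 1]
```

Each output answers a different question, even though all three came from the same underlying numbers.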

Note that there is a structure to these three problems.  A probability prediction implies a ranking classification (treating the probability as a score), while a ranking classification implies many possible hard classifications (depending on the choice of a threshold).
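One way to see this structure: a single fixed ranking (or probability vector) yields a different hard classifier for every choice of threshold. A quick sketch, with hypothetical scores:

```python
# Hypothetical model scores for five test points (made up).
scores = [0.92, 0.15, 0.64, 0.38, 0.71]

def harden(scores, threshold):
    """Turn scores into a hard classification at a given threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# The same ranking supports many different hard classifications:
print(harden(scores, 0.3))  # [1, 0, 1, 1, 1]
print(harden(scores, 0.5))  # [1, 0, 1, 0, 1]
print(harden(scores, 0.8))  # [1, 0, 0, 0, 0]
```

Sliding the threshold from 1 down to 0 traces out exactly the family of classifiers summarized by an ROC or Precision-Recall curve.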

Moreover, there are many different metrics, each meaningful for only one of the three problems.  Accuracy measures a hard classification, AUC (ROC) measures and compares ranking classifications, and the log-likelihood of a test set measures a probability prediction.
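As a sketch of how each metric pairs with its problem, here are all three computed from scratch in plain Python. The labels and probabilities are invented for illustration; the AUC is written in its pairwise-comparison form (the fraction of positive/negative pairs the ranking orders correctly), and the log-likelihood is the sum of log predicted probabilities of the true labels:

```python
import math

y_true = [1, 0, 1, 0, 1]            # made-up true labels
probs  = [0.92, 0.15, 0.64, 0.38, 0.71]  # made-up predicted probabilities

# Hard classification -> accuracy (after thresholding at 0.5).
hard = [1 if p >= 0.5 else 0 for p in probs]
accuracy = sum(h == y for h, y in zip(hard, y_true)) / len(y_true)

# Ranking classification -> AUC: fraction of (positive, negative)
# pairs where the positive point gets the higher score.
pos = [p for p, y in zip(probs, y_true) if y == 1]
neg = [p for p, y in zip(probs, y_true) if y == 0]
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

# Probability prediction -> log-likelihood of the test set.
loglik = sum(math.log(p if y == 1 else 1 - p)
             for p, y in zip(probs, y_true))

print(accuracy, auc, loglik)
```

Note that accuracy depends on the threshold you chose, AUC depends only on the ordering of the scores, and the log-likelihood is sensitive to the exact probability values, which is precisely why each one tests a different problem.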

In the next post, we’ll discuss best practices in approaching these three different problems, as well as two common mistakes data scientists make by confusing them.