In most ML methods (including random forests, gradient boosting, logistic regression, and neural networks), the model outputs a score, which yields a "ranking classification". However, two very common mistakes occur in dealing with this score:

  1. Using a "default" threshold of 0.5 automatically to convert the score to a hard classification, rather than examining performance across a range of thresholds (see the first sketch after this list). This is encouraged by sklearn's convention that `model.predict` does precisely the former, while the latter requires the clunkier `model.predict_proba`.
  2. Treating the score directly as a probability, without calibrating it (see the second sketch below). This is patently wrong for models like random forests (where the vote proportion certainly does not indicate the probability of being a '1'), and inaccurate even for logistic regression (where the output purports to be a probability but is often not well calibrated).
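
Here is a minimal sketch of the first point: sweeping over candidate thresholds instead of accepting `model.predict`'s implicit 0.5 cutoff. The dataset, model, and metric choices here are illustrative, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset, purely for illustration.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# model.predict(X_test) silently applies a 0.5 cutoff; instead, take the
# raw scores and evaluate a metric of interest at many thresholds.
scores = model.predict_proba(X_test)[:, 1]
for t in np.linspace(0.1, 0.9, 9):
    preds = (scores >= t).astype(int)
    print(f"threshold={t:.1f}  F1={f1_score(y_test, preds):.3f}")
```

On an imbalanced problem like this one, the threshold that maximizes F1 (or whatever metric matches the actual costs) is frequently nowhere near 0.5.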

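And a sketch of the second point, using sklearn's `CalibratedClassifierCV` to remap raw scores toward true frequencies (isotonic regression here; `method="sigmoid"` is the other common choice). The variable names continue from the sketch above and are assumptions of this example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Wrap a fresh copy of the model so its scores are calibrated via
# cross-validation on the training data.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

raw = model.predict_proba(X_test)[:, 1]
cal = calibrated.predict_proba(X_test)[:, 1]

# The Brier score (mean squared error of the predicted probability) should
# drop after calibration if the raw scores were poorly calibrated.
print("Brier, raw:       ", brier_score_loss(y_test, raw))
print("Brier, calibrated:", brier_score_loss(y_test, cal))
```
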
We’ll dive into these mistakes in more detail in future posts.