As I discussed in an earlier post, one common mistake in classification is to treat an uncalibrated score as a probability. In the latest version of ML-Insights, we provide some simple functionality to use cross-validation and splines to calibrate models. After the calibration, the output of the model can more properly be used as a probability.
Interestingly, even quite accurate models like Gradient Boosting (including XGBoost) benefit substantially from a calibration approach.
Using ML-Insights, just a few short lines of code can greatly improve your performance on probability-based metrics (log loss, for example). It’s as easy as:
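import ml_insights as mli
from sklearn.ensemble import RandomForestClassifier
rfm = RandomForestClassifier(n_estimators=500, class_weight='balanced_subsample', n_jobs=-1)
rfm_calib = mli.SplineCalibratedClassifierCV(rfm)
rfm_calib.fit(X_train, y_train)  # X_train, y_train: your training features and 0/1 labels
test_res_calib = rfm_calib.predict_proba(X_test)[:,1]  # calibrated probability of class 1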
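To check whether calibration actually helped, one option is to compare probability-based metrics on the test set before and after. The sketch below writes out log loss and the Brier score in plain Python so the definitions are visible (sklearn.metrics.log_loss and sklearn.metrics.brier_score_loss compute the same quantities). The labels and probabilities in the usage lines are made-up toy values, not results from the notebooks.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

# Hypothetical toy values: the raw scores are overconfident, including one confident miss.
y_toy = [0, 0, 1, 1]
raw_scores = [0.02, 0.98, 0.98, 0.98]
calibrated = [0.10, 0.67, 0.67, 0.67]

for name, probs in [("raw", raw_scores), ("calibrated", calibrated)]:
    print(name, round(log_loss(y_toy, probs), 3), round(brier_score(y_toy, probs), 3))
```

On real data, the comparison would be between the uncalibrated model's predict_proba(X_test)[:,1] (with the model fitted on the training data) and test_res_calib. Lower is better for both metrics.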
I’ve written a couple of nice Jupyter notebooks that walk through this issue quite carefully. Check them out if you are interested! The Calibration_Example_ICU_MIMIC_Short notebook is best if you want to get right to the point. For a more detailed explanation, look at Calibration_Example_ICU_MIMIC.
Would love to hear any feedback or suggestions on this!
In most ML methods (including random forests, gradient boosting, logistic regression, and neural networks), the model outputs a score, which yields a “ranking classification”. However, there are two very common mistakes that occur in dealing with this score:
- Automatically using a “default” threshold of 0.5 to convert to a hard classification, rather than examining the performance across a range of thresholds. (This is encouraged by sklearn’s convention that “model.predict” does precisely the former, while the latter requires the clunkier “model.predict_proba”.)
- Treating the score directly as a probability, without calibrating it. This is patently wrong when using models like random forests (where the vote proportion certainly does not indicate the probability of being a ‘1’), and inaccurate even in logistic regression (where the output purports to be a probability, but often is not well calibrated).
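One way to see the second mistake concretely is a reliability table: bin the scores, then compare the average score in each bin with the fraction of actual 1s in that bin. For a well-calibrated model the two numbers roughly match. This is a minimal sketch in plain Python; the function name reliability_table and the toy data are illustrative only (sklearn.calibration.calibration_curve offers a similar computation).

```python
def reliability_table(y_true, scores, n_bins=5):
    """Return (mean score, observed fraction of 1s, count) for each non-empty score bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(y_true, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the top bin
        bins[idx].append((s, y))
    rows = []
    for members in bins:
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            rows.append((mean_score, frac_pos, len(members)))
    return rows

# Hypothetical scores: points scored 0.9 turn out to be 1s only half the time.
y_toy = [0, 0, 0, 1, 0, 1, 0, 1]
scores = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]

for mean_score, frac_pos, n in reliability_table(y_toy, scores):
    print(f"mean score {mean_score:.2f}  observed rate {frac_pos:.2f}  n={n}")
```

A large gap between the mean score and the observed rate in a bin is a sign that the score should not be read as a probability without calibration.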
We’ll dive into these mistakes in more detail in future posts.
Often students working on binary (0-1) classification problems would tell me that a particular model or approach “doesn’t work”. When I would ask to see the results of the model (say, on a holdout set), they would show me a confusion matrix where the model predicted 0 for every data point. When I asked what threshold they used, or what the ROC curve (or better yet, Precision-Recall Curve) looked like, only then would they start to realize that they had missed something important.
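The all-zeros confusion matrix often comes from the 0.5 cutoff rather than from a useless model. A small sketch, using made-up scores for an imbalanced problem, shows how the confusion matrix changes as the threshold moves. The helper name confusion_at and the data are illustrative only.

```python
def confusion_at(y_true, scores, threshold):
    """Return (tn, fp, fn, tp) when points with score >= threshold are called 1."""
    tn = fp = fn = tp = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical imbalanced holdout set: 2 positives out of 10, every score below 0.5.
y_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.03, 0.05, 0.05, 0.08, 0.10, 0.12, 0.30, 0.35, 0.45]

for threshold in (0.5, 0.4, 0.3, 0.2):
    print(threshold, confusion_at(y_toy, scores, threshold))
```

In this toy set the two positives have the two highest scores, so the ranking is perfect, yet a 0.5 threshold predicts 0 for every point. Lowering the threshold is what reveals that the model has learned something.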
One very important aspect of binary classification that is (IMHO) not sufficiently stressed is that there are actually three different problems:
- Hard Classification – firmly deciding to make a hard 0/1 call for each data point in the test set.
- Ranking Classification – “scoring” each data point, where a higher score means more likely to be a ‘1’ (and thereby ranking the entire test set from most likely to least likely to be a ‘1’).
- Probability Prediction – assigning to each point a (well-calibrated) probability that it is a ‘1’.
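The difference between the second and third problems can be made concrete: a ranking metric such as AUC depends only on the order of the scores, while a probability metric such as log loss depends on their actual values. The sketch below (plain Python, toy data, illustrative helper names) squares a set of scores, which preserves their order. AUC is unchanged while log loss moves.

```python
import math

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, probs):
    """Mean negative log-likelihood; assumes every probability is strictly between 0 and 1."""
    return -sum(math.log(p if y == 1 else 1 - p) for y, p in zip(y_true, probs)) / len(y_true)

# Hypothetical scores for six points.
y_toy = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
squared = [s ** 2 for s in scores]  # monotone transform: same ranking, different values

print("AUC:     ", auc(y_toy, scores), auc(y_toy, squared))
print("log loss:", round(log_loss(y_toy, scores), 3), round(log_loss(y_toy, squared), 3))
```

A model can therefore be a good ranker and still a poor probability predictor, which is why the three problems deserve separate evaluation.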
I recently finished teaching my first data science bootcamp at Metis. It was a great experience: interacting with really smart and driven students, learning by teaching, and getting all sorts of new ideas about data science.