As I discussed in an earlier post, one common mistake in classification is to treat an uncalibrated score as a probability. In the latest version of ML-Insights, we provide some simple functionality to use cross-validation and splines to calibrate models. After the calibration, the output of the model can more properly be used as a probability.
Interestingly, even quite accurate models like Gradient Boosting (including XGBoost) benefit substantially from a calibration approach.
Using ML-Insights, just a few short lines of code can greatly improve your performance on probability-based metrics. It’s as easy as:
from sklearn.ensemble import RandomForestClassifier
import ml_insights as mli

rfm = RandomForestClassifier(n_estimators=500, class_weight='balanced_subsample', n_jobs=-1)
rfm_calib = mli.SplineCalibratedClassifierCV(rfm)
rfm_calib.fit(X_train, y_train)
test_res_calib = rfm_calib.predict_proba(X_test)[:,1]
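If you want a quick sense of why calibration helps before installing anything, here is a minimal, self-contained sketch of the same idea using only scikit-learn. It substitutes sklearn’s `CalibratedClassifierCV` (Platt/sigmoid scaling) for ML-Insights’ spline-based approach, and uses synthetic data in place of a real dataset, so treat it as an illustration of the general technique rather than the ML-Insights method itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real binary classification dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Raw (uncalibrated) random forest scores
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
raw = rf.predict_proba(X_test)[:, 1]

# Cross-validated calibration wrapper (sigmoid scaling here;
# ML-Insights uses splines instead)
calib = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method='sigmoid', cv=5).fit(X_train, y_train)
cal = calib.predict_proba(X_test)[:, 1]

# Calibration typically lowers log loss for forest-style models
print('raw log loss:       ', log_loss(y_test, raw))
print('calibrated log loss:', log_loss(y_test, cal))
```

The key design point is the same as in ML-Insights: the mapping from score to probability is learned on held-out (cross-validated) predictions, not on the same data the model was trained on.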
I’ve written a couple of Jupyter notebooks which walk through this issue quite carefully. Check them out if you are interested! The Calibration_Example_ICU_MIMIC_Short notebook is best if you want to get right to the point; for a more detailed explanation, look at Calibration_Example_ICU_MIMIC.
Would love to hear any feedback or suggestions on this!
Just gave a talk today at MLConf 2016 in San Francisco on Interpreting Black-Box Models with Applications to Healthcare. It was great to publicly release the first version of our ML Insights package for Python. The GitHub repository, complete with code and some nicely worked-out examples, can be found here. Additional documentation can be found here.
Thanks to Courtney and Nick for giving me the opportunity to present. The slides are here: mlc_model_interp_talk
To whet your appetite, here are some visualizations you can easily create in a few lines after “pip install ml_insights”:
One criticism of modern machine learning algorithms such as random forests and gradient boosting is that the models are difficult to interpret. Whereas in linear regression the model makes assertions such as “every additional bathroom adds $25,000 to the value of a home”, there is no simple equivalent in tree-based regression methods.
Of course, the reality of the world is that there is no single number that gives the value of another bathroom to a property. Adding a 4th bathroom to a 1200 sq. ft., 2-bedroom condo may add little value, while adding a 2nd bathroom to a 1600 sq. ft., 4-bedroom house may yield an unusually large increase. The right answer should depend both on “which bathroom” and on the context of the other variables. So, one could argue, failing to give a simplistic, single-coefficient answer is a feature and not a bug.
In some sense, the more complicated, data-intensive (i.e. lower-bias, higher-variance) models are better able to capture these complicated dependencies, and thereby give more accurate predictions. However, we are still left with a desire to understand what is going on inside these “black-box” models, both for our own understanding and to build our confidence in the accuracy of the model.
One approach that is gaining recognition is the use of partial dependence plots (Friedman). The idea is to vary a single variable and examine how the output of a (black box) model changes. When doing this, you will typically notice two things:
- The resulting plot is not a line.
- The resulting plot depends on the values of the other features.
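The idea described above can be sketched in a few lines. This is a minimal, self-contained illustration using a synthetic regression dataset and a gradient boosting model as the “black box”; the helper function and variable names are my own, not part of any particular library:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Fit a black-box model on synthetic data (stand-in for a real dataset)
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """For each grid value, set the chosen feature to that value in every
    row and average the model's predictions over the dataset."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

# Sweep feature 0 across its observed range
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence_curve(model, X, 0, grid)
```

Plotting `grid` against `pd_curve` gives the partial dependence plot; plotting the per-row prediction traces instead of their average shows how strongly the effect depends on the other features. (scikit-learn also ships this functionality in `sklearn.inspection`.)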
I’m in the process of developing tools (with my colleague Ramesh Sampath) to generate such plots and make them easier to interpret. Stay tuned for more details soon!
I recently finished teaching my first data science bootcamp at Metis. It was a great experience: interacting with really smart and driven students, learning by teaching, and getting all sorts of new ideas about data science.