## Data Science, Probability, Life

### Life is best understood through a probabilistic lens

As modern machine learning methods become more ubiquitous, increasing attention is being paid to understanding how these models work. Typically these questions come in two flavors:

1. In general, what variables are important for this model and which are less influential?
2. For a specific prediction, what factors contributed most heavily to the model’s conclusion?

### General Model Understanding

In question 1, we are trying to get a general understanding of the mechanisms behind the model. For example, suppose we have an algorithm to predict the value of a house, which looks at a dozen or so factors. An “explanation” might be something along the lines of:

“The primary factors are the square footage of the house and the wealth/income of the neighborhood at large. The condition of the house is also important. Other factors such as the number of bathrooms, size of the lot, and whether it has a garage are somewhat important. The rest of the variables have a relatively minor impact.”

Often, people wish to make statements such as “variable X is more important than variable Y”. One caveat to statements such as these is that one needs to consider both the magnitude and the frequency of the impact. For example, imagine a world where half the houses have 2-car garages and half have 1-car garages, and 0.1% of the houses have luxury swimming pools. All else being equal, a house with a 2-car garage is worth \$20,000 more than its 1-car garage counterpart, but a luxury swimming pool adds \$150,000 to the value of the house. Which variable is more “important”? The swimming pool has a higher magnitude of impact, but much lower frequency (more precisely, the variable corresponding to the swimming pool has lower entropy). There are any number of ways to combine the two factors into a single number, but you will always lose something significant in doing so.
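To make the magnitude-versus-frequency tension concrete, here is a minimal sketch using the figures from the example above. The "mean impact" (frequency times impact) is just one of the many possible ways to collapse the two factors into a single number; it is an illustrative choice, not a recommended importance measure.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a yes/no variable that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Frequencies and dollar impacts taken from the example in the text.
features = {
    "2-car garage": {"frequency": 0.5, "impact": 20_000},
    "luxury pool": {"frequency": 0.001, "impact": 150_000},
}

summary = {}
for name, f in features.items():
    summary[name] = {
        "impact_when_present": f["impact"],               # magnitude
        "mean_impact": f["frequency"] * f["impact"],      # averaged over all houses
        "entropy_bits": binary_entropy(f["frequency"]),   # how "spread out" the variable is
    }
    print(name, summary[name])
```

Ranked by impact when present, the pool wins; ranked by mean impact across all houses, the garage wins by a wide margin. Neither ranking alone tells the whole story.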

One way to get a feel for “how” a model is doing its reasoning is simply to see how the predictions change as you change the inputs. Tools such as Individual Conditional Expectation (ICE) plots examine precisely this. However, some care must be taken in evaluating these.
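The idea behind an ICE curve can be sketched in a few lines. Everything here is hypothetical: `predict_price` is a hand-written stand-in for a fitted black-box model, and the two houses and the grid are made up for illustration.

```python
def predict_price(house):
    """Hypothetical stand-in for a fitted black-box model (not a real model)."""
    price = 50_000 + 150 * house["sqft"]
    # In this toy model, extra square footage is worth more in a wealthy area.
    if house["wealthy_area"]:
        price += 100 * house["sqft"]
    return price

# Made-up data points and grid of values to sweep.
houses = [
    {"sqft": 1200, "wealthy_area": False},
    {"sqft": 2500, "wealthy_area": True},
]
sqft_grid = [1000, 2000, 3000]

def ice_curve(model, row, feature, grid):
    """Predictions for one row as a single feature is swept over the grid."""
    return [model({**row, feature: value}) for value in grid]

# One curve per house: same sweep, different starting context.
ice_curves = [ice_curve(predict_price, h, "sqft", sqft_grid) for h in houses]
for house, curve in zip(houses, ice_curves):
    print(house, curve)
```

The two curves have different slopes because the effect of square footage depends on the neighborhood. One reason for care: sweeping a single feature can produce combinations of inputs that never occur in real data, and the model's predictions there may not be meaningful.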

### Explaining a Specific Prediction

In question 2, we are confronted with a prediction on a single example and desire a “justification” for the conclusion of the model. Colloquially, we want to know “Why is that house so expensive?” (from the model’s point of view). The kinds of answers we are looking for might be “It’s a 5,000 sq ft mansion!” or “It’s in downtown Manhattan!”.

Beyond a simple reason, we might want something closer to what a real estate agent or professional appraiser might give. Typically, they may start with a baseline estimate, such as “The average house price in the U.S. is $225,000.” Then, from there, they would highlight the aspects of this particular house that make it different from typical. “Your town is a bit more expensive than other towns, so that makes the house worth 50K more. This house is smaller than average in your town, which makes it worth 25K less. But it has a relatively large lot (compared to comparably sized homes in your town), which makes it worth 10K more. It is slightly older, which makes it worth 7K less…” and so on.

As it happens, methods like SHAP, based on the Shapley value, can do almost exactly the same kind of analysis. XGBoost has integrated SHAP directly, making it possible to get these “prediction explanations” in just a few lines of code. More details to come in future posts! I’ll also be giving a talk on this subject at ODSC West in October 2019 in San Francisco!

As I discussed in an earlier post, one common mistake in classification is to treat an uncalibrated score as a probability. In the latest version of ML-Insights, we provide some simple functionality to use cross-validation and splines to calibrate models. After the calibration, the output of the model can more properly be used as a probability. Interestingly, even quite accurate models like Gradient Boosting (including XGBoost) benefit substantially from a calibration approach. Using ML-Insights, just a few short lines of code can improve your performance greatly when using probability-based metrics.
It’s as easy as:

    import ml_insights as mli
    from sklearn.ensemble import RandomForestClassifier
    # X_train, y_train, X_test are assumed to be defined already
    rfm = RandomForestClassifier(n_estimators=500, class_weight='balanced_subsample', n_jobs=-1)
    rfm_calib = mli.SplineCalibratedClassifierCV(rfm)
    rfm_calib.fit(X_train, y_train)
    # calibrated probability of class 1 for each test point
    test_res_calib = rfm_calib.predict_proba(X_test)[:, 1]

I’ve written a couple of nice Jupyter notebooks which walk through this issue quite carefully. Check them out if you are interested! The Calibration_Example_ICU_MIMIC_Short notebook is best if you want to get right to the point. For a more detailed explanation, look at Calibration_Example_ICU_MIMIC. Would love to hear any feedback or suggestions on this!

Just gave a talk today at MLConf 2016 in San Francisco on Interpreting Black-Box Models with Applications to Healthcare. It was great to publicly release the first version of our ML Insights package for Python. The GitHub repository, complete with code, can be found here together with some nicely worked-out examples. Additional documentation can be found here. Thanks to Courtney and Nick for giving me the opportunity to present. The slides are here: mlc_model_interp_talk

To whet your appetite, here are some visualizations you can easily create in a few lines after “pip install ml_insights”:

One criticism of modern machine learning algorithms such as random forests and gradient boosting is that the models are difficult to interpret. Whereas in linear regression, the model makes assertions such as “every additional bathroom adds $25,000 to the value of a home”, there is no simple equivalent in the tree-based regression methods.

Of course, the reality of the world is that there is no single number that gives the value of another bathroom to a property. Adding a 4th bathroom to a 1200 sq. ft., 2-bedroom condo may add little value, while adding a 2nd bathroom to a 1600 sq. ft., 4-bedroom house may yield an unusually large increase. The right answer should depend both on “which bathroom” and the context of other variables. So, one could argue, failing to give a simplistic, single-coefficient answer is a feature and not a bug.

In some sense, the more complicated, data-intensive (i.e. lower bias, higher variance) models are better able to capture these complicated dependencies, and thereby give more accurate predictions.  However, we are still left with a desire to understand what is going on inside of these “black-box” models both for our own understanding and to build our confidence in the accuracy of the model.

One approach that is gaining recognition is the use of partial dependence plots (Friedman).  The idea is to vary a single variable and examine how the output of a (black box) model changes.  When doing this, you will typically notice two things:

1. The resulting plot is not a line.
2. The resulting plot depends on the values of the other features.
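Both observations can be reproduced with a toy example. The sketch below assumes a hypothetical `black_box` function with a squared term and an interaction; the data values are made up. It varies `x1` for each data point while holding that point's `x2` fixed, then averages the resulting curves to get the partial dependence.

```python
def black_box(x1, x2):
    """Hypothetical model with a nonlinear effect and an interaction (illustration only)."""
    return x1 ** 2 + 3 * x1 * x2

# Made-up data: three points that differ only in x2.
x2_values = [-1.0, 0.0, 1.0]
x1_grid = [0.0, 1.0, 2.0]

# One curve per data point: vary x1, hold that point's x2 fixed.
curves = [[black_box(x1, x2) for x1 in x1_grid] for x2 in x2_values]

# Partial dependence: average the curves at each grid value.
partial_dependence = [sum(column) / len(column) for column in zip(*curves)]

for x2, curve in zip(x2_values, curves):
    print("x2 =", x2, "->", curve)
print("partial dependence:", partial_dependence)
```

The averaged curve is not a straight line, and the individual curves differ markedly depending on `x2`: one of them even decreases while the average increases. That spread is exactly the information a single averaged plot hides.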

I’m in the process of developing tools (with my colleague Ramesh Sampath) to generate such plots and make them easier to interpret.  Stay tuned for more details soon!

In most ML methods (including random forests, gradient boosting, logistic regression, and neural networks), the model outputs a score, which yields a “ranking classification”. However, there are two very common mistakes that occur in dealing with this score:

1. Using a “default” threshold of 0.5 automatically to convert to a hard classification, rather than examining the performance across a range of thresholds. (This is encouraged by sklearn’s convention that “model.predict” does precisely the former, while the latter requires the clunkier “model.predict_proba”.)
2. Treating the score directly as a probability, without calibrating it.  This is patently wrong when using models like random forests (where the vote proportion certainly does not indicate the probability of being a ‘1’), and inaccurate even in logistic regression (where the output purports to be a probability, but often is not well calibrated).
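A quick way to check the second mistake is to bin the scores and compare the average score in each bin with the observed fraction of 1s. The following is a minimal sketch with made-up scores and labels; `reliability_table` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import defaultdict

def reliability_table(scores, labels, n_bins=5):
    """Per score bin: (mean score, observed fraction of 1s, count)."""
    bins = defaultdict(list)
    for s, y in zip(scores, labels):
        index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((s, y))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        frac_ones = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_score, frac_ones, len(pairs)))
    return table

# Made-up model scores and true labels.
scores = [0.30, 0.30, 0.30, 0.30, 0.70, 0.70, 0.70, 0.70]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

table = reliability_table(scores, labels)
for mean_score, frac_ones, count in table:
    print(f"mean score {mean_score:.2f}  observed rate {frac_ones:.2f}  n={count}")
```

If the scores were well-calibrated probabilities, the two columns would roughly agree. Here the points scored 0.30 are never a '1', so reading 0.30 as "a 30% chance" would be misleading even though the ranking is sensible.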

We’ll dive into these mistakes in more detail in future posts.

Often students working on binary (0-1) classification problems would tell me that a particular model or approach “doesn’t work”.  When I would ask to see the results of the model (say, on a holdout set), they would show me a confusion matrix where the model predicted 0 for every data point.  When I asked what threshold they used, or what the ROC curve (or better yet, Precision-Recall Curve) looked like, only then would they start to realize that they had missed something important.
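The situation is easy to reproduce. In the sketch below (scores and labels are made up), the model ranks every positive above every negative, yet a threshold of 0.5 produces a confusion matrix that predicts 0 for everything.

```python
# Made-up scores for an imbalanced problem: 7 negatives, 3 positives.
scores = [0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.35, 0.40]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, fn, tn) when predicting 1 for scores >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Sweep a few thresholds instead of trusting the default of 0.5.
results = {t: confusion_counts(scores, labels, t) for t in (0.5, 0.2, 0.1)}
for threshold, counts in results.items():
    print(threshold, counts)
```

At 0.5 the model "doesn't work" (no positives predicted); at 0.2 it separates the classes perfectly; at 0.1 it trades some false positives for the same recall. The model did not change, only the threshold.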

One very important aspect of binary classification that is (IMHO) not sufficiently stressed is that there are actually three different problems:

1. Hard Classification – firmly deciding to make a hard 0/1 call for each data point in the test set.
2. Ranking Classification – “scoring” each data point, where a higher score means more likely to be a ‘1’ (and thereby ranking the entire test set from most likely to least likely to be a ‘1’).
3. Probability Prediction – assigning to each point a (well-calibrated) probability that it is a ‘1’.
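Because these are different problems, they call for different metrics. As a rough sketch (the choice of accuracy, AUC, and Brier score is illustrative, and the data are made up), the same set of scores can be evaluated three ways:

```python
def accuracy(labels, hard_preds):
    """Hard classification: fraction of 0/1 calls that are correct."""
    return sum(int(y == p) for y, p in zip(labels, hard_preds)) / len(labels)

def auc(labels, scores):
    """Ranking: fraction of (positive, negative) pairs ordered correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(labels, probs):
    """Probability prediction: mean squared error between probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# Made-up labels and model scores.
labels = [0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40]

hard_preds = [1 if s >= 0.5 else 0 for s in scores]  # default 0.5 threshold
metrics = {
    "accuracy": accuracy(labels, hard_preds),
    "auc": auc(labels, scores),
    "brier": brier(labels, scores),
}
print(metrics)
```

On this toy data the ranking is perfect, the hard calls at 0.5 miss every positive, and the Brier score says how far the scores are from the outcomes when read as probabilities. A model can do well on one of the three problems and poorly on another.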

I recently finished teaching my first data science bootcamp at Metis. It was a great experience: interacting with really smart and driven students, learning by teaching, and getting all sorts of new ideas about data science.
