One criticism of modern machine learning algorithms such as random forests and gradient boosting is that the models are difficult to interpret.  Whereas a linear regression makes assertions such as “every additional bathroom adds $25,000 to the value of a home”, there is no simple equivalent in tree-based regression methods.

Of course, the reality of the world is that there is no single number that gives the value of another bathroom to a property.  Adding a 4th bathroom to a 1200 sq. ft. 2-bedroom condo may add little value, while adding a 2nd bathroom to a 1600 sq. ft., 4-bedroom house may yield an unusually large increase.  The right answer should depend both on “which bathroom” and on the context of the other variables.  So, one could argue, failing to give a simplistic, single-coefficient answer is a feature and not a bug.

In some sense, the more complicated, data-intensive (i.e. lower bias, higher variance) models are better able to capture these complex dependencies, and thereby give more accurate predictions.  However, we are still left with a desire to understand what is going on inside of these “black-box” models, both for our own understanding and to build our confidence in the accuracy of the model.

One approach that is gaining recognition is the use of partial dependence plots (Friedman, 2001).  The idea is to vary a single feature, holding the other features at their observed values, and examine how the output of a (black-box) model changes.  When doing this (sketched in code below the list), you will typically notice two things:

  1. The resulting plot is not a line.
  2. The resulting plot depends on the values of the other features.

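To make the idea concrete, here is a minimal sketch of computing a one-feature partial dependence curve by hand.  The synthetic data, the model choice, and the helper `partial_dependence_curve` are all illustrative assumptions for this post, not the tooling mentioned below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Fit a "black-box" model on synthetic data (a stand-in for real housing data).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid_points=50):
    """Sweep one feature over a grid, holding the other features at their
    observed values, and average the model's predictions at each grid value."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                     # force every row to this value
        averaged.append(model.predict(X_mod).mean())  # average over the other features
    return grid, np.array(averaged)

grid, pd_curve = partial_dependence_curve(model, X, feature=0)
```

Averaging the per-row predictions, as above, gives the classic partial dependence curve; plotting the individual per-row curves instead of their average is one way to see point 2, since each row carries different values of the other features.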
I’m in the process of developing tools (with my colleague Ramesh Sampath) to generate such plots and make them easier to interpret.  Stay tuned for more details soon!