MLI with RSparkling

2019-06-01 00:00:00 AI Machine Learning h2oai

Last evening my colleague Navdeep Gill (@Navdeep_Gill_) posted a link to his latest talk titled “Interpretable Machine Learning with RSparkling.” Navdeep is part of our MLI team and has a wealth of experience to share about explaining black boxes with modern techniques like Shapley values and LIME.

Machine Learning Interpretability (MLI)

H2O has this awesome open source Big Data software called Sparkling Water. It’s similar to RapidMiner’s Radoop but 1) open source, 2) more powerful, and 3) tested by the masses. It’s stable and runs on many a Hadoop cluster with Spark. The neat thing about Sparkling Water is that you can take the H2O.ai algorithms and push them down to the cluster to train on your ‘Big Data.’ H2O-3 has quite a few powerful, fast, and accurate algorithms. H2O-3 is the current version of the open source set of algorithms, and H2O.ai continues to develop this suite over time. Most recently they added [Isolation Forests](/isolation-forests-h2oai/)!

Surrogate Models/Shapley Values/RSparkling

Sparkling Water lets Data Scientists and Hadoop DevOps people connect and work how they want to. You can connect via Scala, R, and Python. In Navdeep’s talk, he uses R to connect to Sparkling Water, hence RSparkling. His presentation goes into a few basics of H2O, Sparkling Water, and R but ends with the fascinating topic of Machine Learning Interpretability (MLI). He shows how simple interpretability can be done using H2O-3’s GLM and CoxPH algorithms (H2O-3 is also open source).

While the GLM and CoxPH algos are part of a ‘Surrogate Modeling’ technique, they are NOT exact. Instead, they approximate the decision boundary of the more complex model. While this is fast, it might not be what you need. For a highly accurate method, you want to use Shapley values.
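To make the surrogate idea concrete, here is a minimal sketch in plain Python (not H2O-3 and not the code from Navdeep's talk). The `black_box` function is a made-up stand-in for a complex model, and every helper name here is an assumption for illustration. The sketch fits a simple linear model to the black box's own predictions and then measures how faithfully the surrogate reproduces them.

```python
import random


def black_box(x1, x2):
    """Hypothetical stand-in for a complex model: nonlinear, with an interaction."""
    return 3.0 * x1 + 2.0 * x2 * x2 + x1 * x2


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_surrogate(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    design = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(rows, targets, weights):
    """Share of the black box's variance that the surrogate reproduces."""
    preds = [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in rows]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)
rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
targets = [black_box(x1, x2) for x1, x2 in rows]  # the black box's predictions

weights = fit_linear_surrogate(rows, targets)
fidelity = r_squared(rows, targets, weights)

print("intercept, coef_x1, coef_x2:", [round(w, 3) for w in weights])
print("surrogate R^2 against the black box:", round(fidelity, 3))
```

The coefficients are easy to read, but the R² stays below 1 because a straight line cannot capture the squared term or the interaction. That gap is the price of a fast, simple surrogate.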

H2O-3 allows you to use Tree SHAP with its XGBoost and GBM algorithms. These, of course, can be used with R and Sparkling Water, so you can generate ‘reason codes’ for billions of rows of data right on your cluster.
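For intuition on what a ‘reason code’ is, here is a small sketch that computes exact Shapley values by brute force over every feature coalition. To be clear, this is not Tree SHAP and not H2O-3 code: Tree SHAP gets the same kind of per-feature contributions efficiently for tree models, while this version only works for a handful of features. The `toy_model`, the feature names, and the baseline row are all made up for illustration.

```python
from itertools import combinations
from math import factorial


def toy_model(row):
    """Hypothetical risk score; ignores tenure_years on purpose."""
    return (0.5 * row["utilization"]
            + 0.3 * row["late_payments"]
            + 0.2 * row["utilization"] * row["late_payments"])


def shapley_values(model, row, baseline):
    """Exact Shapley values; 'missing' features take their baseline value."""
    features = list(row)
    n = len(features)

    def value(coalition):
        # Features in the coalition use the row's value, the rest use the baseline.
        mixed = {f: (row[f] if f in coalition else baseline[f]) for f in features}
        return model(mixed)

    contributions = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_f = set(subset)
                total += weight * (value(without_f | {f}) - value(without_f))
        contributions[f] = total
    return contributions


row = {"utilization": 0.9, "late_payments": 2.0, "tenure_years": 1.0}
baseline = {"utilization": 0.3, "late_payments": 0.0, "tenure_years": 5.0}

contribs = shapley_values(toy_model, row, baseline)
prediction = toy_model(row)
base_value = toy_model(baseline)

# Rank features by absolute contribution: these are the row's reason codes.
for name, phi in sorted(contribs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {phi:+.3f}")
print("sum of contributions:", round(sum(contribs.values()), 3))
print("prediction - baseline:", round(prediction - base_value, 3))
```

The contributions add up to the difference between the row's prediction and the baseline prediction, which is what makes them usable as reason codes. The feature the model ignores gets a contribution of zero.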

More Information

Navdeep’s presentation ends with a demo of a large credit card data set on Hadoop and with a slide for more information. I suggest that you check out this free O’Reilly book that H2O.ai published, with Navdeep as a co-author. It goes into the heavy math of MLI, but it’s only 40 pages long, so it’s a great read on a flight.