Exploring H2O.ai

A few years ago RapidMiner incorporated a fantastic open source library from H2O.ai. That gave the platform Deep Learning, GLM, and GBT algorithms, something it had been lacking for a long time. If you were to look at my usage statistics, I’d bet you’d see that the Deep Learning and GLM algos are my favorites.

Late last year H2O.ai released Driverless AI, an automated modeling platform that can scale easily to GPUs.

What I find fascinating is their approach of questioning the model at each step of the way. The video above illustrates the problem with lung tumor detection: is your model learning the shape of the ribs or the size of the tumor? You would hope it was the tumor! Fascinating video.

Innovating

I know that H2O.ai relentlessly drives their open source market, and everywhere I look there’s an H2O.ai library being imported or used. It wasn’t a shock to me to see a new update to their Driverless AI product, but what got me giddy was their incorporation of time series. Time series can always be a pain, and it is easy to make mistakes, especially in the validation phase, so this is just plain cool. I definitely need to check this out more.
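To make the validation pitfall concrete, here is a minimal sketch of time-ordered (expanding window) validation splits, where every fold validates only on rows that come after its training window. This is not how Driverless AI does it internally; the function name `expanding_window_splits` and its parameters are my own assumptions for illustration.

```python
def expanding_window_splits(n_rows, n_folds, horizon):
    """Yield (train_indices, valid_indices) pairs in time order.

    Rows are assumed to be sorted oldest-first. Each fold validates on the
    `horizon` rows that immediately follow its training window, so no fold
    ever trains on data from the future.
    """
    if n_folds < 1 or horizon < 1:
        raise ValueError("n_folds and horizon must be positive")
    first_train_end = n_rows - n_folds * horizon
    if first_train_end < 1:
        raise ValueError("not enough rows for the requested folds")
    for fold in range(n_folds):
        train_end = first_train_end + fold * horizon
        train_idx = list(range(train_end))
        valid_idx = list(range(train_end, train_end + horizon))
        yield train_idx, valid_idx


# Hypothetical example: 10 time-ordered rows, 3 folds, 2 rows per validation window.
splits = list(expanding_window_splits(n_rows=10, n_folds=3, horizon=2))
for train_idx, valid_idx in splits:
    print(len(train_idx), valid_idx)
```

The common mistake is a random shuffle split, which lets the model peek at future rows and makes the validation score look better than it will be in production.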

Video Highlights/Notes

  • Many Kaggle Grandmasters at H2O

  • Built Driverless AI to avoid common pitfalls/mistakes of data science

  • Automate tasks: cross validation, time series, feature engineering, etc.

  • Ran it on a Kaggle challenge and it came in 18th

  • Goal: Build robust models and avoid overfitting

  • Automatic visualization (of big data)

  • No outlier removal; outliers remain in the big data set

  • Want to deploy a good model / must have an approximate interpretation

  • Java deployment package / Driverless AI will have a pure Java deployment

  • Not just talking about models but an entire model pipeline (feature generation, model building, stacking, etc.)

  • Typically deployed to a Linux box

  • Will be building a Java scoring logic to score the model pipeline (on roadmap)

  • Sparkling Water will be incorporated into Driverless AI so you can run this easily on Big Data

  • Want to write R/Py scripts to interact with Driverless AI, so it makes sense to the data scientist and is easy to use rather than complex

  • Deep Learning is inside but not enabled yet

  • Compromise: if you want to train many models, select a good-sized but not huge training set. There is a tradeoff between the number of models and training time

  • User defined functions coming

  • Import the training and testing data. The model will be built on training data only (won’t look at testing data)

  • Does batch style transformations instead of row by row for training

  • BUT it will do row by row transformations for the testing set

  • Uses a genetic algo to create new features

  • Checks overfitting and stops early based on a holdout

  • Uses methods to evaluate and prevent overfitting

  • Only validation scores are provided (out of sample estimates)

  • Interpretability is built in

  • After the model is created, you can build a stacked model

  • Download scoring package, all built in so you can put this into production
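The notes about building on training data only, with batch transformations for training and row by row transformations for the testing set, can be sketched roughly as follows. This is only an illustration of the general pattern, not Driverless AI code; the `StandardizeColumn` class and the `age` column are hypothetical names I made up.

```python
from statistics import mean, pstdev


class StandardizeColumn:
    """Hypothetical transformer that learns statistics from training data only."""

    def __init__(self, column):
        self.column = column
        self.mean_ = None
        self.std_ = None

    def fit_transform_batch(self, train_rows):
        # Batch step: statistics come from the whole training set at once.
        values = [row[self.column] for row in train_rows]
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against a constant column
        return [self.transform_row(row) for row in train_rows]

    def transform_row(self, row):
        # Row step: reuse the stored training statistics, never refit.
        if self.mean_ is None:
            raise RuntimeError("fit on training data before scoring rows")
        scaled = dict(row)  # copy so the input row is left untouched
        scaled[self.column] = (row[self.column] - self.mean_) / self.std_
        return scaled


# Made-up toy data for illustration only.
train = [{"age": 20.0}, {"age": 30.0}, {"age": 40.0}]
test = [{"age": 50.0}]

scaler = StandardizeColumn("age")
train_scaled = scaler.fit_transform_batch(train)
test_scaled = [scaler.transform_row(row) for row in test]
print(train_scaled, test_scaled)
```

Because the test rows never influence the stored mean and standard deviation, the same `transform_row` call can be used later to score one incoming record at a time in production.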



Date
August 10, 2018