Exploring H2O.ai
A few years ago RapidMiner incorporated a fantastic open source library from H2O.ai. That gave the platform Deep Learning, GLM, and GBT algorithms, something it had been lacking for a long time. If you were to look at my usage statistics, I’d bet you’d see that the Deep Learning and GLM algos are my favorites.
Late last year H2O.ai released Driverless AI, an automated modeling platform that can scale easily to GPUs.
What I find fascinating is their approach of questioning each step of the way. The above video illustrates that problem with lung tumor detection. Is your model learning the shape of the ribs or the size of the tumor? You would hope it was the tumor! Fascinating video.
Innovating
I know that H2O.ai relentlessly drives their open source market, and everywhere I look there’s an H2O.ai library being imported or used. It wasn’t a shock to me to see a new update to their Driverless AI product, but what got me giddy was their incorporation of time series. Time series can always be a pain and you can make mistakes easily, especially in the validation phase of things, so this is just plain cool. I definitely need to check this out more.
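To show the kind of validation mistake I mean: with time series you can't shuffle rows into random folds, because the model would then train on the future and validate on the past. Below is a minimal sketch of an expanding-window split in plain Python. The function name `expanding_window_splits` and its parameters are my own for illustration, and this is not how Driverless AI does it internally; it only shows the general idea that every validation block comes strictly after its training block.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_folds: int, horizon: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs for time-ordered data.

    Rows are assumed to be sorted by time. Each validation block has
    `horizon` rows and starts right after the end of its training block,
    so the model never sees the future during training.
    """
    if n_folds < 1 or horizon < 1:
        raise ValueError("n_folds and horizon must be positive")
    first_valid_start = n_samples - n_folds * horizon
    if first_valid_start < 1:
        raise ValueError("not enough samples for the requested folds")

    for fold in range(n_folds):
        valid_start = first_valid_start + fold * horizon
        train_idx = list(range(valid_start))  # everything before the cut
        valid_idx = list(range(valid_start, valid_start + horizon))
        yield train_idx, valid_idx


splits = list(expanding_window_splits(n_samples=10, n_folds=3, horizon=2))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
```

The training window grows with each fold while the validation window slides forward, so every score is an honest out-of-sample estimate.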
Video Highlights/Notes
Many Kaggle Grandmasters at H2O
Built Driverless AI to avoid common pitfalls/mistakes of data science
Automate tasks: Cross validation, time series, feature engineering, etc
Ran it on a Kaggle challenge, came in 18th position.
Goal: Build robust models and avoid overfitting
Automatic visualization (of big data)
No outlier removal; outliers remain in the big data set
Want to deploy a good model / must have an approximate interpretation
Java deployment package / Driverless AI will have a pure Java deployment
Not just talking about models but an entire model pipeline (feature generation, model building, stacking, etc)
Typically deployed to a Linux box
Will be building a Java scoring logic to score the model pipeline (on roadmap)
Sparkling Water will be incorporated into Driverless AI so you can run this easily on Big Data
Want to write R/Py scripts to interact with Driverless AI so that it makes sense to the data scientist and is easy to use rather than complex
Deep Learning is inside but not enabled yet
Compromise: If you want to train many models, select a good sized training set but not a huge one. There is a number of models vs training time tradeoff
User defined functions coming
Import the training and testing data. The model will be built on training data only (won’t look at testing data)
Does batch style transformations instead of row by row for training
BUT it will do row by row transformations for testing set
Uses a genetic algo to create new features
Checks overfitting and stops early based on a holdout
Uses methods to evaluate and prevent overfitting
Only validation scores are provided (out of sample estimates)
Interpretability is built in
After the model is created, you can build a stacked model
Download scoring package, all built in so you can put this into production
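
The note about checking overfitting and stopping early on a holdout is worth a quick illustration. Here is a minimal sketch in plain Python, assuming you already have one holdout loss per training round. The function name `early_stopping_round` and the `patience` and `min_delta` parameters are hypothetical names of mine, and this is a generic early stopping rule, not the actual Driverless AI logic.

```python
from typing import Iterable, Tuple


def early_stopping_round(
    holdout_losses: Iterable[float], patience: int = 3, min_delta: float = 0.0
) -> Tuple[int, float]:
    """Return (best_round, best_loss) using a simple patience rule.

    Training stops once the holdout loss has failed to improve by more
    than `min_delta` for `patience` consecutive rounds. Later rounds are
    never looked at, just as they would never be trained.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(holdout_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            break  # holdout stopped improving: likely overfitting from here on
    return best_round, best_loss


# Made-up holdout losses: they improve, then drift upward as the model overfits.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
best_round, best_loss = early_stopping_round(losses, patience=3)
print(best_round, best_loss)
```

With these numbers the loop stops at round 5 and keeps round 2 as the best model, so the late 0.50 is never reached. That is the tradeoff with any patience setting: a larger value trains longer but is less likely to quit too soon.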