H2O Tutorials

Introduction to Driverless AI

I finally posted a new video on my YouTube channel after a year of no activity. It felt good, and it is part of a ‘content refresh’ project I’m working on. In this video I give an introduction to Driverless AI and its EDA capabilities. The forthcoming videos will cover training, testing, diagnostics, machine learning interpretability, and much more. Please drop a comment or question in the channel if you have any.

Python Datatable (From H2O.ai)

I missed this presentation at H2O World and I’m glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on the Python version of R’s data.table, called simply datatable.

I’m going to try this new package out in my next Python munging work. It looks incredibly fast. As I do with all my videos, I’ve added my notes for readers below.

Datatable Notes

Introduction to using the open source datatable
9 million rows in 7 seconds??
Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
Has a Python frontend with a C++ backend
Parallelized with OpenMP and Hogwild
Supports boolean, integer, real, and string functions
Hashing trick based on Murmur hash function
Second-order feature interactions
One-vs-rest multinomial classification and regression targets (experimental)
As simple as ‘import datatable as dt’
Use it because it’s reliable and fast; datatable FTRL is already on Kaggle and is open source!!!
Datatable comes from the popular R data.table package
When Driverless AI started, we knew Pandas was a problem
Pandas is memory hungry
Realized we needed a python version of datatable
The first customer is Driverless AI
Wanted it to be multithreaded and efficient
Memory thrifty
Memory-mapped data sets (a data set can live in memory or on disk)
Native C++ implementation
Open Source
Fread: A doorway to Driverless AI, reading in data
Next step in DAI is to save it to a binary format
The file is called ‘.jay’
Check it with ‘%%timeit’
Opening a .jay file is nearly instant
The syntax is very SQL-like; if you’re familiar with R’s data.table, you’ll pick it up quickly
See timestamp 16:00 for the basic syntax in use
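
The FTRL items in the notes above (hashing trick, second-order feature interactions) can be illustrated with a small pure-Python sketch. This is not datatable's implementation: it is a toy, single-threaded FTRL-Proximal logistic regression, the class name `FtrlSketch` and its parameters are hypothetical, and `zlib.crc32` stands in for the Murmur hash because Murmur is not in the standard library.

```python
import math
import zlib
from itertools import combinations


class FtrlSketch:
    """Toy FTRL-Proximal logistic regression over hashed categorical features."""

    def __init__(self, nbins=2**16, alpha=0.5, beta=1.0, l1=0.0, l2=0.0,
                 interactions=True):
        self.nbins, self.alpha, self.beta = nbins, alpha, beta
        self.l1, self.l2, self.interactions = l1, l2, interactions
        self.z = [0.0] * nbins  # accumulated adjusted gradients per bin
        self.n = [0.0] * nbins  # accumulated squared gradients per bin

    def _bins(self, row):
        # Hashing trick: every "column=value" token maps to one of nbins slots.
        # crc32 is an assumption here, standing in for a Murmur hash.
        tokens = [f"{k}={v}" for k, v in sorted(row.items())]
        if self.interactions:
            # Second-order interactions: hash each pair of tokens as well.
            tokens += [a + "&" + b for a, b in combinations(tokens, 2)]
        return {zlib.crc32(t.encode("utf-8")) % self.nbins for t in tokens}

    def _weight(self, i):
        # Closed-form FTRL-Proximal weight; L1 zeroes out small coordinates.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        shrunk = z - math.copysign(self.l1, z)
        return -shrunk / ((self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def _score(self, weights):
        s = max(-35.0, min(35.0, sum(weights)))  # clip to keep exp() stable
        return 1.0 / (1.0 + math.exp(-s))

    def predict(self, row):
        return self._score(self._weight(i) for i in self._bins(row))

    def update(self, row, y):
        weights = {i: self._weight(i) for i in self._bins(row)}
        p = self._score(weights.values())
        g = p - y  # log-loss gradient for each active (binary) feature
        for i, w in weights.items():
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p


# XOR-style toy data: only the pairwise interaction separates the classes.
rows = [({"a": 0, "b": 0}, 0), ({"a": 0, "b": 1}, 1),
        ({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 0)]
model = FtrlSketch()
for _ in range(200):
    for row, y in rows:
        model.update(row, y)
for row, y in rows:
    print(row, y, round(model.predict(row), 3))
```

The toy data is chosen so that no single column predicts the label; the model can only separate the classes through the hashed pair tokens, which is the point of the second-order interactions.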

Questions and Answers

Can you create a datatable from Redshift or some other database? No. The suggestion is to connect with Pandas and then convert to datatable
Is Python datatable as fully featured as R’s data.table, and if not, is there a plan to build it out? No, it’s still being built out

Making AI Happen Without Getting Fired (Marketing Applications)

I watched Mike Gualtieri’s keynote presentation from H2O World San Francisco (2019) and found it very insightful from a non-technical, MBA-type perspective. The gist of the presentation is to look at all the business connections involved in doing data science. It’s not just about the problem at hand but about setting yourself up for success and, as he puts it, not getting fired!

My notes from the video are below (emphasis mine):

Set the proper expectations
There is a difference between Pure AI and Pragmatic AI
Pure AI is like what you see in movies (e.g., Ex Machina)
Pragmatic AI is machine learning: highly specialized in one thing, but it does that thing really well
Choose more than one use case
The use case you choose could fail. Choose many different kinds
Drop the ones that don’t work and optimize the ones that do
Ask for comprehensive data access
Data will be in silos
Get faster with AutoML
Data Scientists aren’t expensive, they need better tools to be more efficient
Three segments of ML tools:
Multimodal (drag and drop like RapidMiner/KNIME)
Notebook-based (like Jupyter Notebook)
Automation-focused (like Driverless AI)
Use them to augment your work, go faster
Warning: data-savvy users can use these tools to build ML models. That can be dangerous, but they can help vet use cases
Know when to quit
Sometimes the use case won’t work. There is no signal in the data and you must quit
Stop wasting time
Keep production models fresh
When code is written, it’s written the same way and runs the same forever
ML models decay, so you need to figure out how to refresh them at scale
Model staging, A/B testing, Monitoring
Model deployment via collaboration with DevOps
Get Business and IT engaged early
Have meetings with business and IT, get your ducks in a row
Ask yourself how it is going to be deployed and how it will impact business processes
Ignore the model to protect the jewels
You don’t have to do what the model tells you to do (e.g., false positives)
Knowledge Engineering: AI and Humans working together
Explainability is important
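
The ‘keep production models fresh’ notes mention monitoring without going into detail. As a rough illustration, and not something shown in the keynote, a decay check could track accuracy over a rolling window of recent predictions and flag the model when it drops below a baseline. The class name, window size, and tolerance below are all hypothetical.

```python
from collections import deque


class AccuracyMonitor:
    """Flags a model for refresh when rolling accuracy falls below a baseline."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline    # accuracy measured at deployment time
        self.tolerance = tolerance  # how much decay is acceptable
        self.hits = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, predicted, actual):
        self.hits.append(predicted == actual)

    def accuracy(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def needs_refresh(self):
        # Only judge once the window is full, to avoid noisy early alarms.
        if len(self.hits) < self.hits.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.90, window=100)
for i in range(100):
    # Simulated decay: every fifth prediction is wrong.
    monitor.record(predicted=1, actual=0 if i % 5 == 0 else 1)
print(monitor.accuracy(), monitor.needs_refresh())
```

In practice the actual outcomes often arrive late, so a check like this would run on whatever labeled feedback is available.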

Time Series for H2O with Modeltime

Matt Dancho, founder of Business Science
Introduced to H2O-3 via the AutoML package
Sample code in R shared
Sample forecasting project / Walmart Sales
Tidymodels standardize machine learning packages
Modeltime loads H2O
Multiple time series
Create a forecast time horizon and assess a 52-week forecast
Create preprocessing steps, which help the H2O algos find good features
Some columns are normalized by the preprocessing
Extracted time-related features (e.g., week number, day of the week, etc.)
Initializes H2O-3 / Stacked Ensemble model will be the best but hard to interpret
Modeltime workflow starts with a table
Modeltime is an organizational tool
Modeltime Calibrate will extract the residuals of the models
Visualizing the forecast on the test set generates nice charts
Built a single H2O-3 model to predict on 7 different time series
This is very scalable, instead of looping through everything
Refit the model on the entire training data and then did a forward walk of 52 weeks
The Modeltime ecosystem was created to help with higher-frequency time series, at scale, in an automated way
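
The code in the video is R with modeltime. The sketch below is plain Python using only the standard library, so it illustrates the idea behind the time-related features and the 52-week horizon rather than the modeltime API. The function names and series ids are hypothetical.

```python
from datetime import date, timedelta


def calendar_features(d):
    """Time-related features for one date (ISO week number, weekday, etc.)."""
    return {
        "year": d.year,
        "month": d.month,
        "week": d.isocalendar()[1],  # ISO week number, 1..53
        "day_of_week": d.weekday(),  # Monday = 0 ... Sunday = 6
    }


def future_frame(series_ids, last_date, horizon=52):
    """One row per (series, future week): the rows a single global model would score."""
    rows = []
    for sid in series_ids:
        for k in range(1, horizon + 1):
            d = last_date + timedelta(weeks=k)
            rows.append({"id": sid, "date": d, **calendar_features(d)})
    return rows


frame = future_frame(["series_1", "series_2"], date(2020, 8, 25))
print(len(frame), frame[0])
```

Keeping the series id as a column is what lets one model cover several time series at once, instead of looping over them and fitting a model per series.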

Date

August 25, 2020

Neural Market Trends