H2O Tutorials

Introduction to Driverless AI

I finally posted a new video on my YouTube channel after a year of no activity. It felt good, and it's part of the 'content refresh' project I'm working on. In this video, I give an introduction to Driverless AI and its EDA capabilities. The forthcoming videos will cover training, testing, diagnostics, machine learning interpretability, and much more. Please drop a comment or question in the channel if you have any.

Python Datatable (from H2O.ai)

I missed this presentation at H2O World and I'm glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on the Python counterpart of R's data.table, called simply datatable.

I'm going to try this new package out in my next Python data-munging work. It looks incredibly fast. As with all my videos, I've added my notes for readers below.

Datatable Notes

  • Introduction to using the open-source datatable package
  • 9 million rows in 7 seconds??
  • Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
  • Has a Python frontend with a C++ backend
  • Parallelized with OpenMP and Hogwild
  • Supports boolean, integer, real, and string features
  • Hashing trick based on Murmur hash function
  • Second-order feature interactions
  • One-vs-rest multinomial classification and regression targets (experimental)
  • As simple as 'import datatable as dt'
  • Use it because it's reliable, fast, already in Kaggle, and open source!
  • Datatable comes from the popular R data.table package
  • When Driverless AI started, we knew Pandas was a problem
  • Pandas is memory hungry
  • Realized we needed a Python version of data.table
  • The first customer is Driverless AI
  • Wanted it to be multithreaded and efficient
  • Memory thrifty
  • Memory-mapped datasets (a dataset can live in memory or on disk)
  • Native C++ implementation
  • Open Source
  • Fread: A doorway to Driverless AI, reading in data
  • Next step in DAI is to save it to a binary format
  • The file is called '.jay'
  • Check it with '%%timeit'
  • Opening a .jay file is nearly instant
  • Syntax is very SQL-like; if you're familiar with R's data.table, you'll pick it up quickly
  • See timestamp 16:00 for the basic syntax in use
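
To make the FTRL bullets above a bit more concrete, here is a minimal single-threaded sketch in pure Python of per-coordinate FTRL-Proximal for a binary target, using the hashing trick and second-order feature interactions. This is not datatable's implementation: the class name `FTRLProximal`, its parameters and default values, and the toy rows are my own illustrative assumptions, CRC32 from the standard library stands in for the Murmur hash, and there is no OpenMP/Hogwild parallelism.

```python
import math
import zlib


class FTRLProximal:
    """Illustrative per-coordinate FTRL-Proximal for binary logistic regression."""

    def __init__(self, nbins=2 ** 16, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        # All defaults here are assumptions chosen for the toy example.
        self.nbins = nbins
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * nbins  # accumulated adjusted gradients
        self.n = [0.0] * nbins  # accumulated squared gradients

    def _hash(self, token):
        # Hashing trick. CRC32 is a stand-in for Murmur (assumption).
        return zlib.crc32(token.encode("utf-8")) % self.nbins

    def _indices(self, row):
        # Every value is turned into a "name=value" token, so booleans,
        # integers, reals, and strings are all treated as categorical here.
        tokens = [f"{name}={value}" for name, value in sorted(row.items())]
        # Second-order interactions: one extra token per pair of features.
        pairs = [a + "&" + b for i, a in enumerate(tokens) for b in tokens[i + 1:]]
        return sorted({self._hash(t) for t in tokens + pairs})

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps this weight at exactly zero
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / step

    @staticmethod
    def _sigmoid(wx):
        wx = max(min(wx, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-wx))

    def predict(self, row):
        return self._sigmoid(sum(self._weight(i) for i in self._indices(row)))

    def update(self, row, y):
        """Do one online learning step on a single row; return the prediction."""
        idx = self._indices(row)
        weights = {i: self._weight(i) for i in idx}
        p = self._sigmoid(sum(weights.values()))
        g = p - y  # gradient of log loss for features with value 1
        for i in idx:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * weights[i]
            self.n[i] += g * g
        return p


# Toy data (made up): red/L rows are positive, blue/S rows are negative.
rows = [({"color": "red", "size": "L"}, 1), ({"color": "blue", "size": "S"}, 0)] * 50
model = FTRLProximal()
for row, y in rows:
    model.update(row, y)

p_red = model.predict({"color": "red", "size": "L"})
p_blue = model.predict({"color": "blue", "size": "S"})
```

After a few passes, `p_red` ends up above 0.5 and `p_blue` below it. The real datatable FTRL does the same kind of per-row update, but in C++ and across threads.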

Questions and Answers

  • Can you create a datatable from Redshift or some other database? No; the suggestion is to connect through Pandas and then convert to datatable
  • Is Python datatable as fully featured as R's data.table, and if not, is there a plan to build it out? No, it's still being built out

Making AI Happen Without Getting Fired (Marketing Applications)

I watched Mike Gualtieri's keynote presentation from H2O World San Francisco (2019) and found it very insightful from a non-technical, MBA type of perspective. The gist of the presentation is to look at all the business connections involved in doing data science. It's not just about the problem at hand but about setting yourself up for success and, as he puts it, not getting fired!

My notes from the video are below (emphasis mine):

  • Set the proper expectations
  • There is a difference between Pure AI and Pragmatic AI
  • Pure AI is like what you see in movies (e.g., Ex Machina)
  • Pragmatic AI is machine learning: highly specialized in one thing, but it does that one thing really well
  • Choose more than one use case
  • The use case you choose could fail. Choose many different kinds
  • Drop the ones that don't work and optimize the ones that do
  • Ask for comprehensive data access
  • Data will be in silos
  • Get faster with AutoML
  • Data Scientists aren't expensive, they need better tools to be more efficient
  • Three segments of ML tools:
  • Multimodal (drag and drop, like RapidMiner/KNIME)
  • Notebook-based (like Jupyter Notebook)
  • Automation-focused (like Driverless AI)
  • Use them to augment your work, go faster
  • Warning: Data-savvy users can use these tools to build ML. Can be dangerous but they can vet use cases
  • Know when to quit
  • Sometimes the use case won't work. There is no signal in the data and you must quit
  • Stop wasting time
  • Keep production models fresh
  • Once ordinary code is written, it runs the same way forever
  • ML models decay, so you need to figure out how to keep them fresh at scale
  • Model staging, A/B testing, Monitoring
  • Model deployment via collaboration with DevOps
  • Get Business and IT engaged early
  • Hold meetings with business and IT to get your ducks in a row
  • Ask yourself how it is going to be deployed and how it will impact the business process
  • Ignore the model to protect the jewels
  • You don't have to do what the model tells you to do (e.g., false positives)
  • Knowledge Engineering: AI and Humans working together
  • Explainability is important
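
The "keep production models fresh" bullets above are easier to picture with a small example. This is only a sketch of one possible monitoring check, not something shown in the keynote: the `DecayMonitor` class, the rolling window, and the 5-point tolerance are hypothetical choices of mine, and the predictions fed into it are made up.

```python
from collections import deque


class DecayMonitor:
    """Flag a deployed model whose recent accuracy falls below its baseline."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy      # accuracy measured at deployment
        self.tolerance = tolerance             # how much of a drop is acceptable
        self.outcomes = deque(maxlen=window)   # keeps only the latest outcomes

    def record(self, predicted, actual):
        # Store True when the prediction matched what actually happened.
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_refresh(self):
        # Only judge the model once a full window of outcomes is available.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = DecayMonitor(baseline_accuracy=0.90, window=10)

# Ten correct predictions: the model looks healthy.
for predicted, actual in [(1, 1)] * 10:
    monitor.record(predicted, actual)
healthy = monitor.needs_refresh()

# Five wrong predictions push the five oldest correct ones out of the window.
for predicted, actual in [(1, 0)] * 5:
    monitor.record(predicted, actual)
decayed = monitor.needs_refresh()
```

In a real deployment, a check like this would sit next to model staging and A/B testing and would trigger a retrain or a rollback rather than just setting a flag.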

Time Series for H2O with Modeltime

  • Matt Dancho, Founder of Business Science
  • Introduced to H2O-3 via the AutoML package
  • Sample code in R shared
  • Sample forecasting project / Walmart Sales
  • Tidymodels standardizes machine learning packages
  • Modeltime loads H2O
  • Multiple time series
  • Create a forecast time horizon and assess a 52-week forecast
  • Create preprocessing steps, which help the H2O algos find good features
  • Some columns are normalized by the preprocessing
  • Extracted time-related features (e.g., week number, day of the week)
  • Initializes H2O-3; the Stacked Ensemble model will be the best but is hard to interpret
  • Modeltime workflow starts with a table
  • Modeltime is an organizational tool
  • Modeltime Calibrate will extract the residuals of the models
  • Visualizing the forecast on the test set generates nice charts
  • Built a single H2O-3 model to predict on 7 different time series
  • This is very scalable, instead of looping through everything
  • Refit the model on the entire training data and then forecast forward 52 weeks
  • The Modeltime ecosystem was created to help with higher-frequency time series, at scale, in an automated way
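
The video's sample code is in R, but the two preprocessing ideas in the notes above (extracting time-related features and building a 52-week forecast horizon) can be sketched in a few lines of plain Python. This is not Modeltime's API: the function names `calendar_features` and `future_frame` and the last observed date are hypothetical and used only for illustration.

```python
from datetime import date, timedelta


def calendar_features(day):
    """Return simple time-related features for a single date."""
    iso_year, iso_week, iso_weekday = day.isocalendar()
    return {
        "date": day,
        "year": day.year,
        "month": day.month,
        "week_number": iso_week,      # ISO week of the year
        "day_of_week": iso_weekday,   # ISO convention: Monday=1 ... Sunday=7
    }


def future_frame(last_observed, horizon_weeks=52):
    """Build one feature row per future week after the last observed date."""
    return [
        calendar_features(last_observed + timedelta(weeks=step))
        for step in range(1, horizon_weeks + 1)
    ]


# Hypothetical last week of weekly training data (a Friday).
last_week = date(2012, 10, 26)
horizon = future_frame(last_week)
```

Each row in `horizon` is what a fitted model would be asked to predict on. With several time series, the same rows would be repeated once per series ID so that a single model can forecast all of them.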