Lead Scoring with H2O and Streamlit
Streamlit has to be one of my favorite new libraries out there, and I experiment with it when I have time. For fun, I wanted to see how easy it would be to build a simple UI with Streamlit and wrap H2O-3's AutoML into it, to build your own AutoML tool. In about 100 lines of code I was able to make a simple website that lets users upload a training data set and a scoring data set.
It runs H2O-3 AutoML on a binary classification problem and then gives you the predicted results. It's a simple POC (proof of concept), but it was fun nonetheless because Streamlit is so lightweight to use.
First initialize the required Python libraries.
This is pretty straightforward: you'll need Streamlit, pandas, H2O, and time.
import streamlit as st
import pandas as pd
import time
import h2o
from h2o.automl import H2OAutoML
Set some configuration variables
I decided to keep this code focused on binary classification problems, but with H2O-3's AutoML you could extend it to regression problems too (there's a small sketch of that after the parameters below); a full version would be for another tutorial.
All I did was set the random seed, the number of cross-validation folds, the early-stopping metric, and the maximum number of models to train. If you've ever used AutoML from H2O, you know it's really nice and powerful.
#Set some variables here (use this area to set up parameters for AutoML and datasets)
seed = 1234
nfolds = 5
stopping_metric = 'AUC'
max_models = 10
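Since I mentioned regression: as a quick hedged sketch (not part of this app), H2O AutoML treats a numeric, non-factor response as a regression problem, so the main changes would be skipping the .asfactor() conversion later on and choosing a regression early-stopping metric. The names below are illustrative only:

# Hypothetical regression setup, not part of this app: keep the response
# numeric (skip .asfactor()) and pick a regression stopping metric.
reg_stopping_metric = 'RMSE'  # RMSE is a valid H2O AutoML stopping metric
# reg_aml = H2OAutoML(max_models=max_models, seed=seed, nfolds=nfolds,
#                     stopping_metric=reg_stopping_metric)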
Uploading a Training and Scoring Dataset
Here I just read through the Streamlit documentation to figure out how to do this. I load two datasets, one for training and one for scoring, assign them to pandas dataframes so I can manipulate them, and then convert them to H2O frames for modeling and predicting. Note that I take a random sample of the scoring data to make things go faster; you can comment that out if you like.
uploaded_file = st.file_uploader("Choose a CSV file to train the Lead Scoring model", type="csv")
st.write('Note, you should remove ID values from the training set.')
if uploaded_file is not None:
    data = pd.read_csv(uploaded_file)
    with st.spinner('Wait for it...'):
        time.sleep(5)
    st.success('Done!')
    st.write(data)

target = st.text_input('Select the Target Variable (note: this is case sensitive)', 'Converted')
#ID = st.text_input('Select the ID column (note: this is case sensitive)', 'ID')

uploaded_file = st.file_uploader("Choose a CSV file to make predictions on the model", type="csv")
if uploaded_file is not None:
    load = pd.read_csv(uploaded_file)
    predict = load.sample(5)
    st.write("We'll do a random sample of 5 rows", predict)
Kick off Training and Scoring
Now comes the fun part: the training and scoring. Streamlit lets you add buttons and all kinds of standard web widgets to interact with your application. I made a 'Kick Off Training and Scoring' button that launches the H2O cluster, loads the pandas dataframe, builds 10 models with AutoML, and grabs the best model based on AUC. After that I grab the 2nd-best model for variable importance.
Why did I do that? There appears to be an issue with getting feature importance out of the top stacked-ensemble model, so for this POC we just grab the 2nd-best model for that and live with it. The app then scores the data and displays the results.
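If you'd rather not hard-code which leaderboard row to pull, here's a minimal sketch (not part of the app below) that walks the leaderboard and returns the first model that isn't a stacked ensemble; it relies on the fact that H2O model ids contain the algorithm name:

# Sketch: find the first leaderboard model that isn't a stacked ensemble,
# since stacked ensembles don't expose per-feature variable importance.
def first_non_ensemble(lb):
    for i in range(lb.nrows):
        model_id = lb[i, "model_id"]
        if "StackedEnsemble" not in model_id:
            return h2o.get_model(model_id)
    return None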
if st.button('Kick off Training & Predictions'):
    st.write('Initializing Training Cluster...')
    # Initialize the H2O cluster
    h2o.init()

    st.write('Loading Data into Model...')
    train = h2o.H2OFrame(data)
    train, test = train.split_frame(ratios=[.7])

    # Identify predictors and response
    x = train.columns
    y = target
    x.remove(y)

    # For binary classification, the response must be a factor
    train[y] = train[y].asfactor()
    test[y] = test[y].asfactor()

    # Run AutoML, capped at max_models base models
    with st.spinner('Wait for it...'):
        aml = H2OAutoML(max_models=max_models, seed=seed, nfolds=nfolds,
                        stopping_metric=stopping_metric,
                        exclude_algos=["StackedEnsemble", "DeepLearning", "DRF"])
        aml.train(x=x, y=y, training_frame=train)
        time.sleep(5)
    st.success('Done!')

    # View the AutoML Leaderboard
    lb = aml.leaderboard
    #lb.head(rows=lb.nrows)

    # Get leader performance on the held-out test set
    perf_leader = aml.leader.model_performance(test).auc()
    st.write("The best model's AUC on the test set is:", str(perf_leader))
    perf_f1 = aml.leader.model_performance(test).F1()
    st.write("The best model's F1 (at the max-F1 threshold) is:", str(perf_f1))

    # Grab the 2nd-best model from the leaderboard for variable importance
    m = h2o.get_model(lb[1, "model_id"])
    FI = m.varimp(use_pandas=True)
    st.write("Important Features", FI)

    # Get predictions on the uploaded scoring sample
    st.write('Initializing Prediction Cluster...')
    predict_frame = h2o.H2OFrame(predict)
    preds = aml.predict(predict_frame)
    st.write(preds)

    # Join the predictions back to the scoring rows
    tmp = preds.as_data_frame()
    tmp2 = predict_frame.as_data_frame()
    out = pd.merge(tmp, tmp2, left_index=True, right_index=True)
    st.write('Here are the predictions with probabilities:', out)

    st.write('Shutting down Training and Prediction Clusters...')
    h2o.cluster().shutdown()
    st.write("Thank you and Goodbye.")
else:
    pass
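To try it yourself, save the script (e.g. as lead_scoring.py; the filename is up to you) and launch it with Streamlit's CLI:

streamlit run lead_scoring.py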
Complete Python Code
Below is the complete code.
import streamlit as st
import pandas as pd
import time
import h2o
from h2o.automl import H2OAutoML

# Set some variables here (use this area to set up parameters for AutoML and datasets)
seed = 1234
nfolds = 5
stopping_metric = 'AUC'
max_models = 10

uploaded_file = st.file_uploader("Choose a CSV file to train the Lead Scoring model", type="csv")
st.write('Note, you should remove ID values from the training set.')
if uploaded_file is not None:
    data = pd.read_csv(uploaded_file)
    with st.spinner('Wait for it...'):
        time.sleep(5)
    st.success('Done!')
    st.write(data)

target = st.text_input('Select the Target Variable (note: this is case sensitive)', 'Converted')
#ID = st.text_input('Select the ID column (note: this is case sensitive)', 'ID')

uploaded_file = st.file_uploader("Choose a CSV file to make predictions on the model", type="csv")
if uploaded_file is not None:
    load = pd.read_csv(uploaded_file)
    predict = load.sample(5)
    st.write("We'll do a random sample of 5 rows", predict)

if st.button('Kick off Training & Predictions'):
    st.write('Initializing Training Cluster...')
    # Initialize the H2O cluster
    h2o.init()

    st.write('Loading Data into Model...')
    train = h2o.H2OFrame(data)
    train, test = train.split_frame(ratios=[.7])

    # Identify predictors and response
    x = train.columns
    y = target
    x.remove(y)

    # For binary classification, the response must be a factor
    train[y] = train[y].asfactor()
    test[y] = test[y].asfactor()

    # Run AutoML, capped at max_models base models
    with st.spinner('Wait for it...'):
        aml = H2OAutoML(max_models=max_models, seed=seed, nfolds=nfolds,
                        stopping_metric=stopping_metric,
                        exclude_algos=["StackedEnsemble", "DeepLearning", "DRF"])
        aml.train(x=x, y=y, training_frame=train)
        time.sleep(5)
    st.success('Done!')

    # View the AutoML Leaderboard
    lb = aml.leaderboard
    #lb.head(rows=lb.nrows)

    # Get leader performance on the held-out test set
    perf_leader = aml.leader.model_performance(test).auc()
    st.write("The best model's AUC on the test set is:", str(perf_leader))
    perf_f1 = aml.leader.model_performance(test).F1()
    st.write("The best model's F1 (at the max-F1 threshold) is:", str(perf_f1))

    # Grab the 2nd-best model from the leaderboard for variable importance
    m = h2o.get_model(lb[1, "model_id"])
    FI = m.varimp(use_pandas=True)
    st.write("Important Features", FI)

    # Get predictions on the uploaded scoring sample
    st.write('Initializing Prediction Cluster...')
    predict_frame = h2o.H2OFrame(predict)
    preds = aml.predict(predict_frame)
    st.write(preds)

    # Join the predictions back to the scoring rows
    tmp = preds.as_data_frame()
    tmp2 = predict_frame.as_data_frame()
    out = pd.merge(tmp, tmp2, left_index=True, right_index=True)
    st.write('Here are the predictions with probabilities:', out)

    st.write('Shutting down Training and Prediction Clusters...')
    h2o.cluster().shutdown()
    st.write("Thank you and Goodbye.")
else:
    pass
End Notes
I really simplified this, and even if you got an AUC of 0.99 I would be very suspicious. The dataset I experimented with was a lead scoring data set from Kaggle that had two leaking features, and I dropped some columns because they were causing AutoML to crash, so there's a lot of error checking I'd still need to build into the code (a sketch of one such check follows below). On top of all this, I didn't do ANY feature engineering whatsoever, so this model could use work. Plus, I should've added some Matplotlib or Plotly visualizations.
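As an example of the kind of error checking I mean, here's a small hedged sketch (validate_training_data is a name I made up, and it assumes a Streamlit version that provides st.stop) that could run before training kicks off:

# Sketch: basic input validation before training starts.
def validate_training_data(df, target):
    if target not in df.columns:
        st.error("Column '" + target + "' was not found in the uploaded file.")
        st.stop()  # halt the Streamlit script here
    if df[target].nunique() != 2:
        st.error('This POC only handles binary targets.')
        st.stop()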
Still, this was a fun exercise indeed.