Is DataRobot in Trouble?

The news coming out of DataRobot lately has given me pause. Is a power struggle happening? Are sales flagging? I’m not sure what it is, but there are certainly a lot of interesting LinkedIn posts and Glassdoor reviews out there.

Before we dive in, here’s my big disclaimer: I work for H2O.ai, and this post is my personal observation of what’s going on in an industry that I’m a part of. My views don’t reflect those of my employer or any other employees there.

The big news is that DataRobot laid off 7% of its staff after a very aggressive hiring spree.

…artificial intelligence company DataRobot cut more than 7,500 jobs globally from April 1 to May 16, according to employment tracker Layoffs.fyi. [via Bloomberg]

It gets better: there have been board departures, and former employees are speaking out.

Former Board member Christopher Lynch posts:

Shareholder Jacob S laments the recent layoffs, which allegedly happened right after the Sales Kick-off and the President’s Club celebration. Talk about bad optics! That’s right up there with “let them eat cake.”

I did search for timestamped photos, but I couldn’t find any at this time. My guess? They’ve all been scrubbed.

Of course, the Glassdoor reviews are all over the place, from praise to downright scathing.

That’s to be expected as the firm tries to manage the message, but I offer snapshots of two negative reviews: one from before the layoffs and one from during them (if the April 1 to May 16 layoff dates are correct).

I get it: selling AI software is hard. I’m in sales, just like the reviewer above. Selling a complex product that is only understood by highly trained people, even if you abstract a lot of it away, is so damn hard.

To sell a complex product effectively, you need a very focused organization, clear communication, and a stellar go-to-market (GTM) strategy.

It sounds like there are organizational and communication problems inside DataRobot, if you believe the review below.

Growing pains

There are a lot of good things about the DataRobot product from what I’ve seen, especially the UI, but after raising so much money and acquiring a bunch of other companies, you would think they’d have their “stuff” together.

For the first time that I can remember, we get to peer behind the curtain of a secretive organization and see what’s happening.

What do I see? I see massive growing pains at DataRobot. I see the pain of being under pressure to perform. I see a flat startup organizational structure being forced into a traditional hierarchy.

I want to wish them luck. They were the first company to create an AutoML product, and they built a unicorn! But unicorns are mythical creatures, money talks, and that’s all that matters in this market.

How StockTwits Uses Machine Learning

Fascinating behind-the-scenes interview with StockTwits Senior Data Scientist Garrett Hoffman.


He shares great tidbits on how StockTwits uses machine learning for sentiment analysis. I’ve summarized the highlights below:

  • Idea generation is a huge barrier for active trading
  • The next generation of traders uses social media to make decisions
  • Garrett solves data problems and builds features for the StockTwits platform
  • This includes production data science, product analytics, and insights research
  • Understanding social dynamics makes for a better user experience
  • His focus is on understanding the social dynamics of the StockTwits (ST) community
  • ST’s market sentiment model helps users with decision making
  • Users ‘tag’ content as bullish or bearish
  • Only 20 to 30% of content is tagged
  • Using ST’s market sentiment model increases coverage to 100% (see the sketch after this list)
  • For data science work, a Python stack is used
  • They use NumPy, SciPy, Pandas, and Scikit-learn
  • Jupyter Notebooks for research and prototyping
  • Flask for API deployment
  • For deep learning, TensorFlow on AWS EC2 instances
  • GPUs can be spun up as needed
  • Deep learning methods used are recurrent neural nets, Word2Vec, and autoencoders
  • He stays abreast of new machine learning techniques through blogs, conferences, and Twitter
  • He follows Twitter accounts from Google, Spotify, Apple, and small tech companies
  • One area ST wants to improve on is DevOps around data science
  • Bridging the gap between the research/prototype phase and embedding models into the tech stack for deployment
  • There’s a misconception that complex solutions are best
  • Complexity is only OK if it leads to deeper insight
  • Simple solutions are best
  • Future long-term idea: use AI around natural language
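
To make that workflow concrete, here’s a minimal sketch of the “extend sparse user tags to full coverage” idea, using the same Python stack Garrett mentions (Pandas, Scikit-learn). To be clear, this is my own illustration with made-up messages, not StockTwits’ actual model.

    # Hypothetical sketch: train a bullish/bearish classifier on the ~20-30% of
    # messages that users tagged, then score the untagged rest for 100% coverage.
    # This is NOT StockTwits' actual model -- just an illustration of the idea.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in for user-tagged messages.
    tagged = pd.DataFrame({
        "body": ["$AAPL breaking out, loading up", "$TSLA looks toppy, selling",
                 "strong earnings, going long", "this chart screams short"],
        "sentiment": ["bullish", "bearish", "bullish", "bearish"],
    })
    untagged = pd.DataFrame({"body": ["momentum building, adding shares"]})

    # A simple TF-IDF + logistic regression baseline.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(tagged["body"], tagged["sentiment"])

    # Score untagged messages so every message carries a sentiment label.
    untagged["sentiment"] = model.predict(untagged["body"])
    print(untagged)

For deployment, the interview mentions Flask; the usual pattern would be to wrap model.predict in a small Flask route and serve it as a JSON API.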

Data Science and Machine Learning Roundup

First off is Julia for data engineering (Medium link). Author Logan Kilpatrick writes about using the Julia packages DataFrames.jl and CSV.jl to do some basic data engineering. He doesn’t stop there: he also shows you how to work with databases in Julia and shares a lot of links to videos and additional information.

Meta AI, formerly known as Facebook AI, has released a new open-source package called data2vec. Data2vec operates in the self-supervised area of machine learning. Self-supervised learning is explained as:

Self-supervision enables computers to learn about the world just by observing it and then figuring out the structure of images, speech, or text. Having machines that don’t need to be explicitly taught to classify images or understand spoken language is simply much more scalable. – via Meta AI.

Are you interested in running machine learning on small and low-powered chips? How about doing deep learning on tiny devices? That’s the goal of TinyML.

TinyML seeks to bring the power of deep learning to small microcontrollers and chips that are cheap to produce, use little power, and can run on a battery. I really like this idea since I have several Raspberry Pi computers and a Jetson Nano.
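
For a taste of the workflow, here’s a minimal sketch of the usual first step in a TinyML project: shrinking a trained Keras model with TensorFlow Lite’s converter and default quantization. The tiny model here is a throwaway placeholder for your own network, and actually running the .tflite file on a microcontroller (e.g. with TensorFlow Lite for Microcontrollers) is a separate step.

    # Sketch: convert a Keras model to a compact TensorFlow Lite flatbuffer,
    # the usual first step toward running it on a small, low-power device.
    # The model below is a throwaway placeholder, not a real TinyML workload.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    # Default optimizations apply weight quantization to shrink the model.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)
    print(f"TFLite model size: {len(tflite_model)} bytes")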

This is a treasure trove of data science cheat sheets. Bookmark this Kaggle page; you will thank me for it.

In other news, it looks like Alteryx is trying to stay relevant by snapping up “holy shit, are they still around?” Big Data profiling company Trifacta.

Last but not least, how bad was Zillow’s Zestimate? It looks like it was way off the mark, and the company is shedding more than 20% of its workforce. OUCH! They incurred a $420 million loss. More OUCH!

Isolation Forests in H2O.ai

A new feature has been added to open-source H2O-3: isolation forests. I’ve always been a fan of understanding outliers and love using One-Class SVMs as a method, but isolation forests appear to be better at finding outliers in most cases.

From the H2O.ai blog:

There are multiple approaches to an unsupervised anomaly detection problem that try to exploit the differences between the properties of common and unique observations. The idea behind the Isolation Forest is as follows.

We start by building multiple decision trees such that the trees isolate the observations in their leaves. Ideally, each leaf of the tree isolates exactly one observation from your data set. The trees are being split randomly. We assume that if one observation is similar to others in our data set, it will take more random splits to perfectly isolate this observation, as opposed to isolating an outlier.

For an outlier that has some feature values significantly different from the other observations, randomly finding the split isolating it should not be too hard. As we build multiple isolation trees, hence the isolation forest, for each observation we can calculate the average number of splits across all the trees that isolate the observation. The average number of splits is then used as a score, where the less splits the observation needs, the more likely it is to be anomalous.
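
If you want to try the new feature, here’s a minimal sketch using H2O-3’s Python API. The random data is a stand-in for your own H2OFrame, and the parameter values are just illustrative, so check the H2O-3 docs for tuning guidance.

    # Sketch: anomaly scoring with H2O-3's isolation forest.
    # The random data below is a placeholder -- substitute your own frame.
    import h2o
    from h2o.estimators import H2OIsolationForestEstimator
    import numpy as np
    import pandas as pd

    h2o.init()

    # Mostly "normal" points plus a few obvious outliers.
    rng = np.random.default_rng(42)
    points = np.vstack([rng.normal(0, 1, size=(500, 2)),
                        rng.uniform(-8, 8, size=(10, 2))])
    frame = h2o.H2OFrame(pd.DataFrame(points, columns=["x1", "x2"]))

    # Train the isolation forest; no labels needed (it's unsupervised).
    iso = H2OIsolationForestEstimator(ntrees=100, sample_size=256, seed=1234)
    iso.train(training_frame=frame)

    # 'predict' holds a normalized anomaly score and 'mean_length' the average
    # number of splits needed to isolate each row (fewer = more anomalous).
    print(iso.predict(frame).head())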

While there are other methods of outlier detection, like LOF (local outlier factor), it appears that isolation forests tend to be better than One-Class SVMs at finding outliers.

See this handy image from the Scikit-Learn site:

Anomaly Detection Comparison
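
If you’d like to poke at that comparison yourself rather than trust the picture, here’s a rough sketch pitting Scikit-learn’s IsolationForest against a One-Class SVM on toy data. It’s loosely inspired by the Scikit-learn comparison, not a reproduction of it.

    # Rough sketch: compare IsolationForest and One-Class SVM on toy 2-D data.
    # Loosely inspired by Scikit-learn's anomaly detection comparison page.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    inliers = rng.normal(0, 0.5, size=(300, 2))
    outliers = rng.uniform(-4, 4, size=(15, 2))
    X = np.vstack([inliers, outliers])

    for name, detector in [
        ("IsolationForest", IsolationForest(contamination=0.05, random_state=0)),
        ("One-Class SVM", OneClassSVM(nu=0.05, gamma="auto")),
    ]:
        # Both return +1 for inliers and -1 for predicted outliers.
        labels = detector.fit(X).predict(X)
        flagged = np.sum(labels == -1)
        print(f"{name}: flagged {flagged} of {len(X)} points as outliers")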

Interesting indeed. I plan on using this new feature on some work I’m doing for customers.

Shapley Values – A Gentle Introduction | H2O.ai

Shapley was studying cooperative game theory when he created this tool. However, it is easy to transfer it to the realm of machine learning. We simply treat a model’s prediction as the ‘surplus’ and each feature as a ‘farmer in the collective.’ The Shapley value tells us how much impact each element has on the prediction, or (more precisely) how much each feature moves the prediction away from the average prediction.

Shapley Values – A Gentle Introduction | H2O.ai
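
The linked article covers the intuition; as a hands-on companion, here’s a minimal sketch using the open-source shap package with a plain Scikit-learn model. The data and model are my own toy choices, not from the article.

    # Sketch: per-feature Shapley values for a toy model via the open-source
    # 'shap' package. Data and model are illustrative placeholders.
    import shap
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=200, n_features=4, random_state=0)
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    # TreeExplainer computes exact Shapley values for tree ensembles.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # For each row: prediction = expected_value + sum(shap_values[row]),
    # i.e. each feature's push away from the average prediction.
    print("average prediction:", explainer.expected_value)
    print("feature contributions for row 0:", shap_values[0])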

Labeling Training Data Correctly

When you’re dealing with a classification problem in machine learning, good labeled data is crucial. The more time you spend labeling training data correctly, the better, because your model’s performance in production will depend on it. Always remember: garbage in, garbage out.
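
To put a number on “garbage in, garbage out,” here’s a small toy experiment of my own (not from any source mentioned here): train the same model on clean labels and on labels with 30% of them flipped, then compare test accuracy.

    # Toy demo: label noise in training data directly hurts model performance.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # Corrupt 30% of the training labels to simulate sloppy labeling.
    rng = np.random.default_rng(0)
    noisy = y_tr.copy()
    flip = rng.choice(len(noisy), size=int(0.3 * len(noisy)), replace=False)
    noisy[flip] = 1 - noisy[flip]

    for name, labels in [("clean labels", y_tr), ("30% flipped labels", noisy)]:
        acc = LogisticRegression(max_iter=1000).fit(X_tr, labels).score(X_te, y_te)
        print(f"{name}: test accuracy = {acc:.3f}")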

Thoughts on labeling data

I recently listened to a great O’Reilly podcast on this subject. They interviewed Lukas Biewald, Chief Data Scientist and founder of CrowdFlower. CrowdFlower provides its clients with top-notch labeled training data for various machine learning tasks, and they’re busy!
 
A few bits caught my ear: how much of their training data is now used for deep learning, and that they’re seeing more labeled image data for self-driving cars.
 
The best part of the interview was Lukas’s discussion of using a Raspberry Pi with TensorFlow! How cool is that?

The Podcast

My Chinese Big Brother

 

One of my Asian friends recently posted a link to a terrifying use of machine learning. This is what I call the “dark side” of this field: the use of machine learning by a government to make you behave a certain way.

1984’s Big Brother

China is building its own version of 1984’s Big Brother: a massive scoring system that’s probably a large-scale classification algorithm, most likely sitting on top of a Big Data stack like Hadoop. I call this an utter and complete abuse of machine learning.

The government hasn’t announced exactly how the plan will work — for example, how scores will be compiled and different qualities weighted against one another. But the idea is that good behavior will be rewarded and bad behavior punished, with the Communist Party acting as the ultimate judge.

Yes, we use classification and scoring systems in everyday commerce. Banks use them to grade your creditworthiness. Businesses use them to determine your propensity to buy their products. This project goes beyond all that.

Social Trust

China wants to know if you say or do anything that breaks “social trust.”

They will scrape whatever online sources are available and assign you a score. This score will determine your trustworthiness. How they define trustworthiness remains the question.

This can be manipulated and gamed. Questionable data quality can lead to misclassification and mistakes, and Chinese people will suffer those consequences.

IMHO, China’s drive for a harmonious society with an “Asian Big Brother” will only hasten its demise.
