Guide to Getting Started in Data Science
This is the foreword to an updated ultimate guide on getting started in data science. I wanted to write a set of ‘getting started’ posts to share with readers how I became a data scientist at RapidMiner, and how I went from a Civil Engineer with an MBA to working for an amazing startup. Granted, I’m not a classically trained Data Scientist, and I hardly knew how to code in the beginning, but with the right tools and attitude, you can ‘huff’ your way into this field. Will you be a Data Scientist after reading this series of posts? Of course not, but you’ll have a framework to move forward or, at the least, a better understanding of what we do.
Lately I’ve been answering questions on Quora. People have been asking how to become a data scientist, the fast and easy way. There is NO FAST or EASY WAY to become a Data Scientist. It requires strong critical reasoning skills, math, curiosity, and the ability to code. That’s it in a nutshell. If you have the first three, you can use a platform like Orange, RapidMiner, or KNIME to skip the coding part BUT you should (must) learn to code.
Introduction
My journey into data science starts with my engineering degree. It taught me the basics of statistics, math, critical thinking, and even how to program in Fortran. I promptly forgot how to code in Fortran, which it turns out was a mistake on my part. I worked as a civil engineer for close to 6 years before I decided to get an MBA with a specialization in technology. It was in MBA school that I took a course titled “Data Mining for Managers” taught by Dr. Stephan Kudyba. Dr. Kudyba turned me on to a passion that I didn’t know existed. The very thought of ‘mining’ data for statistical relationships got me so excited that I ended up starting this website/blog back in 2007. He was the match to this fire.
It was after his class that I found YALE, the initial alpha version of what became RapidMiner. Right off the bat I could tell that it was feature rich but incredibly hard to understand or use. You had to be a PhD to figure it out, which was true since the Founders (Ingo & Ralf) were PhD students at the time. In my Data Mining for Managers class, we never talked about Cross Validation. We never created a confusion matrix or calculated precision/recall. We just talked about ETL and data preparation, modeling with a Neural Net, and consuming the results. So, I had work to do if I wanted to use this tool.
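For readers who are new to those terms, here is a minimal sketch in plain Python of what a confusion matrix, precision/recall, and a cross-validation split boil down to. The labels and function names are made up for illustration, and real tools handle many more cases (multi-class problems, shuffling, stratification).

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists; each row lands in exactly one test fold."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


# Made-up true labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
```

Cross validation then means training on each `train` list and scoring on the matching `test` list, so every row is used for testing exactly once.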
Education
You will need some sort of advanced degree as the foundation for critical thinking, so a college degree is a must. However, this doesn’t mean that you need to come from a STEM-related background to become a data scientist (though it is helpful). You just need to be able to understand math and the methods of scientific analysis.
You can always take a Udemy or Codecademy course to supplement your education if you want to learn Python or R, a needed skill (see below).
Coding vs No Coding
Even though it was hard, I chose YALE/RapidMiner because I didn’t need to code. I didn’t have time to teach myself a programming language, which, as I now reflect, was probably a mistake on my part. I had the chance to take some Java classes back then but decided not to. If I had to do it all over again, I would choose either Java or Python and learn it from the very beginning. Java if I wanted to really build out RapidMiner, and Python because it’s fast to prototype and easy to work with.
In fact, all Data Scientists need to code these days. You can’t escape that fact, and you will be judged on this ability when you apply for a Data Scientist position. You will be given a problem or a project to work on and then asked to present your results and code. Your future employment depends on how well you do.
So, what do Data Scientists code in? This will change from year to year as some new hot language comes out. I suggest you check out the yearly KDnuggets poll on what data scientists are using. I do offer some suggestions below, and my recommendation is to pick two but become really proficient in one.
Java
Java is a statically typed language. That means you have to explicitly declare your variables, so it takes more time to write your program. The data science benefit is that platforms like RapidMiner and KNIME run on it, so they’re platform independent. H2O.ai also lets you export its models as a POJO (Plain Old Java Object) file so you can quickly put them into production. Then there’s WEKA, another Java-based data science platform. The upside to Java is that it’s very mature and has a ton of libraries to use. Note: you don’t need to write Java to use H2O.ai; you work with it through your browser.
Another added benefit is that if you’re working with Hadoop, Java works well too. Of course, every Hadoop distribution will be different, but generally it supports Java. If Hadoop and Big Data interest you, then also look into Scala. Scala is very similar to Java.
R
I’ll start with a disclaimer: I’ve used R and it has some great packages, but I find it clunky. This is my personal bias, and I’ve worked with people who absolutely love R. It’s very feature-rich open source software that lets you do all aspects of machine learning, with some of the best graphics libraries out there. A lot of universities teach data science related courses in R and I completely understand why. It’s not as heavy to code as Java and it is a bit easier than Python in my opinion, but you have to know the syntax. It’s a bit harder to put into production, but you can use it on Hadoop via SparkR. You can download it right away and get started with the thousands of video tutorials out there.
If you’re going to work with R, I suggest downloading RStudio. It’s a very nice workbench that lets you write R scripts, load data, and display charts right in one neatly organized place.
Python
I like Python a lot because it fits an engineering mindset. Programming is relatively fast and everything is considered ‘dynamic.’ This flexibility, unfortunately, makes it slower. There are so many great open source libraries out there for Python that it’s becoming the de facto programming language for data scientists. There’s scikit-learn, NumPy, Keras, TensorFlow, etc.
It can be productionalized with pickle files and exposed as a REST API via some sort of framework like Flask, but it’s a bit trickier. Still, you can rapidly prototype data science projects with it, and if you get stuck, there are a ton of communities to help you. Just visit the Python questions on Stack Overflow.
I use Python extensively for mundane and routine tasks and occasionally do some data science with it.
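As a rough sketch of what the pickle route looks like, the example below saves a stand-in "model" to a pickle file, loads it back, and wraps it in a function that turns a JSON request into a JSON response. Everything here is an assumption for illustration: the model is just a dictionary of made-up weights, and `handle_predict` is a hypothetical name for the body of what would be a POST route in a framework like Flask. The standard library is used so the sketch runs without installing anything.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical "trained" model: just the learned parameters of a linear scorer.
trained_model = {"weights": [2, -1], "bias": 1}


def predict(model, rows):
    """Score each row as bias + sum(weight * feature)."""
    return [
        model["bias"] + sum(w * x for w, x in zip(model["weights"], row))
        for row in rows
    ]


def handle_predict(model, request_body):
    """What a prediction route would do: parse JSON, score, return JSON."""
    rows = json.loads(request_body)["rows"]
    return json.dumps({"scores": predict(model, rows)})


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    # Training side: save the model to a pickle file.
    model_path.write_bytes(pickle.dumps(trained_model))
    # Serving side: load it back once at startup.
    served_model = pickle.loads(model_path.read_bytes())

response = handle_predict(served_model, '{"rows": [[1, 2], [3, 0]]}')
```

One reason this is "a bit trickier" in practice: unpickling runs code from the file, so you should only load pickle files you created or trust.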
Julia
I love what Julia can become. It’s a great programming language that reminds me a lot of Python and R BUT it has speed. It has a Just In Time (JIT) compiler that makes it leaps and bounds faster than Python and was designed from the ground up to be parallelized and offloaded to the ‘cloud.’ The negative for now is that it doesn’t have the depth and breadth of libraries that Python has but it’s growing.
I like that it can be integrated with a Jupyter Notebook, which makes things a lot easier to code in.
Deep Learning Libraries
Right now there are so many competing Deep Learning libraries out there that it’s hard to choose one. I personally like TensorFlow and Keras (Keras being a wrapper of three DL libraries) but Keras seems to be the dominant one for today.
Here’s a bit more advice from me on the subject of getting started in Data Science.
Pick Two, Master One
Pick two computer languages and become proficient in one and a master at the other one. Or, pick a platform like H2O-Flow or RapidMiner and a language. Become a master at one but proficient in the other. This way you can set yourself apart from other students or applicants.
The reality is that you will be flipping back and forth between languages in your day-to-day work life. You could be writing a Python script to connect to a database, then pulling in some data, and then writing a D3.js wrapper to make a dashboard. It all depends on what you end up doing on a daily basis. It’s never a dull day as a Data Scientist, that I can assure you.
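To give a feel for the Python half of that kind of day, here is a small sketch that connects to a database, pulls in some aggregated data, and shapes it as JSON that a D3.js chart could read. The table, column names, and numbers are invented, and an in-memory SQLite database stands in for whatever database you would actually connect to.

```python
import json
import sqlite3

# In-memory database with a made-up table, standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# Pull in aggregated data with plain SQL.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# Shape the result as a list of JSON records for the dashboard side.
dashboard_json = json.dumps([{"region": r, "total": t} for r, t in rows])
```

The JavaScript side would then load `dashboard_json` and bind each record to a bar or a point on the chart.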
Social Equity & Networking
I spoke about this in my video: you should get involved socially. Join meetups, go to conferences, and then contribute. Did you do a cool project or solve an interesting problem? Ask to speak about it at a meetup. Public speaking does two things for you: it builds your brand, and it helps you get over the fear of speaking.
I used to ‘pooh pooh’ people with communication skills. I used to think all they do is talk and produce nothing. Boy was I wrong. Communicating is as important as solving whatever problem you’re working on.
Another way is to join a club or meetup. This is a great low-stress way to get out and listen to some interesting speakers in the field. There are tons of meetups happening all the time, and all you need to do is go to meetup.com and do a search in your area.
If you saw someone give an interesting talk at a meetup, go up to him or her and tell them you enjoyed their talk. Then ask for a business card or ask if there are any opportunities at their company. Don’t be an annoying nudge and email them every day asking about opportunities. Instead, check in with them every quarter by sending a nice email with an interesting article you read.
Create Something
The next way is to create something. In my past article, I wrote about how the Makers have a drive to create. As we say at H2O.ai, Makers Gonna Make. So Make something!
Write a new library for Python or R. Create new RapidMiner processes. Then share them with the world. Share them on GitHub, share them on a blog, or share them on Medium. It doesn’t matter where, but design/build/code something and release it into the wild. Then cultivate its growth.
Become that guy or gal whose software is being used at Google (but can’t get a job there. Sheesh!)
Make and then Share!
Start a Business
This idea is the hardest but the most rewarding. Become an entrepreneur by starting a business. It doesn’t have to be big. Look at what Ugly from Uglychart is doing. He’s domain flipping and making $125,000 per month. The best part? He’s the only employee and doesn’t want to get big.
Or, you could be like the founders of RapidMiner. Build a Data Science platform back in 2007, then build a Startup around it! The founders of Instagram designed an app and photo platform for the iPhone and sold it to Facebook. Of course they left Facebook but I’m sure they’re going to be sought after by Venture Capitalists.
The hard part with this suggestion is figuring out what kind of business to start. Are you going to be a consultant or are you going to build a product? Then how are you going to sell it? (Beware the Freemium Devil.)
In the end, it doesn’t matter which route you choose. The most important aspect is to remain involved with a Data Science community. Read up on the latest advances, write code, build things, talk to people, and build your personal brand.
Stay Hungry
The field of Data Science is always changing, sometimes on a month to month basis. Staying on top of the latest changes in the field is critical if you want to keep your skills sharp. New algorithms or techniques are found to help build better models. Businesses want to use them to squeeze out more performance and you’ll need to be ready to assist. It is important to be active in your field and keep learning new things.
Check Out My Presentation on It!
I recently gave a presentation on How to Be Successful in Your new Data Science Role on BrightTalk. I even uploaded my slides and sample Python Notebook to my GitHub Account. Check it out!
End Notes
I know that many people are attracted to Data Science because it’s one of the hottest in-demand professions and it pays a lot of money. Don’t be fooled by the money; it’s hard work. Your journey in Data Science will solely depend on how far YOU want to go and HOW MUCH effort you’re willing to put into it. It’s not an easy path but it’s very rewarding, not just monetarily but career-wise too.