A couple of weeks ago, we talked about the need to invest appropriately in people when you invest in technology. I wanted to continue the discussion and focus on the new area of “Big Data” – more specifically, the analysts who work on big data: the “Data Scientist” and the data analyst.
I love the term “data scientist”. It has finally made the data junkie’s job title more glamorous, and it has given both name and fame to the role. Everyone is talking about “big data”, and many organizations think that hiring a data scientist is a requirement for solving every “big data” problem, and that data scientists are the only analysts a big data problem requires. If you have invested in big data (Hadoop, Splunk etc.), do you need a data scientist? My goal for this post is to dive in a bit deeper and help you understand the question and make the right choices.
I have been a web analytics practitioner for a number of years and have made the journey from analyst to managing analytics teams at companies like eBay, so I would like to share my experiences and hope you will benefit from this discussion. This post aims to address all types of datasets – small, big, huge. I am going to focus on online businesses, as I understand them better than the other areas where big data is used:
1) Online platform companies: Online platform companies thrive on great products. These products mostly involve building compelling interfaces that are largely enabled by data. The “apps” or modules within the sites use mathematical models or algorithms to drive user engagement or stickiness for those modules/apps.
2) Online channel business: eCommerce or content sites rely on a deep understanding of data to drive user engagement and product optimization, ultimately driving higher conversion on the site and higher revenues from the online channel. Optimizing user acquisition and retention is also a very important goal for these organizations.
Successful organizations strive for the ability to embed data in their products and decision-making processes, and to drive optimization across their online properties.
Before we get started, let’s define a data scientist. A simple explanation comes from DJ Patil, who co-coined the term:
“A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data.”
A more comprehensive definition comes from Jake Porway of Data Without Borders and the New York Times:
“A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. S/he combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds.”
Many of the conversations on social media sites, and many job descriptions, lean towards the understanding that a data scientist is a good analyst who is not afraid to deal with data, brings new perspectives, and combines analytics with statistics to build algorithms and data products.
So do you need a “data scientist” for every “big data” problem? Not really. The algorithm, data mining or advanced statistical modeling pieces represent 10-15% of all analytics needs within the organization. There are many other important kinds of analytics (product optimization, site testing, user experience optimization, measuring online channel performance) that most organizations need to focus on. These types of analysis rarely require algorithm development or advanced statistical skills. For the most part, data scientists work on future products, while data or web analysts work on the current product, measuring the effectiveness of the site or the user in real time and correlating it with various data sources to optimize the business.
From a skill set standpoint, a data scientist needs strong data skills, strong analysis skills, a strong knowledge of statistics and the ability to program algorithms. A data/business/web analyst, on the other hand, is not expected to have the programming skills to build algorithms, but needs strong SQL skills in addition to a good understanding of analytics packages. Both need to be passionate about data and have a high level of curiosity, often questioning the data to derive new insights from it... nothing short of a Data Ninja! Lastly, every analyst needs to be able to tell and sell their “story” from the insights.
A good approach to your “big data” analytics staffing plan is an 80/20 rule: staff 80% of your resources as data/business/web analysts and 20% as data scientists. You will also be able to create a career path for your best data analysts to become data scientists. As an organization, your best bet is to provide tools and technology that reduce the effort spent on data movement, manipulation and acquisition. This will allow the data scientists to focus on the value-added analysis that can move the needle for the business.
I will leave you with a simple analysis of available jobs in the US in the big data or analytics space. Clearly, jobs for “big data” and Hadoop dominate the space. Data Scientist roles are few (right now), but their number will increase over time. The chart below is based on available US jobs posted on LinkedIn.
I hope this post has provided some ideas on how to approach the human side of analytics for big data problems. Did I mention that this and other interesting discussions will happen at Splunk’s 2012 User Conference? Come and enjoy the data journey!
PS: A good in-depth read on building a data scientist team is here.