Earlier this year, I had the opportunity to meet with administrators and educators from San Jose State University’s College of Sciences to discuss the idea of a Big Data program that would include courses for data analysts, systems administrators, and interdisciplinary teams. This brainstorming session led to the inception of an experimental CS course offered for the first time this fall at SJSU: CS-185C, “Big Data Processing.” The SJSU course catalog describes the course thusly: “This course will have a very practical focus on the techniques and tools for capturing, storing, processing and analyzing big data. Tools such as Hadoop and Splunk will be used in virtual, cloud-based environments. There they will process and analyze, either in batch mode or real time, big data that will range from web log files to twitter and other specialized data. This is a hands-on course in which students are expected to work in teams to complete at least 2 real world projects, which are the sole components of the grade. Guest lectures by current practitioners are expected.”
A few weeks ago, I was able to attend a session of the new class to watch the students’ first round of big data presentations, and it was great! The students were given access to some data sets and hosted computing instances in GoGrid, and were asked to deploy Splunk, get their data into it, and do some analysis, which they then presented to the class. Three of the presentations in particular were so impressive that I asked the student teams if they would be willing to come up to the Splunk offices in San Francisco and give slightly expanded versions of them for our entire staff. They did so yesterday.
We filled our biggest conference room with Splunkers (including several members of our executive team), and dozens more attended remotely. Peter Zadrozny, Splunk performance consultant and instructor for the course, told us a little about the project requirements, and introduced the teams.
The team of Vladimir Serdyukov, Gayathri Vijay, and Dhivya Srinivasan presented first.
They chose to analyze several years of connection and chat data from a custom Neverwinter Nights game server module, revealing some great insights about abuse of DRM in the game and the geographic location of players. They also ran some sentiment analysis on the in-game chat using the free Sentiment Analysis app available on Splunkbase. In the process, they uncovered a bug in the app, which has since been fixed :).
Jason Campos presented his project next, which covered a single day of Twitter data (3/26/2012, to be specific: 360 GB of data, comprising 375 million tweets plus their associated metadata). He began his presentation with the assertion that in addition to doing some basic analysis of hashtags, busiest tweet times, and popular platforms, he had made a “groundbreaking discovery” that he would share with us. Jason showed us a diagram of his Splunk deployment:
and described the process he used to normalize the Twitter data using dynamic lookups to filter or de-duplicate content. For example, he condensed all Twitter clients for a given platform (such as Android, iPhone, and Blackberry) into a single value for the platform. He showed us data that might indicate that Android users are boring:
Lastly, his groundbreaking discovery:
The third and final project was presented by Scott Blake and Mangesh Musale, and covered the last 100 years of worldwide earthquake data. In addition to learning a lot about the complexities of processing data from a variety of sources, Scott and Mangesh dispelled the myth of an “earthquake season”:
Of particular interest was a series of earthquake clusters in Arkansas in the last few years:
…which were difficult to explain…until they did some research and learned that many of them may be the result of natural gas ‘fracking’ in the vicinity.
After the presentations, the student teams spent some time mingling with Splunkers and got a tour of our offices…
…and of course we couldn’t let them go home without some of our infamous T-shirts :).
I can’t tell you how pleased and proud all of us at Splunk are to see such great projects and analysis from these students. All of this work was accomplished, start to finish, within about 4 weeks of a college student’s schedule. I’m very much looking forward to next semester’s presentations!