Over the last year and a half, the phrase “big data” has exploded into public awareness. Simple Google searches on the phrase show it to be a hockey stick in terms of citations (http://blogs.splunk.com/2012/04/12/some-big-data-this-way-come), and to be more popular than “Barack Obama”. What does that mean, if anything, for how we educate the students of today?
Not too long ago, I found myself in a situation where I needed some help. I had a house full of stuff that needed to be boxed up and stored, but I didn’t have the time or energy to attack the tasks of sorting, packaging, labeling, and storing. Thankfully my mother had the time to step in and help out. We went to the local office supply store and bought about one hundred standard file boxes. Once home, she calmly started at one end of each room and filled the boxes one-by-one until the room was empty. Each box was labeled with a date, a unique number and a few key words related to contents, then stacked against a wall of the garage. After a few days, everything was boxed, labeled, stacked, and ready for storage.
When I think about big data, I think about those boxes. Nearly every student of technology has heard of big data, but very few of them know precisely what it is and where it came from. Can anyone predict how many photos on Facebook their closest friend will tag in the next 24 hours, or what search terms they might look for on Google? If these are digital footprints, where do they get stored, and how? Even if you tweet about a friend’s puppy, digital information is still digital, and demands rigorous treatment to be useful, reproducible, and ultimately valuable to consumers and the companies who handle it. Relational databases are the foundation upon which the world of business has built its productivity gains over the last 50 years, but as any database administrator can tell you, they do not like unpredictability of size and structure.
There is one thing we can know about all digital footprints, however: there is always some unique value associated with a specific stream of data. It could be a username or a MAC address or a cell phone number or social security number along with a clickstream or log file or call duration or clinical diagnosis. What we do NOT know is the structure, length, or content of the data that will be associated with these key values.
Back in the early 2000s, Google, Facebook, and Yahoo started looking at ways of taking raw data streams and sticking them into blocks, and labeling each block with an identifier and some metadata about what was contained in the block (much like my mother with those boxes). This technology, called MapReduce (http://en.wikipedia.org/wiki/Mapreduce), allowed data of unpredictable size to be stored and processed in a systematic fashion. MapReduce, coupled with the idea of key-value pairs (http://en.wikipedia.org/wiki/Key-value_pair), where we know the key is unique but have no idea as to the size of the values associated with it, allowed these companies to get a handle, a scalable-enough-to-keep-my-sysadmins-from-quitting handle, on the huge volumes of data that the consumerized internet was generating.
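To make the idea a little more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps over key-value pairs in Python. It is only an illustration: the sample records, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`), and the choice of counting events per user are all assumptions of mine, and a real MapReduce framework would spread this work across many machines.

```python
from collections import defaultdict

# Hypothetical sample records: each is a (key, value) pair where the key is a
# username and the value is a raw event string of unpredictable length.
RECORDS = [
    ("alice", "GET /index.html"),
    ("bob", "GET /puppy.jpg"),
    ("alice", "POST /comment body=what+a+cute+puppy"),
    ("alice", "GET /logout"),
    ("bob", "GET /index.html"),
]


def map_phase(record):
    """Emit one intermediate (key, 1) pair for each input record."""
    key, _value = record
    yield (key, 1)


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return (key, sum(values))


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Number of events seen per user, regardless of how long each event was.
event_counts = run_job(RECORDS)
print(event_counts)
```

Notice that nothing in the sketch cares how long or how oddly shaped each value is. Only the key has to be dependable, which is the same bargain my mother made with the labels on those boxes.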
As compelling as this human-generated big data is, the future of big data is much more about what machines are generating behind the scenes than what people are tweeting about on Twitter. In the age of everything-in-the-cloud and bring-your-own-device, the multiple technology layers needed to provide seamless end-user experiences are generating machine data several orders of magnitude faster than you can tweet about your puppy. Thankfully, the same technology can be used to get valuable operational intelligence from this multi-layered machine data, which is essentially unstructured and unpredictable textual data with key-value-pair characteristics.
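As a rough illustration of what "key-value-pair characteristics" can look like in machine data, the Python sketch below pulls `name=value` fields out of a few log lines. The log lines, the field names, and the `extract_pairs` helper are all made up for this example; real machine data comes in many other shapes, and this is not how any particular product does it.

```python
import re

# Hypothetical log lines: a timestamp followed by whatever fields that layer
# of the stack happened to write. No two lines share the same structure.
LOG_LINES = [
    "2012-11-28T10:15:02 host=web01 status=200 bytes=5120 uri=/index.html",
    "2012-11-28T10:15:03 host=web02 status=500 error=timeout",
    "2012-11-28T10:15:04 device=aa:bb:cc:dd:ee:ff action=login",
]

# A word, an equals sign, then a run of non-space characters.
PAIR = re.compile(r"(\w+)=(\S+)")


def extract_pairs(line):
    """Split off the leading timestamp and collect every name=value field."""
    timestamp, _, rest = line.partition(" ")
    fields = dict(PAIR.findall(rest))
    fields["_time"] = timestamp  # assumed name for the timestamp field
    return fields


events = [extract_pairs(line) for line in LOG_LINES]
for event in events:
    print(event)
```

Each line yields a different set of keys, and nothing had to be declared ahead of time. That is the property that makes this kind of data awkward for a fixed relational schema and a natural fit for a key-value approach.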
But what about those students we mentioned in the opening paragraph? What are they learning TODAY about what big data is, where it came from, and where it can take us? How many undergraduate students know that MapReduce and key-value pairs can be used to build scalable, useful non-relational data stores to solve core business problems for the companies of today and tomorrow? Not enough, in my opinion. As a technology community we need to look for ways to get today’s students exposed to these ideas BEFORE they graduate so that they can add immediate value in their respective post-graduation jobs, as this class at San Jose State University is doing (http://spartandaily.com/87322/new-class-specializes-in-wrangling-big-data and http://blogs.splunk.com/2012/11/28/big-data-students-present-their-splunk-projects-at-hq). A recent analyst report cites severe future shortages of data scientists in the 100k to 200k range, and shortages of data-savvy managers in the 1m to 2m range. I tend to agree, which means it is time for me to stop writing and get back out there, talking to anyone who will listen about my mom and those boxes.