L’Auberge Del Mar was the dazzling oceanside venue for SplunkLive San Diego, April 2010. The full house of attendees included customers and users from the Federal Government, innovative startups and Fortune 500 companies. Hats off to our logistics team for selecting such a stunning venue and to our two customer speakers for equally stunning presentations:
Robert Roth, Senior Director, Service Engineering Team
Sony Network Entertainment
Robert is the Senior Director of Service Engineering at Sony Network Entertainment, the group within Sony responsible for the global delivery infrastructure of all their digital and online products – movies, music, PlayStation network games, etc. Robert is responsible for the infrastructure, ensuring the availability, performance, scalability and quality assurance of the customer experience and also ensuring the architecture keeps pace with the business. Robert is a veteran in the online consumer space and first deployed Splunk in his previous role at Intuit where he ran the Central Engineering group responsible for SaaS and e-commerce platforms for QuickBooks and TurboTax.
Billions of Events, 400% Growth!
The Sony online business currently has 40M active users globally, with a 400% increase in annual traffic. Their online services include games, movies, music and other content, each of which requires the ability to authorize users and in some cases support micro-transactions. For example, within an online gaming experience, users are allowed to purchase items for use in that game.
With nothing but growth in sight, they wanted a solution that would deliver the operational information required across billions of events and cope with their data volumes today and tomorrow. According to Robert, “Splunk was the best solution we saw to look at all the disparate data and make sense of all of it from one place”.
Running a Global Infrastructure
Sony Network Entertainment’s global infrastructure spans multiple geographical datacenters in the US, Japan, UK, Australia and other divisions across the globe. The server environment is all Linux, with thousands of servers supporting a growing catalog of content and device ‘types’ using the platform. Robert stated that their vision is to have, “any Sony device be able to connect to the global platform and allow users to upload content as well as consume it.” Their global infrastructure is organized around multiple parallel production lines. Each ‘line’ supports the ability to connect 3rd party content providers and QA services, validate integration to the store and finally production. This is illustrated in the following slide:
Correlating Across Billions of Events
Robert and his team use Splunk to correlate transactions and events across multiple disparate data sources. Robert talked through a few examples of using Splunk:
Transaction tracking: the traditional homegrown tools and Linux commands like ‘grep’ were ineffective. With purchases going through 20 different layers of infrastructure and multiple partner interactions, they use Splunk to slice through all their infrastructure data and provide rapid visibility of all activity across all layers of the infrastructure for a specific transaction or IP address. Robert said, “Splunk monitors our monitoring systems and alerts when something is up, closing any gaps that previously existed.” With Splunk, Robert’s team can correlate recurring issues across regions, performing real-time triage and device-centric analysis. All of this helps them find issues before they impact customers and services, speed up customer service response and make architecture changes without impacting operations.
Capacity Planning: with 400% traffic growth year on year and an expanding number of datacenters, Robert’s team needs visibility into where their global platform is most heavily used and therefore where they need to grow. They use role-based access controls in Splunk to give less technical users secure visibility into how the systems are performing, so that those users can carry out the necessary capacity planning analysis.
Quality Assurance: with millions of unique visitors per day across four primary global ‘regions’, Robert’s team has to deliver a true 24/7 platform and meet the high levels of customer experience expectations that exist today, with a user base used to an always-on, broadband, interactive, media-rich user experience. Sony does a lot of work with content delivery networks and needs to ensure they are functioning correctly and at the right levels of quality, as well as delivering content successfully to users. “Hundreds of orders come in per minute, and network enabled content must ‘talk’ to our servers to function – applications must work.” Robert’s QA team also “leverages Splunk to validate correct testing behavior and to ensure each transaction is flowing everywhere it needs to flow – in a secure way.”
Customer Behavior Analysis: Robert stated that, “Splunk helps us mine data more effectively and provides us with a better understanding of purchase decision behaviors. It’s used for analysis for recommendations based on customer behavior and also as a value-add for partners.” Partners and developers are a critical part of bringing fresh new content and services, and leveraging Splunk gives partner-developers the performance data critical for their success.
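To make the transaction tracking example above a little more concrete, here is a minimal Python sketch of the underlying idea: pull events from several layers into one place, then group them by a shared transaction ID (or IP address) and order them by time. This is not Sony's implementation and not how Splunk works internally. The log format, the field names (`ts`, `layer`, `txn`, `ip`, `status`) and the sample events are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical log lines from three infrastructure layers; the field names
# (ts, layer, txn, ip, status) are assumptions made for illustration only.
SAMPLE_EVENTS = [
    "ts=2010-04-01T10:00:03 layer=payment txn=A17 ip=10.0.0.5 status=ok",
    "ts=2010-04-01T10:00:01 layer=web txn=A17 ip=10.0.0.5 status=ok",
    "ts=2010-04-01T10:00:02 layer=auth txn=B42 ip=10.0.0.9 status=denied",
    "ts=2010-04-01T10:00:02 layer=auth txn=A17 ip=10.0.0.5 status=ok",
]


def parse_event(line):
    """Split a 'key=value key=value' line into a dict."""
    return dict(pair.split("=", 1) for pair in line.split())


def group_by_field(lines, field):
    """Group parsed events by one field, each group sorted by timestamp."""
    groups = defaultdict(list)
    for line in lines:
        event = parse_event(line)
        groups[event[field]].append(event)
    for events in groups.values():
        # ISO 8601 timestamps sort correctly as plain strings.
        events.sort(key=lambda e: e["ts"])
    return dict(groups)


by_txn = group_by_field(SAMPLE_EVENTS, "txn")
path = [e["layer"] for e in by_txn["A17"]]
print(path)  # the layers transaction A17 passed through, in time order
```

Passing `"ip"` instead of `"txn"` to `group_by_field` gives the same view per IP address, which is the other pivot Robert mentioned.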
Unsurprisingly, Robert’s presentation stoked the audience who had quite a few questions to ask about Sony’s implementation of Splunk. When I asked him at the end of the presentation how he describes Splunk to the rest of Sony, he responded with, “Splunk is the solution we use to understand what’s happening in our infrastructure regardless of where or when – it’s the one place we go to first”.
Ron Broersma, Chief Engineer for IT Division & Network Security Manager
Large US Defense Organization
The US DoD group Ron represents (“The Group”) is responsible for acquiring, developing, delivering and sustaining decision superiority for the war fighter at the right time and for the right cost. In Ron’s words, “we buy stuff for the US armed forces, such as command and control systems, communications, IT systems, and more”. The Group has a very large R&D component across 12 sites around the world, so they need to run and maintain their own network to support this environment, which is their principal use of Splunk.
A veteran of 25 years in the DoD, Ron occupies two main roles. The first is Chief Engineer for the IT division i.e. the technical responsibility, which includes implementing all the firewalls, IDS, IPS, entire security architecture, network architecture, the VPNs, the enterprise network, the intranet, etc.
His other role is Network Security Manager, which includes responsibility for the enterprise architecture, network standards, integration of technologies and evaluating best-of-breed products – what works, what works well, where is the industry going, what makes sense to deploy to the fleet, and different parts of the DoD. This was where Ron first came across Splunk.
Ron covered some of the unique challenges his group faced. “The DoD is a military organization with a lot of compliance mandates. It’s also a prime target for hackers and other adversaries from around the world, so there is an extremely large security focus to protect the network, users and customers.”
He described the numerous, heterogeneous data sources they needed to harness to secure, manage and audit IT – ranging from next-generation to legacy appliances and equipment. These included the Google search appliance, Netscreen firewalls, Tipping Point IPS, Marconi ATM switches, Ascend dial-up terminal servers, Juniper, Cisco, Brocade routers and switches, Cisco and Juniper VPN appliances, Aruba wireless controllers and Blue Coat web proxies.
Another challenge Ron talked about was the move to IPv6. For the uninitiated, IPv6 is the new IP protocol mandated by the federal government for compliance across all organizations and is designed to replace IPv4, the Internet protocol currently deployed and used most extensively throughout the world. Ron stated, “the challenge is in an environment with mixed IPv4 and IPv6 equipment, as well as other networking technologies such as ATM switches, how do you normalize all the log data so using it is seamless?”
The Information Operations Condition (INFOCON) threat level system is a defense system based primarily on the status of information systems. The move to INFOCON 3 meant increasing the frequency of validation of the information network and its corresponding configuration. In Ron’s words, “the move to INFOCON 3 resulted in new requirements, finally turning our use of Splunk from a cool toy to a serious production solution.”
Correlating Across Billions of Events
Given their environment, data sources and pressing needs, the decision to use Splunk was “pretty simple”, according to Ron. “It was pretty amazing once we saw Splunk and what it could do – the schema-less environment, where you just plop on a lot of data and the level of the abstraction Splunk delivers – it’s perfect. It does what you want it to do and without a lot of training.”
Everything From One Place
Ron described their use of Splunk, “Splunk pulls log data from multiple data sources, centralizing it and normalizing it, enabling us to search, alert, monitor and report on it from one place.” Ron remarked, “our implementation of Splunk is out-of-the-box. We didn’t have to build any special parsers or adapters for the different data sources – from new equipment, to stuff deployed in the 80s. We have been able to consolidate years of log data in one central place.”
Ron remarked that IPv6, “was not a problem at all – Splunk automatically understands the data and enables us to search and analyze IPv4 and IPv6 formats seamlessly.”
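For readers wondering what "normalizing" mixed IPv4 and IPv6 log data involves, here is a small Python sketch using the standard library `ipaddress` module. It is only an illustration of the general problem, not a description of what Splunk does under the hood, and the helper name `normalize_ip` is hypothetical. The idea is that the same host can appear in logs as a fully expanded IPv6 address, a compressed one, or an IPv4-mapped IPv6 address, and all of those need to reduce to one canonical form before they can be searched together.

```python
import ipaddress


def normalize_ip(raw):
    """Return a canonical string for an IPv4 or IPv6 address.

    IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) are folded back to IPv4 so
    that the same host matches regardless of which stack logged it.
    """
    addr = ipaddress.ip_address(raw.strip())
    if addr.version == 6 and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    # str() yields the compressed, lowercase form for IPv6 addresses.
    return str(addr)


# Three spellings taken from documentation address ranges.
for raw in ["2001:0DB8:0000:0000:0000:0000:0000:0001",
            "::ffff:192.0.2.1",
            "192.0.2.1"]:
    print(raw, "->", normalize_ip(raw))
```

Folding IPv4-mapped addresses back to IPv4 is a design choice for this sketch; an environment that needs to know which stack a device used would keep the two forms distinct.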
For his network security teams, Ron said, “a lot of the tools used until now were homegrown/homemade tools – grep-based, perl-based scripts, and Splunk was able to automate that out of the box, being used to monitor for anomalies and drilldowns, looking for IP addresses, etc.”
He also is using Splunk to replace vendor-specific network management tools, which he said, “are pretty good for their own product environment, but in terms of good deep analysis, they’re slow and clunky. We now use Splunk which is a lot faster and has meant we can meet INFOCON 3 requirements.”
His team did evaluate using a SIEM to pull together the data they needed and correlate it, but he said, “it’s just a lot of work – very difficult, building a ton of collectors, getting all the field normalized – very complex and a lot of work”.
Deriving New Intel
Ron went through an example of how they find new user intelligence with Splunk. They had an issue, which required them to Splunk their Google appliance data to diagnose a server issue. A by-product of this was seeing that they could profile the different kinds of searches being done. This visibility enabled them to optimize the environment for the most popular and critical searches. “In just 2 clicks, we created a report to provide immediate feedback on what we need to tune to improve the speed of critical website searches.”
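The kind of search profiling Ron described can be pictured with a short Python sketch: take the request lines a search appliance logs, pull out the query term and count the most popular ones. This is a rough illustration only, not Ron's report and not the appliance's real log format. The `/search` path, the `q` parameter and the sample requests are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical request paths as a search appliance might log them; the
# "/search" path and the "q" parameter are assumptions for illustration.
SAMPLE_REQUESTS = [
    "/search?q=travel+policy&site=intranet",
    "/search?q=VPN+setup",
    "/search?q=Travel+Policy",
    "/search?q=vpn+setup&start=10",
    "/search?q=travel+policy",
    "/status",
]


def top_queries(requests, n=3):
    """Count search terms case-insensitively and return the n most common."""
    counts = Counter()
    for request in requests:
        # parse_qs decodes '+' as a space and returns a list per parameter.
        params = parse_qs(urlsplit(request).query)
        for term in params.get("q", []):
            counts[term.strip().lower()] += 1
    return counts.most_common(n)


print(top_queries(SAMPLE_REQUESTS))
```

A ranking like this is what tells you which searches are worth tuning first.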
Ron wrapped with some pretty positive feedback on Splunk. “Splunk was just too easy. It just works. In all the decades I’ve been in software, I was amazed at the quality of the software. It’s done cleanly, done correctly, and it’s very intuitive. Once you see it, you realize it’s a whole new ball game. It’s got us re-thinking how we should manage our data, rather than buying point solutions. It’s very useful very quickly. It’s basically a paradigm shift for us.”
Perhaps the best example of the paradigm shift Ron talked about is his response to a question from the audience, “how long did it take you to set up Splunk out of the box?”
Ron responded by describing a recent issue where he deployed Splunk. “About half of our network switches are from one vendor and managed with the vendor’s management tool.” His team were getting frustrated with the sheer difficulty of searching for anything and with how long searches took in a tool built on a traditional database architecture. “It took a lot of horsepower to get any speed out of searches.” Finally an issue arose which meant Ron had to act quickly. “I deployed a new instance of Splunk, pointed all the syslog data at it and was able to diagnose the network issue being experienced.” And amazingly, Ron did all of this during a two-hour layover at Denver airport. According to Ron, “It was literally that easy.”
Towards the end of Q&A, a member of the audience raised their hand and asked half-skeptically, “all of that you just went through was really out-of-the-box Splunk?”, to which Ron replied with a simple, “yes”.