Orlando is the 5th SplunkLive of 2010 (following events in Boston, London, Vienna and Munich) and the first ever in Florida. The event drew a capacity crowd of enthusiastic customers and users.
As usual at these events, we asked customers to stand up and talk about their experience with Splunk – how it’s used, where it helps, lessons learned and the impact on their organization. On this occasion, we had two great speakers from Voxeo and Presidio.
RJ Auburn – Voxeo
Voxeo is the world’s largest provider of Interactive Voice Response (IVR) services, supporting over 82,000 hosted ports globally along with hundreds of on-premise deployments. Over 100,000 developers use Voxeo’s platform to integrate their existing web applications with communications via traditional, next-generation, or social networks – instant messaging, Twitter, Facebook, Skype, SMS, voice, etc.
RJ is CTO at Voxeo and was responsible for bringing Splunk in to help fulfill his mission to “make communications simple”. When asked what they use Splunk for, his response was simple: “What don’t we use Splunk for!” And indeed, Voxeo showcases multiple use cases for Splunk. More on that later, but first, what does Voxeo’s IT infrastructure look like?
Logging at the Terabyte Scale
RJ spent several minutes discussing Voxeo’s global infrastructure. Their hosted IVR platform spans 7 datacenters across North America, Europe and Asia Pacific. There are over 2000 servers across these datacenters, generating approximately 1 terabyte of raw log data per day in total. These facts alone pose significant challenges when seeking to make use of these logs and IT infrastructure data: shipping logs to a central server is not feasible for logistical, security, regulatory, legal and privacy reasons. Add to this the need to retain data for 7 years for compliance and regulatory reasons, plus Voxeo’s 100% uptime SLA to its customers, and finding a better way to manage their IT infrastructure data looked like a significant challenge.
RJ started looking at different solutions and eventually came across Splunk. Not only did Splunk’s distributed architecture and scalability characteristics match Voxeo’s requirements, it also fully addressed the different ways they wanted to use their IT infrastructure data:
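To put those numbers in perspective, here is a back-of-envelope estimate of the raw volume implied by the retention requirement. It is only a rough sketch: it assumes a constant 1 TB per day, ignores leap days, and ignores compression and indexing overhead, none of which were quantified in the talk. The function and constant names are made up for illustration.

```python
# Back-of-envelope retention estimate from the figures quoted in the talk.
RAW_TB_PER_DAY = 1      # approximate raw log volume per day (from the talk)
RETENTION_YEARS = 7     # retention period (from the talk)
DAYS_PER_YEAR = 365     # assumption: leap days ignored


def raw_retention_tb(tb_per_day, years, days_per_year=DAYS_PER_YEAR):
    """Raw volume held once the retention window is full, before compression."""
    return tb_per_day * years * days_per_year


total_tb = raw_retention_tb(RAW_TB_PER_DAY, RETENTION_YEARS)
print(f"Raw data under retention: about {total_tb} TB")
```

At a steady 1 TB per day that works out to about 2,555 TB of raw logs spread across the 7 datacenters, which helps explain why centralizing the data was never a realistic option.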
- IT operations monitoring in the NOC – 24×7 dashboards across the entire IT infrastructure to monitor network performance, watch for trends, and support optimization and capacity planning.
- Troubleshooting IT infrastructure issues – when an issue does occur, operations teams can pinpoint the root cause very quickly from one place.
- Providing developer visibility – giving the 100,000 developers on the Voxeo platform easy and secure access to the hosted platform logs through a multi-tenant, scalable approach.
- Meeting security and compliance requirements – providing visibility of all security-relevant data and producing reports for compliance and regulatory mandates such as PCI, SOX, HIPAA, ISO 17799 and Gramm-Leach-Bliley.
Splunk’s distributed architecture is deployed across all Voxeo’s datacenters, providing secure and rapid access to logs and IT infrastructure data, whilst avoiding the need to ship data around.
Splunk’s scalability model is based on MapReduce, which scales linearly across commodity servers to absorb growing transaction and data volumes. Splunk also integrates with Voxeo’s single sign-on architecture to provide a seamless experience for external customers and developers using Splunk. Voxeo offers Splunk’s ad hoc reporting as a value-added capability embedded in their hosted offering.
More recently, Splunk has also been embedded in Voxeo’s on-premise product and integrated into their management console (Prophecy Commander), providing a replica of the hosted architecture for on-premise environments ranging from a single laptop to a large datacenter.
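For readers unfamiliar with the pattern, the idea of searching data where it lives and merging only small partial results can be sketched in a few lines. This illustrates the general map/reduce approach and is not Splunk’s actual implementation; the datacenter names, sample log lines and function names are all hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: log lines that never leave their datacenter.
LOGS_BY_DATACENTER = {
    "dc-us": ["ERROR timeout", "INFO call started", "ERROR timeout"],
    "dc-eu": ["INFO call started", "WARN slow response"],
    "dc-ap": ["ERROR codec failure", "INFO call ended"],
}


def map_count_levels(lines):
    """Map step: runs next to the data and returns a small summary."""
    return Counter(line.split(" ", 1)[0] for line in lines)


def reduce_counts(left, right):
    """Reduce step: merges two partial summaries into one."""
    return left + right


# Each datacenter produces its own partial result (map) ...
partials = [map_count_levels(lines) for lines in LOGS_BY_DATACENTER.values()]
# ... and only those small summaries are combined centrally (reduce).
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Because only the per-site counts cross the network, adding another datacenter adds one more partial result to merge and does not require shipping any raw logs.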
Final note – RJ’s complete presentation delivered at SplunkLive Orlando is available at the following link (thanks RJ!): http://www.slideshare.net/voxeo/logging-at-the-tb-scale-voxeo-at-splunklive
David Winters – Presidio
Presidio, Inc., is a diversified professional and managed services firm that recently merged with Coleman Technologies, a leading IT and systems engineering firm providing, amongst other things, information technology and systems engineering services. As David put it, “We manage outsourced NOCs”. Their NOC environment includes Linux, Windows and Cisco equipment for unified communications.
David joined Coleman Technologies 5 years ago, heading up their managed services group and specifically building the NOC practice. Here’s his version of events: “If you complain enough, you eventually get responsibility, and I ended up running the NOC!”
Reining in the NOC
David’s most pressing issue was managing the data deluge. He used Zenoss in the NOC for fault performance monitoring: “Zenoss is great for displaying row-by-row information on the screen, like SNMP traps, syslog and threshold alerts, but the screens didn’t scale as the NOC operations scaled.” The team found that as they added more customers, more devices, more systems and more advanced technologies, important things simply got pushed off the bottom of the screen.
He said, “We simply did not have the physical real estate for eyes on glass, to see all the important messages and see what’s going on. This is a big problem.” David and his team then deployed Splunk to manage the low-level, high-volume data and find problems, which can then be surfaced via Splunk dashboards to the NOC.
“Splunk Makes Me Sick to My Stomach”
Let’s explain this somewhat controversial statement. In David’s words, “When I started Splunking the data and seeing what we were missing using our traditional fault performance systems, and how we could correlate it and show it in dashboards, I literally went home sick to my stomach, not being able to sleep – and then incessantly began using Splunk and finding the silliest errors that were vastly widespread in customer environments – fans and routers that were stopping, duplex mismatches, VLAN tags that were incorrect. Easy-to-fix problems that nobody knew about!”
Bringing Important Messages to the Forefront
David and his team use Splunk to filter out noise and map severities to different message types from custom and packaged applications. Lower-level events are Splunked, and David is now able to catch critical issues as they are building: he can see how frequently an issue occurs and at how many locations, with a breakdown by extracted field and line of business. By doing this, David and his team obtain actionable intelligence they can respond to quickly. He really liked how level 1 NOC operators can create custom dashboards for specific customers to monitor known issues without involving development teams.
Final word? Even power users of Splunk get value from SplunkLive events. David said that after learning more about dashboards in the product demo, he built three of them during the session!
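The workflow David describes (drop the noise, attach a severity to each message type, then count occurrences and affected locations) can be sketched roughly as below. This is an illustration in plain Python and not a Splunk search or anything shown in the talk; the message types, severity labels, site names and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from message type to severity; "noise" is dropped.
SEVERITY_BY_TYPE = {
    "fan_failure": "critical",
    "duplex_mismatch": "major",
    "vlan_tag_error": "major",
    "heartbeat": "noise",
}

# Hypothetical low-level events, as they might arrive from customer sites.
EVENTS = [
    {"type": "heartbeat", "site": "site-a"},
    {"type": "duplex_mismatch", "site": "site-a"},
    {"type": "duplex_mismatch", "site": "site-b"},
    {"type": "fan_failure", "site": "site-c"},
    {"type": "heartbeat", "site": "site-b"},
    {"type": "duplex_mismatch", "site": "site-a"},
]


def summarize(events, severity_by_type):
    """Filter out noise, then count events and distinct sites per issue."""
    summary = defaultdict(lambda: {"count": 0, "sites": set()})
    for event in events:
        severity = severity_by_type.get(event["type"], "unknown")
        if severity == "noise":
            continue  # noise never reaches the dashboard
        entry = summary[(severity, event["type"])]
        entry["count"] += 1
        entry["sites"].add(event["site"])
    return dict(summary)


for (severity, msg_type), info in sorted(summarize(EVENTS, SEVERITY_BY_TYPE).items()):
    print(f"{severity:8} {msg_type:16} count={info['count']} sites={len(info['sites'])}")
```

Grouping by severity and message type keeps a recurring issue on a single summary row however many raw events it generates, which is the opposite of the row-per-event view that pushed important messages off the screen.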