SplunkLive Seattle Kicks IT

On what was an incredibly beautiful day, more than 100 Splunk devotees attended our first-ever SplunkLive event in Seattle last week. In the shadow of Microsoft, we talked about our Windows and Microsoft strategy and compared notes with lots of customers that are running mixed Microsoft, Linux and Solaris environments. Many of our customers with Microsoft Active Directory, Exchange and SharePoint environments are using Splunk to troubleshoot problems and implement security and compliance controls in large-scale, distributed environments. But I’m still surprised at how little Microsoft .NET we’re seeing in large-scale production applications.

Three Seattle-based customers presented their views on managing mission critical applications, IT data consolidation and Splunk.

  • T-Mobile USA
  • Blue Nile
  • Washington State University

T-Mobile USA

Sean White, Senior Engineer with T-Mobile Operations in Bellevue, talked with us about their global rollout of Splunk. Sean is a member of the security engineering team charged with incident response, IDS, vulnerability scanning, anti-virus and enterprise unified logging. He graduated with a B.S. in Computer Science from the University of Kansas and has a deep background in large telecom environments, initially as a system administrator and webmaster, then in SS7 network C&C and performance, engineering, and now information security. Sean has been at T-Mobile for 4 years and was previously at Cingular and AT&T Wireless. T-Mobile USA is the 4th largest US national provider of wireless voice, messaging, and data services, with 34M subscribers and annual revenues of $17B. T-Mobile USA is the US operating entity of T-Mobile International AG, the mobile communications subsidiary of Deutsche Telekom AG (NYSE: DT). Deutsche Telekom is one of the largest telecommunications companies in the world, with nearly 120 million customers worldwide.

It all started with PCI Compliance

Like many of our enterprise customers, T-Mobile started working with Splunk in one area but quickly saw the value of expanding into others. For Sean and his team, PCI Compliance was the beginning of the Splunk solution footprint, but soon everyone realized the consolidation of logs, events, messages, configurations and changes meant a whole lot more.

T-Mobile’s requirements for proving PCI compliance are very specific, starting with PCI DSS Section 10: track and monitor all access to network resources handling cardholder data. In T-Mobile’s case, scale was a big issue. Fulfilling Section 10 meant tracking 26+ in-scope applications and being able to trace transactions from start to finish across 650+ servers running Windows, Linux and Unix varieties. It also means more than 100 individuals log into Splunk every day as part of the process.

The Splunk Set-up

The Splunk configuration consists of

  • Pairs of forwarders set up in each of 4 geographic locations.
  • Three short term indexers + 1 short term search box.
  • Three Long-term search boxes hooked into a 32 TB NAS.
  • Centrally controlled from a single deployment server.

The current installation is indexing more than 600GB/day of data and has just passed the 10B event mark. Controlling access to all this data is critical and T-Mobile has Splunk roles set up for managers and application teams to limit access to subsets of the data. The ability to segregate data access along lines of duties is critical to prove PCI compliance.

The Business Case for a SOC

In addition to proving PCI Compliance, T-Mobile has discovered Splunk’s use for Security as well. Not long ago, a SIEM vendor would have told you IDS and firewall logs were all you need. That >=2 sources of data == correlation. Not so much.

“All the best new vulnerabilities are coming in on the application layer.”
– Sean White

Enterprise logging, meaning visibility into all of your IT data, is absolutely critical in defending against modern blended attacks. At T-Mobile, Splunk has become a primary analysis tool for deciphering what is happening to the applications, servers and devices on the network. With a few saved searches, Splunk does real correlation.

Nothing Boring about Logs and IT Data!

PCI Compliance mandates gave T-Mobile the excuse (read: funding) to start an enterprise logging initiative. Logging all security, network and application events gives the insight needed not only to measure and report on compliance controls but also to run a more secure and effective business. T-Mobile has also discovered that the ability to ask any question of their environment and get immediate answers provides a pile of value to help desk operations and to business intelligence functions.

“All the information about your company is in your logs—there’s nothing boring about it.”

Blue Nile

Jerry Brennock, Director Core Development at Blue Nile, explained how the company is using Splunk to improve the experience of buying diamonds over the Web. Blue Nile, Inc. is an online retailer of diamonds and fine jewelry offering in-depth educational materials and unique online tools that place consumers in control of the jewelry shopping process. Importantly, the focus is on giving customers a great experience at a great price, which translates to requiring high quality at a low cost. Jerry’s team builds and supports the infrastructure and applications for merchandising and marketing, including the website. He’s been with Blue Nile for 10 years and in the e-commerce space for more than 17.

The Killer Diamond App

Diamond Search is undoubtedly the killer application for Blue Nile’s E-commerce experience. It’s an asynchronous JavaScript app that has to work across any browser, and there are many non-obvious use cases. Together, these three factors mean it is prone to failure in lots of edge cases.

“If this application isn’t fast and accurate, we don’t sell diamonds.”
– Jerry Brennock

Jerry’s team has embedded tracking pixels with name-value pairs to track JavaScript profile information from each diamond search. This, together with Web server 500 and 404 errors, gives the development, operations and customer support teams all the data they need to troubleshoot problems. The challenge is finding customer problems “in the moment” before the sale is lost.
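To make the tracking-pixel idea concrete, here is a minimal Python sketch of how name-value pairs on a pixel request could be pulled out of a Web server log line. The pixel path (`/px.gif`), the parameter names and the 500 ms threshold are all invented for illustration; the actual fields Blue Nile records were not shared.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_request(request_line: str) -> dict:
    """Return the name-value pairs carried by a tracking-pixel request.

    `request_line` is the request portion of an access log entry,
    for example 'GET /px.gif?a=1&b=2 HTTP/1.1'.
    """
    parts = request_line.split()
    if len(parts) < 2:
        return {}
    query = urlsplit(parts[1]).query
    # parse_qs returns a list per name; keep the first value of each.
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical pixel hit emitted by the search page (all names are assumptions).
sample = "GET /px.gif?search_ms=840&browser=IE7&carat=1.5&results=212 HTTP/1.1"
fields = parse_pixel_request(sample)

# Flag searches slower than an arbitrary, illustrative threshold.
slow = int(fields["search_ms"]) > 500
print(fields, "slow" if slow else "ok")
```

Once the pairs are available as named fields, a slow or failing search can be tied back to a browser and a search profile rather than to an anonymous log line.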

Centralized Monitoring and Alerting with Splunk

In order to respond quickly the development, QA, operations and customer support teams needed a centralized, consolidated view of all Web logs across the infrastructure. In addition, the existing custom error alerting system was fragile and error prone. The Splunk solution was designed to collect logs and events in real-time and provide searches, alerts and notifications.

“If we solve a problem in one minute versus 30 minutes during a peak hour – Splunk pays for itself.”

Real-Time Customer Service

The most important use case driving Blue Nile’s retooling with Splunk is Customer Service. Superior service is a key driver of the company’s growth. Repeat and referral business is very important in a high end E-commerce business like selling diamonds.

‘With Splunk we can now contact customers intelligently: “We see you are looking for a 1.5 carat diamond and noticed you are having a problem with Internet Explorer…” This gives our customers intelligent service and lets them know we’re not wasting their time.’

Sometimes alerts start firing immediately after a new code release. QA can react quickly, using Splunk to research issues. This allows them to identify and correct edge cases that are difficult to catch in non-production environments.

Low Barrier Reporting

Initially reporting with Splunk was seen as just an extra bonus. But, Splunk made ad-hoc reporting so easy we started publishing saved searches to understand which site features are valuable to customers and partners.

  • How many customers have active RSS feeds? Which readers?
  • How many partners are using that new pricing report?
  • How many customers actually scroll down in diamond search? How often?

One example here shows how many partners are using that new pricing report.

eventtype="XNet" (BNF_http_filename…") starthoursago=24 | rex field=vendid "(?[^0123456789%]{2,})" | sort bn_vendor_name | chart count(bn_vendor_name) by bn_vendor_name BNF_http_filename
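For readers who don't yet think in Splunk searches, the Python sketch below mirrors the logic of the search above: extract a vendor name from `vendid` with the same character class, then count hits per vendor and file name. The capture group name `bn_vendor_name` is an assumption based on the field the search sorts and charts by, and the sample events are made up.

```python
import re
from collections import Counter

# Same character class as the rex step: a run of two or more characters
# that are neither digits nor '%'. The group name is an assumption.
VENDOR_RE = re.compile(r"(?P<bn_vendor_name>[^0123456789%]{2,})")


def count_report_hits(events):
    """Count events per (vendor name, file name), like the chart step."""
    counts = Counter()
    for event in events:
        match = VENDOR_RE.search(event.get("vendid", ""))
        if match is None:
            continue  # no vendor name could be extracted from this event
        key = (match.group("bn_vendor_name"), event["BNF_http_filename"])
        counts[key] += 1
    return counts


# Invented sample events; real field values will differ.
events = [
    {"vendid": "1234acme", "BNF_http_filename": "pricing.csv"},
    {"vendid": "1234acme", "BNF_http_filename": "pricing.csv"},
    {"vendid": "77zenith%20", "BNF_http_filename": "pricing.csv"},
    {"vendid": "9981", "BNF_http_filename": "pricing.csv"},
]
hits = count_report_hits(events)
for (vendor, filename), count in sorted(hits.items()):
    print(vendor, filename, count)
```

The point of the comparison is that the one-line search replaces the loop, the regex plumbing and the counting, which is what makes ad-hoc reporting cheap enough to do on a whim.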

Lessons Learned

Jerry’s team has been using Splunk extensively as their centralized monitoring and reporting solution in the data center. They like how Splunk seamlessly transitions from alerts to research and troubleshooting mode. A few tips from his team:

  • Use event types and named fields to increase accuracy in your alerts.
  • Think about Splunk not just for investigation but for alerting and reporting.
  • Long-term trending analysis complements real-time monitoring.
  • Saving searches is a great tool for internal training of operations, QA and support personnel.

Washington State University

JJ Warren is an Oracle Database Administrator at Washington State University and a super sharp Splunk expert. JJ has been working with Oracle databases for 10+ years and has been a SQL Server DBA for various projects like the WSU data warehouse. He is the principal DBA and developer for many large private projects (Brownfield/Superfund sites, Marketing Research, etc.). JJ’s core roles involve security, performance tuning, and assisting with database/application development, and he’s been known on occasion to dabble with networks and security (VPNs, firewalls, SNMP monitoring).

Washington State University is a land-grant university that provides world-class education to more than 25,000 students statewide. Founded in 1890, WSU’s statewide system includes campuses in Spokane, the Tri-Cities, and Vancouver, regional learning centers, extension offices in every county, and distance degree programs accessible around the world. U.S. News and World Report consistently ranks the University among the top 60 public universities.

We Needed Centralized Logging

The WSU IT team, like most enterprises, works in various silos:

  • Networks
  • Security
  • Operating systems
  • Servers
  • Infrastructure
  • Critical Applications
  • Mainframes

But, there was miscommunication, misinformation and limited access across teams to solve broad problems.

“It is difficult to properly tune, secure, and help developers when you can’t properly see all the forces acting on your environment.”
– JJ Warren

IT process improvement became the main focus to improve quality of service and reduce cost of running operations. The IT teams put together a number of process improvement goals including:

  • Ability to track E-mail MTA activities end to end across all mail systems (Barracuda, Sendmail, MSFT Exchange).
  • Ability to track Web-based sessions for single sign-on among various Web servers (Apache, IIS).
  • Ability to track home grown application transactions end to end utilizing custom log and event formats.
  • Making available logs and events that aren’t sent off hosts over the network to the various silos with access controls.
  • Ability to track response times for services from end to end.
  • Develop standardized reports across the silos and schedule regular delivery.
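Several of these goals come down to the same technique: group events from different systems by a shared identifier, order them by time, and measure the span. A minimal Python sketch follows, assuming each event has already been parsed into a timestamp, a source and a message ID. The field names, sources and timestamps are invented for illustration and are not WSU’s actual log formats.

```python
from collections import defaultdict
from datetime import datetime


def trace_by_id(events):
    """Group events by shared ID and order each group by timestamp."""
    traces = defaultdict(list)
    for event in events:
        traces[event["msg_id"]].append(event)
    for hops in traces.values():
        hops.sort(key=lambda e: e["time"])
    return dict(traces)


def end_to_end_seconds(hops):
    """Seconds between the first and last event seen for one ID."""
    return (hops[-1]["time"] - hops[0]["time"]).total_seconds()


# Invented events from three mail systems, deliberately out of order.
events = [
    {"time": datetime(2009, 1, 1, 9, 0, 4), "source": "exchange", "msg_id": "A1"},
    {"time": datetime(2009, 1, 1, 9, 0, 0), "source": "barracuda", "msg_id": "A1"},
    {"time": datetime(2009, 1, 1, 9, 0, 1), "source": "sendmail", "msg_id": "A1"},
    {"time": datetime(2009, 1, 1, 9, 5, 0), "source": "barracuda", "msg_id": "B2"},
]
traces = trace_by_id(events)
path = [hop["source"] for hop in traces["A1"]]
print(path, end_to_end_seconds(traces["A1"]))
```

A message that shows up at only one hop, like `B2` here, is often the interesting case, since it points at where delivery stalled.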

Why Splunk?

JJ is very passionate about IT process improvement, the role IT data plays in process improvement, and Splunk. He offered up some excellent reasons why WSU chose Splunk.

“Other vendors offer canned reports, but to truly understand our environment—and get up and running quickly, Splunk was the best answer.”

The Results

Every IT system administrator (more than 40 people) is now using Splunk. Regex searches on the syslog server would have taken minutes to hours to write properly, run and report. It now takes seconds with Splunk. Splunk has become the proactive alerting system of choice. Now the WSU team can have multiple people jump on issues right away.
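To give a feel for the hand-written approach being replaced, here is a small Python sketch of a regex search over syslog lines that counts failed logins per host. The log lines, host names and the "Failed password" pattern are illustrative assumptions, not WSU data, and the regex only covers the classic BSD-style syslog layout.

```python
import re
from collections import Counter

# Timestamp, host, process name with optional [pid], then the message.
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<proc>[^\s:\[]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)


def failed_logins_by_host(lines):
    """Count lines whose message contains 'Failed password', per host."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match and "Failed password" in match.group("msg"):
            counts[match.group("host")] += 1
    return counts


# Invented sample lines in the classic syslog layout.
lines = [
    "Mar  3 10:15:01 db01 sshd[2211]: Failed password for root from 10.0.0.5 port 4022 ssh2",
    "Mar  3 10:15:04 db01 sshd[2211]: Failed password for root from 10.0.0.5 port 4023 ssh2",
    "Mar  3 10:15:09 web02 sshd[987]: Accepted password for alice from 10.0.0.9 port 5100 ssh2",
    "Mar  3 10:16:00 web02 sshd[990]: Failed password for admin from 10.0.0.7 port 5101 ssh2",
]
failures = failed_logins_by_host(lines)
print(failures.most_common())
```

Every new question means another regex like this one, which is where the minutes to hours went.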

“Now multiple people can jump on issues. We’re no longer stovepipes but a much more effective team.”

What’s Next?

Next, JJ and his team are working to provide custom and saved searches to a broader audience. They are also implementing indexing of application data to give developers new troubleshooting power and to integrate development more closely with production operations. WSU’s goal is to have Splunk on every server and every network device.

“Splunk is a best practice for our IT department—it’s embarrassing if it’s not in place somewhere.”
