SplunkLive! Princeton 2009

It’s Wednesday and we’re at SplunkLive! Princeton, NJ. What an awesome place. Princeton is home to a great university and some great culinary experiences. Check out Mediterra, an interesting mix of Italian and Spanish influences. Apparently it’s where all the Princeton parents treat their kids to dinner when they are in town. Next door to our venue was the great hope for the state of NJ: a new Governor. The current Governor has turned the state budget and tax base into toxic waste. Well, things went much better for the more than 60 SplunkLive! attendees in Princeton today, who gained insight into how a number of large Splunk customers keep their mission-critical applications running in a time of IT budget slash and burn.

Matthew Stevens, Director of Software Systems and Architecture at a leading provider of communications products and services, provides guidance to executives in his organization on mission-critical media systems and strategic systems architecture. The company is one of the country’s largest providers of cable, high-speed Internet and digital voice services.

Comcast Developer Network

Matthew’s latest project is the organization’s Developers Network, a secure web services platform for the development of cool new media and entertainment offerings. The Web Platform environment generates billions of software events each day from caching and load-balancing, origin application servers, databases, middleware and content delivery networks for images and video streams. These services demand high quality. Much of the content is exclusive, and premium services drive revenue. Interfaces between technology components (applications, delivery platforms) need to adhere to best practices to ensure the best possible end-customer experience.

Why Splunk?

The company has acquired many system and application management platforms over the years, but none was providing the robust operational telemetry that teams around the company need to ensure data integrity, stability, application quality and efficiency. Several efforts specifically drove them to consider and deploy Splunk.

  • Product rollout: The team wanted the ability to predict and correct potential issues before going live into production. Splunk has become a required best practice for new product rollouts.
  • Network/System Integrity: Understanding security and user experience across a very large network and set of systems is a must to protect the business. Splunk provides the insight the network and system teams need across many different silos of technologies.
  • Business Intelligence: Having immediate access to real-time events and historical trends allows the various business teams to react quickly and adapt to changing customer behaviors.
  • Agility: Alerts and Dashboards indicate discrepancies so distributed teams can investigate immediately and remediate failures and attacks.

Video CDN/CMS Performance

“In content management systems and delivery networks a devil walks the long tail. If you’re facing concurrent hits across the tail of the curve, sharpen your pencil, you’ve got problems!”

Splunk helps the organization understand the risks of instability in their systems, especially during periods of high concurrency. Through pre-production modeling of event patterns and subsequent monitoring of these patterns, Splunk pays for itself by helping the team avoid deployment of vulnerable systems, downtime, and upset customers.

Predicting System Imbalance

The organization has successfully used Splunk to evaluate potential infrastructure vendors’ solutions and determine whether they will balance loads properly across a large, indeterminate infrastructure. Often the answer is no, as illustrated here in a Splunk report of resource utilization across various services.

Splunk has also been utilized to see whether solutions will be resilient to different traffic patterns, helping the company perform predictive analysis before making critical infrastructure investments.

Load testing is performed during non-peak hours and the results are analyzed for system failures over time using the telemetry data Splunk can correlate across various logs, messages and events.

When failures are found the team uses Splunk reports to dig deeper into the data.
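
As a rough illustration of that failures-over-time view, here is a minimal Python sketch. It is not how the team does it (their analysis runs in Splunk). The `(timestamp, status)` event shape and the `failures_per_minute` name are assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime


def failures_per_minute(events):
    """Count non-"ok" events per one-minute bucket.

    events: iterable of (iso_timestamp, status) pairs (hypothetical shape).
    Returns a dict mapping the minute (ISO string) to the failure count.
    """
    buckets = Counter()
    for stamp, status in events:
        if status != "ok":
            # Truncate the timestamp to the start of its minute.
            minute = datetime.fromisoformat(stamp).replace(second=0, microsecond=0)
            buckets[minute.isoformat()] += 1
    return dict(sorted(buckets.items()))
```

A spike in one bucket during a load test window is the cue to drill into the underlying events for that minute.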

Security and Compliance

In addition to operations use cases, Comcast security and compliance teams leverage the consolidated logs across data centers to enable faster threat assessment and security monitoring.

  • Monitoring for bad actors to trigger alerts,
  • Conducting threat detection over time,
  • Detecting attacks/vulnerabilities in systems and
  • Auditing systems in support of security assessments and compliance.

What’s Next?

Next up for Matthew and team is the launch of a new platform enabling a network of developers to create content for the network. Some of these developers are already using Splunk in their own managed services like Mashery. The organization is working to hook the Mashery Splunk installation to their own in order to provide visibility across multiple services and providers of content and entertainment functionality.

Chris Abboud manages the Enterprise Systems Management team at Dow Jones, monitoring customer-facing infrastructure and applications. Dow Jones provides global business news and information services to millions of consumers and enterprise media groups. Keeping these revenue-generating services running 7x24x365 is the highest priority. Chris also manages the DJ service management platforms (Remedy, Knowledge Base, etc.). He’s been with the DJ organization for 10 years and in his current role for 3 years.

“Our mission is to address issues before they become service impacting events. Failures are going to happen — we need to make sure people know about them as soon as possible.”

The Splunk Set-up

The Dow Jones Splunk installation includes

  • Data from 6,000+ servers globally,
  • 13,500+ source types,
  • 1,700 network devices (primarily Cisco and Juniper) and
  • Ten distributed Splunk servers in different geographies that index ~100 GB a day and provide a new global logging console.

Why Splunk?

Each Dow Jones command center now has the ability to know what’s happening before customers do across a wide range of internal and external services. Splunk speeds the time to resolution for email outages that may impact internal users’ productivity and for editorial site downtime that can directly impact customer service and revenue. Dow Jones has found that Splunk generates significantly fewer false positives than traditional monitoring systems, and new resources are much easier to manage and deploy.

Per-server monitoring costs have dropped by a factor of five.

What’s Next

Next up, Chris and Dow Jones will be checking out the Blue Coat and Cisco Apps as they turn Splunk onto those aspects of their infrastructure.

Talk about doing more with less. Andrew Page in the Office of Information Technology at Rutgers University has seen IT budgets go from lean to next to nothing. In this unprecedented time of state educational cuts, Andrew, responsible for enterprise monitoring and service management, has turned on and been turned on by Splunk. The self-confessed “ITIL guy” at Rutgers, Andrew oversees operations for systems serving 50,000 students on campus in three different geographies (Camden, Newark and New Brunswick). The university’s back office supports 27 degree-granting units that offer majors in more than 100 fields, with thousands of courses covering the full range of human experience.

The Splunk Set-up

The Rutgers Splunk set up includes

  • 2000+ data sources,
  • 1,850 network devices,
  • ~100 Servers: Windows, Solaris, Unix,
  • ~50 J2EE apps
  • 5-10 GB logs and messages / day
  • 95% coverage of infrastructure in Splunk
  • 40+ users
  • Single Splunk Server

Why Splunk?

Six months ago Rutgers was facing a number of log consolidation drivers including:

  • The need for real-time access to production logs by service teams,
  • Faster cross-silo problem resolution and collaboration,
  • Simplification of problem troubleshooting for load-balanced applications,
  • Decommissioning of “critical” monitoring scripts running in home directories and
  • GLBA and PCI compliance and regulatory reporting mandates.

Fast Implementation

Can you fully implement Splunk in a few days? Yes, you can, according to the Rutgers team. Going from download through basic implementation took 1.5 weeks and only part of a single resource’s time. The Rutgers implementation included roles for data security, form searches and transaction searches, and custom dashboards.

Performance Management

Andrew and his team use Splunk to grab performance data. A scripted input makes HTTP calls into running JVMs. The team graphs this data and correlates it to load and error messages.
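
For readers who have not built one, a scripted input is just a program whose standard output Splunk indexes on a schedule. The Python sketch below shows the general idea only. It is not the Rutgers script, and the endpoint URL, host name, metric names and JSON response format are all assumptions invented for illustration.

```python
import json
import time
import urllib.request

# Hypothetical endpoints: each JVM is assumed to expose a JSON metrics page.
JVM_ENDPOINTS = {
    "app01": "http://app01.example.edu:8080/metrics",
}


def fetch_metrics(url, timeout=5.0):
    """Return the JSON object served at url as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def format_event(host, metrics, now=None):
    """Render one timestamped line of key=value pairs for indexing."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    fields = " ".join(f"{key}={metrics[key]}" for key in sorted(metrics))
    return f"{stamp} jvm_host={host} {fields}"


def main():
    # One line per JVM on stdout; a failed poll is logged as an event too.
    for host, url in JVM_ENDPOINTS.items():
        try:
            print(format_event(host, fetch_metrics(url)))
        except (OSError, ValueError) as exc:
            print(format_event(host, {"poll_error": type(exc).__name__}))


if __name__ == "__main__":
    main()
```

Emitting key=value pairs keeps the events easy to chart and to line up against load and error messages from the same time window.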

Outage Avoidance

In another scenario, Andrew presented how the Rutgers team finds problems before they become widespread outages. Eight weeks ago a certificate error started causing application failures and could have resulted in a widespread outage. It took 6 minutes to answer…

  • Who was affected?
  • What time did it happen?
  • What apps were involved?
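
To make those three questions concrete, here is a small Python sketch of the same triage over consolidated log lines. The Rutgers team answered them with Splunk searches. The log line format, the field names and the `summarize_cert_errors` helper are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Hypothetical log format: "<ISO timestamp> app=<name> user=<id> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) app=(?P<app>\S+) user=(?P<user>\S+) (?P<msg>.*)$")


def summarize_cert_errors(lines, needle="certificate"):
    """Answer who, when and which apps for lines whose message mentions needle."""
    first_seen = None
    users = set()
    apps = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or needle not in match["msg"].lower():
            continue
        stamp = match["ts"]
        # ISO timestamps in one format sort correctly as plain strings.
        if first_seen is None or stamp < first_seen:
            first_seen = stamp
        users.add(match["user"])
        apps[match["app"]] += 1
    return {"first_seen": first_seen, "users": sorted(users), "apps": dict(apps)}
```

The point is less the code than the prerequisite: all three answers fall out of one pass only because the logs from every application are already in one place.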

Lessons Learned

Some valuable lessons from the Rutgers team include, first, an emphasis on distributed deployment as the key to speed of installation. Second, think about security before you start. Third, during deployment get others involved quickly.

We had users on day two. The rule is that if you send in data you get a Splunk account.

Your early adopters will build their own solutions, but make sure you plan for availability as users become dependent on Splunk quickly and will notice any Splunk outages fast, fast, fast.

What’s Next

  • Expand use in the Application environment,
  • Feed in Oracle databases,
  • Migration to Splunk 4, of course…,
  • Expanded roles and security around roles should be a big win,
  • Improved dashboard cache controls and
  • Offer some in-house training in advanced skills.