Rearden Commerce Case Study

Thanks to Splunk, we can respond much faster and be responsive. What's happening to our applications is no longer a black hole.

- Chris McDaniel, Director of Operations

Application Area: Availability

Customer Profile

Rearden Commerce provides the world's largest marketplace for services of all kinds, including travel, entertainment, package shipping and meeting services. Through its online personal assistant, employees can purchase services from a trusted network of over 130,000 global services merchants based on personal preferences and company policies. Its customers include major global enterprises like HP, Genesys, Motorola, GlaxoSmithKline and Whirlpool. The services themselves are supplied by a network of providers that includes major online travel, entertainment, shipping and meeting service companies.

Rearden Commerce operated largely in stealth mode for six years while its team researched, designed and developed a robust, scalable commerce platform built on a native service-oriented architecture (SOA). On that platform, the team developed a set of composite applications to solve real-world business problems. Rearden has now been live for just over a year and has met with overwhelming success and growth in volume.

Rearden Commerce's new director of operations, Chris McDaniel, arrived with a mission to put in place IT infrastructure and operations processes that would scale to meet the demands of rapid growth.

Business Challenge

Chris knew that the ultimate challenge of running software-as-a-service would be achieving extreme availability: Rearden's service must be online for its customers 24/7/365 or it risks losing revenue and accounts. In his view, avoiding problems and recovering quickly from those that did occur would be the key to achieving this high level of availability.

Yet Rearden's operations team, like most IT organizations, had never had an easy way to analyze the operational data from logfiles and other sources proactively – which would be the natural place to look for latent issues.

They also had to fight with cumbersome homegrown tools and scattered data sources whenever they needed to look at logs to investigate an issue, so investigations were shallow and focused on symptoms. Because the data came in different formats and lived in different locations, these investigations also pulled in too many people with different skillsets across operations, development and QA, an unacceptable burden in a fast-moving startup.

Finally, even the basic homegrown logging tools Rearden already had were an unacceptable distraction for its developers to maintain – they should be working on their unique platform, not operations tools. Not to mention that over six years, developers of some of the tools had left the company and left operations in the lurch.

Seeing what's on the horizon

Chris decided he needed to put a commercially supported, easy-to-maintain tool in place that would provide his team of system and network administrators a good sense of how the infrastructure was behaving. He felt this would make it much easier to spot impending problems in production. Better yet, if this tool could be used by developers and QA to baseline the operations characteristics of new features prior to rollouts, he’d be even more ahead of the curve.

Instantaneous response

At the same time, for those problems that still happened, the tool should enable admins to analyze the issue quickly and thoroughly, with as few people as possible involved.

Technical Requirements

System Architecture

Rearden Commerce's environment consists of about 150 midrange servers, all running Red Hat Enterprise Linux. This includes Tomcat J2EE application servers, Apache web servers, Oracle database servers and a number of supporting systems administration servers. The applications deployed on the Tomcat servers are mostly homegrown, with the exception of a few supporting open source applications. These applications connect to Rearden's network of service providers via real-time web services requests.

Team

Chris's operations team consists of about 8 people, including systems, network and database administrators. They work closely with other Rearden Commerce business partners in customer support, development and QA. All of these groups either look at logs themselves or escalate issues that require looking at logs. All in all, as many as 20% of Rearden's personnel need regular direct access to data in logfiles.

Before Splunk

Prior to implementing Splunk, each production application and web server logged to its own local filesystem. Log rotation scripts managed data retirement. The Oracle database logged internally to audit tables. Syslog from hosts and devices was partially centralized.

There is a homegrown "opslog" tool that scrapes the Tomcat log4j logs for critical errors and sends alerts – but this tool is hard to maintain and has to be updated regularly with specific patterns to find.
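The maintenance burden of a pattern-driven scraper is easy to see in miniature. The following is a minimal sketch only, not Rearden's actual "opslog" code, which the case study does not show. The pattern list, the log line layout and the function name `find_critical` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns. The real opslog pattern list is not given in the source.
CRITICAL_PATTERNS = [
    re.compile(r"\bFATAL\b"),
    re.compile(r"\bERROR\b.*OutOfMemoryError"),
    re.compile(r"\bERROR\b.*Connection refused"),
]

def find_critical(lines):
    """Return the log lines that match at least one known pattern."""
    return [line for line in lines
            if any(p.search(line) for p in CRITICAL_PATTERNS)]

# Invented sample lines in a common log4j-style layout (an assumption).
sample = [
    "2006-05-01 10:00:01,123 INFO  [main] Checkout - order placed",
    "2006-05-01 10:00:02,456 ERROR [http-1] Checkout - java.lang.OutOfMemoryError: Java heap space",
    "2006-05-01 10:00:03,789 ERROR [http-2] Search - NullPointerException in itinerary lookup",
]

hits = find_critical(sample)
for line in hits:
    print("ALERT:", line)
```

In this sketch the third line is a real error, but no alert fires for it because nobody has added a matching pattern yet. That is the kind of gap that forces regular updates to a tool of this style.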

Rearden is implementing a combination of Nimsoft's NimBUS and the open source Nagios package for system, service and network monitoring and alerting. However, these tools still require admins to comb through raw logs and application reports for actual troubleshooting.

Figuring out what actually went wrong on the server would require an escalation into ops, who would take a scary dive into raw logs full of 100+ line stack traces. If any of the errors smelled even vaguely of the database, the investigation would move on to the DBAs, who would run mysterious queries against their audit logs. Likewise, if connectivity was suspected, the network admins would grep through their syslogs. Getting to the bottom of a serious issue could distract multiple people for hours or days. As often as not, the investigation would end at the first good guess because everyone involved would get pulled onto other tasks.

Splunk at Rearden Commerce

Chris chose Splunk because he’d already adopted Splunk at his previous job and he knew how easily it could solve Rearden's IT data analysis problem.

Chris and his system architect, Stan Chan, implemented Splunk Professional in May 2006. They initially configured it to consolidate logs from their web servers, application servers and host syslog. The initial setup and configuration took a few hours and from that point they were splunking live data.

They use Splunk-2-Splunk distributed data access to centralize the Tomcat log4j files and Apache logs from their production hosts on a central Splunk index host. Splunk instances on each production host tail the appropriate files and forward them over TCP to the central host.
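The tail-and-forward pattern described above can be illustrated generically. The sketch below is not Splunk's forwarding protocol or configuration, neither of which the case study documents. It only shows the idea of reading newly appended lines from a log file and sending them over a TCP socket. The function names, the host tag prefix and the wire format are hypothetical.

```python
import socket

def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Forward only complete lines; a partial trailing line waits for the next pass.
    end = data.rfind(b"\n") + 1
    return data[:end].splitlines(), offset + end

def forward(lines, sock, host_tag):
    """Send each line over the socket, prefixed with the source host name."""
    for line in lines:
        sock.sendall(host_tag.encode() + b" " + line + b"\n")

def connect(central_host, port):
    """Open a TCP connection to the (hypothetical) central collector."""
    return socket.create_connection((central_host, port))
```

A forwarder built this way would call `read_new_lines` on a timer, remember the returned offset between passes, and hand the lines to `forward`. Tracking the offset is what keeps lines from being sent twice.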

For their host logs, they use remote syslog and a syslog-ng instance on the central Splunk host.

They plan to add their Oracle audit logs to the mix soon, but are debating whether to access the Oracle audit tables via ODBC or OCI (Oracle Call Interface). They’re also about to roll out Splunk-2-Nagios to achieve an integrated workflow for alerting and investigation.

While Splunk was initially intended for the operations team to use in the production environment, the QA and development teams quickly saw its value and implemented it in the non-production environments as well. That way, all teams could work together to baseline activity and use the same troubleshooting capabilities during development and testing, so they understood the impact of new software before it rolled out to the production site.

Scenarios

Why did this customer have a problem?

Operations and developers turn to Splunk when the Customer Support team escalates customer transaction and usage issues – anything from trouble checking out to errors while selecting any of the over 130,000 services available on the Rearden Commerce platform.

Using their Tealeaf solution, the Customer Support team verifies that the customer really did have a problem by replaying the customer's session activity, then forwards the information to the operations and development teams. Those teams use Splunk to navigate the application server logs and follow the trail of errors to understand why the issue occurred. By putting the logs from all hosts in one place and providing a fast, easy way to follow the trail between related events, Splunk has slashed the time a typical investigation takes while allowing a much deeper investigation.

Alert me to application errors.

Rearden is replacing its custom "opslog" tool, which scraped the Tomcat logs for terms and error codes, with Live Splunks. The Live Splunks are easier to maintain and can leverage Splunk's automated event classification to be much more specific about which kinds of issues they include in or exclude from the alerts.

Any operations, development or QA team member can set up Live Splunks and receive them via email.

What's Normal?

Chris has encouraged his team to use Splunk to explore the behavior of their systems as seen in the logs under normal operating circumstances, not just when something is broken. He wants them to know which log messages are common and what the normal patterns of activity are during the day. Tagging event types and contributing richer descriptions to Splunk Base is a good way of capturing this view of what happens every day.

The point of looking at what's normal is to be prepared to more quickly spot what's new or different when a problem does happen. If an intermittent error in the logs that is really harmless has already been seen and tagged as "ignore", it won't become a time-sucking dead end in an investigation under fire three weeks later.

With Rearden's QA and developer teams implementing Splunk in test as well as production, this baselining is extending back before actual production deployment. Chris's team gets to see how new code is going to change what's normal before it goes live, so the more frequent escalations that follow a change are resolved much more quickly.

Keeping an eye on things.

Rearden's admins use Splunk proactively to observe new patterns in system behavior several times a week. They look at activity for each host over time, break down activity by event type and look into what seem to be new kinds of events.
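The review described above boils down to two questions: how much activity did each host produce over time, and which kinds of events have not been seen before? The sketch below shows that logic in plain Python, not in Splunk's own search interface, which the case study does not detail. The event tuples, event type names and baseline set are invented for illustration.

```python
from collections import Counter

# Hypothetical, already-parsed events: (hour, host, event_type).
events = [
    (9, "web01", "http_200"),
    (9, "web01", "http_500"),
    (9, "app01", "stack_trace"),
    (10, "web01", "http_200"),
    (10, "app02", "pool_exhausted"),
]

def activity_by_host_hour(events):
    """Count events per (host, hour) pair."""
    return Counter((host, hour) for hour, host, _ in events)

def activity_by_type(events):
    """Count events per event type."""
    return Counter(etype for _, _, etype in events)

def new_event_types(events, baseline):
    """Return event types present now but absent from the known baseline."""
    return sorted({etype for _, _, etype in events} - baseline)

# Event types already seen (and tagged) during normal operation (assumed).
baseline = {"http_200", "http_500", "stack_trace"}

print(activity_by_host_hour(events))
print(activity_by_type(events))
print(new_event_types(events, baseline))
```

The baseline set plays the role of the "what's normal" knowledge the team builds up by tagging event types. Anything outside it is a candidate for a closer look.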

The first few weeks of using Splunk already led to the identification of several issues that might have resulted in hard-to-trace errors if not caught. In one case, Splunk revealed load balancing problems that were causing unnecessary bottlenecks in the production service. In several others, admins found inconsistencies among the configurations of different hosts.
