Open Source Lab Case Study

When I get the 'my mail isn’t going through' call, it’s a tricky thing to troubleshoot. With all of our logs in Splunk it isn’t a problem.

- Corey Shields, Infrastructure Manager

Solution Areas: Application Management, Network Management, Server Management, Service Desk

Customer Profile

The Open Source Lab at Oregon State University provides hosting for many of the highest-profile open source projects in the world—the Linux Kernel Project, the Apache Software Foundation, Debian Linux, Gentoo Linux, OpenOffice.org, KDE, Mozilla and Drupal. OSL offers both managed and unmanaged hosting services.

Corey Shields, OSL's Infrastructure Manager, is in charge of computing infrastructure. He's the one who makes sure project contributors can access source control, wikis and other applications 24/7. Corey is assisted by a team of administrators, mostly part-time interns from OSU’s student body.

An Open Door for Open Source

Change is the one constant in Corey's work. OSL has what the lab calls an "open door policy": they will accept any open source project that comes knocking and accommodate any technology the project uses. That's great for developers, but for Corey it means he can't plan capacity or develop operations procedures around standardized configurations. He can't use monitoring and management tools that require too much customization to fit different applications and platforms. At the same time, Corey's intern staff turns over much more often than salaried staff in a corporate data center, so he constantly needs to train new recruits on what their predecessors figured out before moving on.

Business Challenge

The main mission for IT at OSL is to maintain 24/7 on-demand availability for thousands of developers worldwide, despite the constantly changing environment. The obvious half of the equation is minimizing downtime, and Corey has already done everything he can, short of changing the lab's open door policy, to remove unreliable components from the network.

Of course, things still break. The other half of the equation is finding and fixing problems faster, ideally before they affect customers, and identifying root causes rather than surface symptoms. To meet that challenge Corey turned to Splunk.

Technical Requirements

System Architecture

The lab’s size has more than tripled in the past eighteen months. OSL currently hosts 130 servers, of which about 70 are fully managed by OSL staff.

The servers run a variety of Linux distributions, mostly Gentoo and Debian. They host a range of Apache-based Web services, source control systems, content management systems, wikis and many underlying databases, mostly MySQL. OSL has three FTP servers in Chicago, Atlanta and Oregon.

Operations before Splunk

OSL had already deployed a syslog-ng loghost to centralize OS logs from all 130 servers, with stunnel-encrypted transport to ensure data security and integrity. Still, log analysis was troublesome. The loghost kept a separate file for each server and each day, and whenever anyone reported a problem, Corey's team had to comb through those files with grep and awk. It was time-consuming and required them to already have a good idea of which messages or values they were looking for.
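
To make the pain concrete, here is a rough sketch of what that manual search amounted to. The per-host, per-day file layout and the search term are illustrative assumptions, and OSL used grep and awk rather than Python, but the workflow is the same: you must already know what you are looking for.

    #!/usr/bin/env python
    """Sketch of the pre-Splunk workflow: comb per-server, per-day log
    files on the loghost for a string you already know to look for.
    The /var/log/hosts/<host>/<YYYY-MM-DD>.log layout is assumed for
    illustration and is not OSL's actual directory structure."""

    import glob
    import sys

    def search_logs(pattern, log_glob="/var/log/hosts/*/*.log"):
        """Print every line containing `pattern` across all per-host, per-day files."""
        for path in sorted(glob.glob(log_glob)):
            with open(path, errors="replace") as f:
                for line in f:
                    if pattern in line:
                        print(f"{path}: {line.rstrip()}")

    if __name__ == "__main__":
        # The catch: you have to know which message or value to search for.
        search_logs(sys.argv[1] if len(sys.argv) > 1 else "error")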

Apache logs were too difficult to centralize via syslog, so they were kept as separate files on each of the production hosts. This also bogged down troubleshooting and made it hard to spot patterns across multiple servers.

Splunk at OSL

Corey first saw Splunk at the LinuxWorld conference in San Francisco in late 2005. He inquired about a Splunk Professional license in February 2006.

To get started, Corey implemented Splunk 1.2 on his central syslog-ng host. For maximum performance, he configured syslog-ng to write to a named pipe (FIFO queue), which Splunk then read via its FIFO input module. Corey began splunking on a daily basis instead of grepping nearly a gigabyte of raw logs per day.
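
A minimal sketch of the FIFO side of that pipeline may help readers who have not used one: syslog-ng is pointed at a named pipe (for example via its pipe() destination), and a consumer reads events off the other end. In OSL's case that consumer is Splunk's FIFO input; the Python reader and the pipe's path below are illustrative assumptions.

    #!/usr/bin/env python
    """Stand-in consumer for a named pipe (FIFO) fed by syslog-ng.
    Splunk's FIFO input plays this role at OSL; the path and the
    print-to-stdout behavior here are assumptions for illustration."""

    import os
    import stat

    FIFO_PATH = "/var/run/loghost.fifo"   # hypothetical location

    def ensure_fifo(path):
        """Create the named pipe if it does not already exist."""
        if not os.path.exists(path):
            os.mkfifo(path, 0o600)
        elif not stat.S_ISFIFO(os.stat(path).st_mode):
            raise RuntimeError(f"{path} exists but is not a FIFO")

    def consume(path):
        """Block on the FIFO and hand each syslog line downstream (here: stdout)."""
        with open(path) as fifo:          # open() blocks until a writer connects
            for line in fifo:
                print(line.rstrip())      # a real consumer would index the event

    if __name__ == "__main__":
        ensure_fifo(FIFO_PATH)
        consume(FIFO_PATH)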

With the release of Splunk 2.0 in May 2006, Corey was able to expand his deployment to include all of the access and error logs from his Apache web server hosts. He runs a Splunk instance on each host to tail the live Apache logs locally via Splunk's tailfile input module. Instead of indexing locally, those instances forward the data over TCP to a central Splunk indexing host using the Splunk-2-Splunk feature introduced in 2.0.
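
The forwarding pattern itself is simple to picture: each web host tails its own Apache logs and ships new lines to a central indexer over the network. The sketch below illustrates that idea only; it does not reproduce Splunk's actual wire protocol, and the host name, port, and log path are assumptions.

    #!/usr/bin/env python
    """Conceptual sketch of tail-and-forward: follow a local Apache log
    and send each new line over TCP to a central indexing host. Not
    Splunk's real protocol; endpoint and paths are assumptions."""

    import socket
    import time

    LOG_PATH = "/var/log/apache2/access.log"   # hypothetical local log
    INDEXER = ("indexer.example.org", 9997)    # hypothetical central host and port

    def follow(path):
        """Yield new lines appended to the file, like `tail -f`."""
        with open(path) as f:
            f.seek(0, 2)               # start at the end of the file
            while True:
                line = f.readline()
                if line:
                    yield line
                else:
                    time.sleep(0.5)    # wait for more data to arrive

    def forward(path, addr):
        """Send each new log line to the central host over a TCP socket."""
        with socket.create_connection(addr) as sock:
            for line in follow(path):
                sock.sendall(line.encode("utf-8", "replace"))

    if __name__ == "__main__":
        forward(LOG_PATH, INDEXER)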

Next Steps

Corey is adding the full logs from OSL's three FTP servers, which will boost his Splunkable data volume to between two and five gigabytes per day. Splunk can handle ten times that volume on one server, so he's not worried about capacity. He's also integrating Splunk tightly with Nagios, an open source system monitoring tool, using the free Splunk2Nagios integration kit. Admins will be able to receive live alerts from Splunk on the Nagios console and splunk their Nagios events along with the rest of the network.
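
The Splunk2Nagios kit handles that integration for OSL. Purely to illustrate how an external alert can land on a Nagios console in general, the sketch below submits a passive service check result through Nagios's external command file; the command-file path, host, and service names are assumptions, and this is not necessarily how the kit works.

    #!/usr/bin/env python
    """Generic illustration: surface an outside alert in Nagios by writing a
    PROCESS_SERVICE_CHECK_RESULT command to its external command file.
    Path, host, and service names are assumptions; OSL uses the
    Splunk2Nagios kit rather than this script."""

    import time

    CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"   # typical, but site-specific

    def submit_passive_result(host, service, state, message):
        """Append a passive check result (state: 0=OK, 1=WARNING, 2=CRITICAL)."""
        now = int(time.time())
        cmd = f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{message}\n"
        with open(CMD_FILE, "a") as f:
            f.write(cmd)

    if __name__ == "__main__":
        submit_passive_result("ftp1", "splunk_alert", 1, "Excessive FTP connection events")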

Scenarios

You’re losing my mail messages!

Like many admins, Corey says the most disruptive interruptions come from users convinced that their mail isn't going out:

"If I get the 'my mail isn't going through' call, well, given that we have more than one mail relay it’s a tricky thing to troubleshoot. With all of our logs in Splunk it isn't a problem because the search can span across multiple hosts, so I can just start looking for a user's email address in the logs, find out where their email ended up (if at all) and help them from there."

Proactive analysis of a host

Corey’s a big fan of proactive splunking. He's the kind of admin who wants a constant bird's-eye view of what is going on in his environment. As he posted to his own blog recently:

"The interface allows you to modify your search on the fly (ctrl-click will add that word, hostname, pid, etc. to your search. ctrl-alt-click will exclude it from your search.) Using the latter method, I can search through a host’s logs, excluding things I know are okay, and in a matter of minutes I have found errors and problems in my system just by eliminating log entries that I know are okay or legit."

One of the errors he found in his system, without specifically knowing to look for it, was a rogue cron job. As Corey put it when speaking with NewsForge reporter Tina Gasperson:

"Almost immediately Splunk showed its worth in helping to find problems I didn't even notice the symptoms of… I was using Splunk to browse the logs of one of our development testbeds and noticed a cron job that was running every minute out of an old account from a developer who had left the group six months before. Given the alternative of just looking through the log one page at a time, I would not have been scouting for possible problems."

Live Splunks to monitor a trend

Corey was able to pick up a denial of service attack impacting his FTP servers in his first few days of using Splunk.

In his initial splunking of the FTP host’s logs he noticed a somewhat high, and rising, number of individual IP addresses exceeding the "per IP rate" that limits the number of FTP connections each client can make. This happens with download managers, but in excess it is indicative of a denial of service attack. These events clearly stood out when he searched that host’s logs and used Splunk’s Event Type summary view and Events by Time histogram.
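
A sketch of the kind of per-IP aggregation behind that observation is below. The actual counting and charting happened inside Splunk; the FTP log path and the rate-limit message wording are assumptions made for illustration.

    #!/usr/bin/env python
    """Count how many 'exceeded the per-IP connection limit' events each
    client IP generated. Log path and message text are assumed; Splunk's
    Event Type summary and histogram did this work in practice."""

    import re
    from collections import Counter

    LOG_PATH = "/var/log/vsftpd.log"   # hypothetical FTP server log
    LIMIT_MSG = re.compile(r"connections from (\d+\.\d+\.\d+\.\d+) exceeded")  # assumed wording

    def excessive_connections(path):
        """Return a Counter mapping offending IP -> number of rate-limit events."""
        counts = Counter()
        with open(path, errors="replace") as f:
            for line in f:
                m = LIMIT_MSG.search(line)
                if m:
                    counts[m.group(1)] += 1
        return counts

    if __name__ == "__main__":
        for ip, n in excessive_connections(LOG_PATH).most_common(10):
            print(f"{ip}\t{n}")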

There weren't enough connections yet to cause an outage, so Corey decided to let Splunk keep tabs on the situation. He created a Live Splunk to notify him if the number of such events rose above a specific threshold. Since OSL's sysadmins live on IRC day and night, Corey wrote a script to have the Live Splunk fire off a message via an IRC bot.
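
An alert hook like that can be very small. The sketch below shows one way such a script might drop a one-line notice into an ops IRC channel; the server, channel, and nickname are assumptions, and OSL's actual script and bot may work differently.

    #!/usr/bin/env python
    """Fire-and-forget IRC notifier that an alert could invoke.
    Server, channel, and nick are assumptions for illustration."""

    import socket

    IRC_SERVER = ("irc.example.org", 6667)   # hypothetical IRC server
    CHANNEL = "#osl-admins"                  # hypothetical ops channel
    NICK = "splunkbot"

    def notify(message):
        """Connect, register, join the channel, send one PRIVMSG, and quit."""
        with socket.create_connection(IRC_SERVER) as sock:
            def send(line):
                sock.sendall((line + "\r\n").encode())
            reader = sock.makefile("r", encoding="utf-8", errors="replace")
            send(f"NICK {NICK}")
            send(f"USER {NICK} 0 * :Splunk alert bot")
            for line in reader:              # wait until registration completes
                if line.startswith("PING"):
                    send("PONG " + line.split(None, 1)[1].strip())
                if " 001 " in line:          # RPL_WELCOME
                    break
            send(f"JOIN {CHANNEL}")
            send(f"PRIVMSG {CHANNEL} :{message}")
            send("QUIT :done")

    if __name__ == "__main__":
        notify("Live Splunk alert: excessive FTP connection events above threshold")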

Sure enough, a couple of days later, the Live Splunk messaged OSL's chat room. There had been a massive increase in excessive connection events. By splunking through the individual events, Corey saw in seconds that OSL was under a denial of service attack from a single IP address.

Without Splunk, Corey would have found out about the problem only when users complained about slowness, or worse yet, when a service died. He would have had to grep through many different log files, or write a one-off awk script, to identify the root cause. Had he done it that way, the FTP service would have been degraded far more severely, and for much longer.

"Grep and awk give you a 'fire and hope you aimed right' interface to finding what you are looking for... Splunk's interface gives you a way to 'aim that bullet' once it leaves the chamber."
– Corey Shields

Giving back to the Splunk community

Corey plays a major operational role in some of the world’s most widely used open source software, so he sees an enormous potential to foster community-based troubleshooting via Splunk Base. Corey has a natural tendency to share his knowledge, and he also has real needs that Splunk Base addresses. OSL's constant staff turnover necessitates that interns store their hard-won knowledge where the next intern can find it while troubleshooting. With Splunk Base, the intern's knowledge is instantly shared worldwide. As they give, they also receive; when a new project brings in yet another new tool or application, OSL staffers find information about it already stored at Splunk Base by other admins at other sites.