Case Study: Intraware

Splunk has made it possible for us to quickly report on log exceptions, contributing to our and our customers' SOX compliance efforts.

- Steve Loyd, VP Operations

Solution Areas: Compliance, Log Management, Service Provider

Customer Profile

Based in Orinda, California, Intraware, Inc. (NASDAQ: ITRA) is a provider of Web-based electronic license and software delivery and management solutions through its signature service, SubscribeNet® (patent pending). SubscribeNet offers a unified digital goods management and delivery platform that powers business-to-business technology providers in North America and Europe such as Adobe, Business Objects, EMC, Hyperion Solutions, IBM, McKesson, and RSA Security. 99.6 percent of Fortune 500 companies and 90 percent of Global Fortune 1000 companies have downloaded software or license keys on the SubscribeNet platform.

Intraware's systems play a critical role not only in its own revenue recognition cycle, but also in those of its customers. Its many public customers are subject to Sarbanes-Oxley (SOX) IT audits. Intraware has consciously made strict IT process controls a core part of its value proposition.

To that end, Intraware has voluntarily set up a SAS 70 compliance program in addition to meeting Sarbanes-Oxley compliance mandates. SAS 70 (or Statement on Auditing Standards Number 70) doesn’t mandate specific controls; it’s an internationally recognized standard for how IT auditors document an organization’s process controls and the verifications of these controls that have taken place. Intraware seeks to implement the highest standards demanded by any of its customers and uses SAS 70 audits to prove that they are in place. SAS 70 audit statements streamline the communication about Intraware’s process controls to SOX audit teams at each of its customers.

Today, Intraware contracts with Deloitte & Touche to perform annual SAS 70 Type 2 audits involving more than a month onsite each audit cycle. Additionally, Intraware’s own audit firm performs IT audits twice a year.

Business Challenge

Steve Loyd, Intraware's VP Information Technology and Security, now considers compliance to be one of his biggest responsibilities. Compliance-driven activities have added a significant labor burden. New security controls sometimes conflict with the demands of keeping the service running.

Time for compliance round 2.

Two years into this arduous new compliance program, Steve was on the lookout for ways to automate manual compliance processes and resolve the conflicts between security and availability. He also hoped to be able to upgrade some key process controls, since that would improve Intraware’s strategic Sarbanes-Oxley compliance positioning. Compliance is clearly not a one-time thing for Intraware – it’s a part of life for their IT team and can always be made better and more efficient.

Yes, we're reviewing the logs every day. Well, some of them. And it's painful.

Log review and log access practices were prime candidates for improvement. Prior to implementing the SAS 70 program, one of the most common IT compliance questions that Steve fielded from customers was "what are you doing with logs – are you keeping them and reviewing them?" Intraware had adopted log review policies that consumed 30-60 minutes of Steve's own day, every day, yet only covered the network and server tiers. This was because there was no effective way to centralize or analyze application logs.

So secure we can't fix it fast enough.

At the same time, developers working on production errors needed to request application logs from system administrators, because security controls prevented them from accessing production hosts where these logs were kept locally. The request process delayed fixes to production and added labor cost for both development and operations.

When Steve discovered Splunk, he saw how he could expand his compliance log review and centralization to his entire environment while slashing the time he would have to spend actually performing the log review. And his developers would be able to access logs in real time without logging into production systems or involving system administrators.

Technical Requirements

System Architecture

Intraware’s environment includes an Apache web tier, WebSphere application tier, and Informix database tier running across Red Hat Linux and Solaris servers. Intraware’s network tier comprises Cisco, F5, Bivio and Checkpoint devices. Intraware's service is based on internally developed and supported applications with real-time web services and EDI connections to partner services. They use Big Brother for monitoring and Service-Now for incident management and ITIL process control.

Compliance Requirements

Sarbanes-Oxley IT compliance is focused on prevention and detection of financial reporting inaccuracies, fraud, and revenue-generating service interruptions. Auditors are therefore equally concerned with security and operations. Controls that auditors expect include:

  • Proactive log review. Auditors want IT managers to perform proactive review of log events that are known to indicate security and operations issues. They also want IT to review log events that are new to the environment – these indicate changes in systems and user behavior that should be investigated.
  • Log retention and accessibility. Logs necessary to investigate suspicious incidents and failures must be retained for ad hoc access.
  • Access controls and segregation of duties. Access to production systems must be restricted to systems administrators to show proper segregation of duties. Developers can’t log in to production hosts, even if they are responsible for troubleshooting production service problems.

Operations before Splunk

Intraware initially implemented a central log server using syslog-ng. Homegrown shell scripts scraped incoming logs for both "known good" and "known bad" message regular expression patterns and output two files daily: all log events that were known bad and all log events that were not explicitly known to be good. Steve then needed to spend 30-60 minutes each day manually reading through these exception files and using grep and text editors to skip through repeating events. He then needed to take the new messages that proved ok and update the "known good" pattern file with new regular expressions, so that he wouldn’t see those events again.
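
The two-file split described above can be sketched in a few lines. This is a minimal Python illustration, not Intraware's actual shell scripts: the pattern lists and the function name split_exceptions are hypothetical, since the real "known good" and "known bad" files are not shown here.

```python
import re

# Hypothetical patterns for illustration only; the real pattern files are not in the source.
KNOWN_GOOD = [re.compile(p) for p in (r"session opened for user \w+", r"CRON\[\d+\]")]
KNOWN_BAD = [re.compile(p) for p in (r"authentication failure", r"Out of memory")]


def split_exceptions(lines):
    """Return (known_bad, not_known_good), mirroring the two daily exception files."""
    known_bad, not_known_good = [], []
    for line in lines:
        # File 1: every event that matches a "known bad" pattern.
        if any(p.search(line) for p in KNOWN_BAD):
            known_bad.append(line)
        # File 2: every event that is not explicitly known to be good.
        # A known bad event usually lands here as well.
        if not any(p.search(line) for p in KNOWN_GOOD):
            not_known_good.append(line)
    return known_bad, not_known_good
```

Under this scheme, every newly cleared message means adding another regular expression to KNOWN_GOOD by hand, which is the maintenance burden the paragraph describes.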

Despite all this work, the review only covered logs from operating systems and network devices. Application logs, which would be key to catching fraudulent activity by authorized users or problems in application logic, didn’t lend themselves to syslog centralization or Steve’s homegrown scripts. They were too high volume and would have overwhelmed the review process, and each line of a multi-line stack trace or error would have been treated as a separate event.
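
To make the multi-line problem concrete, here is a hedged Python sketch of one common way to keep a stack trace together as a single event. It assumes that every new event starts with a timestamp of the form YYYY-MM-DD HH:MM:SS; that format, the function name group_events, and the sample lines are assumptions for illustration, not details of Intraware's logs or of how Splunk works internally.

```python
import re

# Assumption: a new event begins with a timestamp; any other line continues the previous event.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_events(lines):
    """Merge continuation lines (such as stack trace frames) into the event they belong to."""
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)          # start a new event
        else:
            events[-1] += "\n" + line    # attach the line to the current event
    return events
```

A line-oriented tool such as grep sees each stack frame as its own record, which is why the homegrown scripts could not handle these logs.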

Developers needing to access logs and other data on production systems were requesting logs from IT 8-10 times a week. The request process consumed 2-3 hours a week of IT time and delayed production issue resolution.

Splunk at Intraware

Splunk now centralizes and indexes both remote syslog and application logs across all of Intraware’s production hosts in real time. Its powerful interface and event classification technology accelerate Steve's daily review process and have replaced all of his homegrown scripts. Browser-based access makes it easier for everyone else on his staff and in development to access logs on an ad hoc basis.

Splunk was initially installed on the existing central syslog server. It indexes the incoming syslog messages as they arrive from both servers and network devices.

Splunk was then installed across production hosts that accumulate local application logging. These Splunk instances were configured to tail the existing log files and forward them over TCP to the central Splunk Server for indexing. In this forwarding configuration, Splunk maintains a tiny memory and CPU footprint so as to avoid any negative impact to production systems.
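
The tail-and-forward pattern described above can be illustrated generically. The Python sketch below is not Splunk's forwarder and makes no claim about its implementation; the function names read_new_lines and forward are hypothetical, and the caller is assumed to supply an already connected TCP socket.

```python
import socket


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended to path since offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partial trailing line waits for the next poll.
    end = data.rfind(b"\n") + 1
    return data[:end].splitlines(), offset + end


def forward(lines, sock: socket.socket):
    """Send each line, newline-terminated, over an already connected socket."""
    for line in lines:
        sock.sendall(line + b"\n")
```

A forwarder built this way only remembers a byte offset per file and polls for appended data, which is one reason such a process can stay small on a production host.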

Splunk automatically classifies events based on their structure and keywords and lets users tag event types with their meaning and significance. Steve has leveraged this automatic classification and tagging for his daily log review. When he first got started, he tagged every event type in his system either "ok" or "not_ok" based on an initial search and a comparison with his old "known good" and "known bad" files.

"Splunk has made our daily log review a lot easier because of its automated event typing and intelligent grouping of events. Splunk’s cut the time it takes for us to do this from an average of an hour a day down to 10-20 minutes most days."
– Steve Loyd

Scenarios

Review new types of log events.

Steve has set up a daily Live Splunk to look for new event types. This Live Splunk looks for all event types tagged neither "ok" nor "not_ok."

Splunk> NOT eventtypetag::ok NOT eventtypetag::not_ok daysago::1

He then reviews the events one type at a time by using Splunk's results-by-event-type view. He can quickly add the "ok" tag to event types that appear innocuous. If a given event type needs more investigation, he can see how it trends over the course of the day, navigate to other events around the same time, and filter by just clicking on any term in the raw events. If he wants someone else on his team to take a look at some events, he can email them a permalink to the same search results.

The next day's run of the "new event" Live Splunk will skip what he’s newly tagged "ok." A record showing that he has viewed the "new event" search results and explored them thoroughly is saved to his search history log. The history log can be produced for auditors to show that the log review process is being conducted daily.
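
The logic of this daily review can be modeled as a filter over tagged event types. The Python sketch below is only a conceptual model, not Splunk's search language or internals; the dictionary layout, the function name untagged_event_types, and the event type names in the usage note are hypothetical.

```python
def untagged_event_types(tags_by_type):
    """Return the event types tagged neither "ok" nor "not_ok", i.e. those still needing review."""
    reviewed = {"ok", "not_ok"}
    return sorted(name for name, tags in tags_by_type.items() if not tags & reviewed)
```

For example, given {"sshd_login": {"ok"}, "disk_full": {"not_ok"}, "new_thing": set()}, only "new_thing" comes back. Once it is tagged "ok", the next run returns nothing for it, which matches how the following day's search skips newly tagged event types.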

Alert on known bad log events.

Similarly, Steve has another daily Live Splunk that looks for event types he's already tagged "not_ok." This Live Splunk sends him an email, but he's planning on using the shell script option instead to send each event as an Incident in Service-Now so his team can respond.

Investigate production issues.

Developers log in to Splunk directly to investigate production issues. They can search for the IP addresses of customers complaining about performance issues, search for specific errors, and navigate logs by time and keyword to narrow down problems. They see up-to-the-minute log events without having to make requests to operations or log in to production hosts.