Last week we continued our road show launching Splunk 4 through the Southwestern US in Phoenix, San Diego and Los Angeles. This was our second annual gathering of customers, partners and users, and we had more than double the attendees at this year’s Splunk Live events. In the morning we held a three-hour hands-on technical workshop. Attendees had the opportunity to install and configure Splunk 4 on their laptops or a remote server and get one-on-one assistance from the Splunk team. Afternoon sessions and dinner focused on customer presentations. We’re very grateful to all the presenters who took time out of their busy days to share with everyone how Splunk is transforming their IT environments. I captured some notes from the week and thought I’d share them with you.
In Phoenix we had a packed house at the Sanctuary conference center on the side of Camelback Mountain. At 109 degrees I decided against hiking up it in the early AM. Dave Bridgeman, Data Security Engineer at Early Warning, kept things cool by showing the audience how his company uses Splunk in its security operations center. Early Warning collaborates with major financial services companies to facilitate fraud detection through shared information and knowledge in cross-institution environments. The company has an interesting history, having spun out of First Data, and is now primarily owned by Bank of America, BB&T, JPMorgan Chase and Wells Fargo.
Dave is a well-rounded IT professional who started as a developer, then moved into network and security management. He currently leads the data security team for Early Warning. The environment he oversees spans a variety of platforms including AS400s, MP300s, AIX, Solaris, Linux and Windows. He uses a combination of Splunk forwarders and syslog forwarders to collect Java and COBOL application logs and FTP/SFTP networking logs.
The Early Warning Splunk installation is designed to track transactions and users from one bank to the next in cross-institution activities. Transaction ID tracing correlates events across applications and services, and Splunk alerts the team when jobs fail so the operations and development teams can securely troubleshoot issues on the fly. Remote accessibility also means no more driving into the office to access locked-down servers in the middle of the night. On the security side of things, Splunk helps Dave’s team track and monitor known fraudsters and bad user names, allowing them to stay vigilant when monitoring external attacks. They also use Splunk to deliver reports for customers, executive committee members and the Security Advisory Committee (with representatives from the founding banks).
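To make the idea of transaction ID tracing concrete, here is a minimal Python sketch. It is not Early Warning's configuration or a Splunk API; the event fields (`ts`, `app`, `txn_id`, `status`) and the sample application names are assumptions made up for illustration. It groups events from different applications by a shared transaction ID and flags any transaction that contains a failed step.

```python
from collections import defaultdict

def group_by_transaction(events):
    """Group events from different applications by a shared transaction ID."""
    traces = defaultdict(list)
    for event in events:
        traces[event["txn_id"]].append(event)
    # Order each trace by timestamp so the path through the services is readable.
    for trace in traces.values():
        trace.sort(key=lambda e: e["ts"])
    return dict(traces)

def failed_transactions(traces):
    """Return the IDs of transactions in which any step reported a failure."""
    return sorted(
        txn_id
        for txn_id, trace in traces.items()
        if any(e["status"] == "FAIL" for e in trace)
    )

# Hypothetical events; field names and values are illustrative only.
events = [
    {"ts": 2, "app": "sftp-gateway", "txn_id": "T1", "status": "OK"},
    {"ts": 1, "app": "java-frontend", "txn_id": "T1", "status": "OK"},
    {"ts": 3, "app": "cobol-batch", "txn_id": "T2", "status": "FAIL"},
]

traces = group_by_transaction(events)
print([e["app"] for e in traces["T1"]])
print(failed_transactions(traces))
```

The same grouping is what makes alerting on failed jobs useful: once a failure is tied to a transaction ID, every related event from the other services is one lookup away.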
Henry Grant of Amkor, a $2.1B provider of packaging/assembly and testing services for the semiconductor industry, also presented an overview of how his Corporate Data Center team uses Splunk. Henry oversees operations for the company’s SAP, PLM, Supply Chain, Hyperion and Oracle systems. Amkor has a heterogeneous environment of Sun Solaris, IBM iSeries, Cisco ASA firewalls, packaged and custom web and J2EE applications and TACACS/RADIUS accounting and access control technologies. With manufacturing locations in China, Japan, Korea, Taiwan, Singapore and the Philippines and headquarters in Chandler, AZ, the Amkor team is challenged with log and event data overload. Gigabytes of data a day, generated at multiple points, make operational troubleshooting and security investigations extremely complex.
Proving SOX compliance had traditionally been handled by writing and maintaining scripts to collect and report on errors, access controls and log access activities. It was impossible to segregate duties given the lack of access control to the logs and events themselves. Splunk has taken the place of the awkward script writing and maintenance: it collects iSeries, Unix and application events and logs and provides automated scheduled reports. The team is now expanding the Splunk footprint to handle network and Oracle logs as well.
Application and System Monitoring
Like most enterprise IT shops, Amkor has figured out that traditional point monitoring tools aren’t enough: they have a hard time scaling to all the modern-day technologies, require intrusive agents, and only work for known events while missing anomalies and unknowns. Too many issues end up being reported by end users themselves rather than by the monitoring systems. With Splunk, Henry’s team detects event anomalies in real time and has cut their response time by hours per incident.
Tools for the Help Desk
Sometimes it’s the simple things that can cut your response time, escalations and IT budget. The Amkor team noticed a lot of calls and emails regarding VPN set-up and access across the company. With Splunk, level 1 help desk agents are now able to resolve most of the VPN issues without creating an escalation. Henry’s team built a VPN dashboard driven by a series of searches and reports that gives entry-level help desk personnel the insight they need to troubleshoot problems right away.
Henry’s Splunk Tips
The best part of Henry’s overview was the tips for a successful Splunk implementation. I’ve included the list here in hopes that these may help you as well.
- Provide training that caters to each group’s need.
- Utilize the Deployment Server.
- Develop a Common Information Model.
- Update and change as needed.
- Use Tagging to Normalize Data.
- Monitor Scheduled Compliance Reports by using the Audit Logs.
- Integrate Splunk into your processes where possible.
- Set up a Test/Dev Environment and a Test/Dev Index.
Intuit Consumer Group
The Intuit team of Jeff Ludwig, Chief Architect, and Larry Raab, Architect of the Consumer Group, joined us to share how they use Splunk in production support operations. Jeff leads the Consumer Group’s Connected Services Development for electronic and print tax and payroll filings for TurboTax, ProSeries, Lacerte and QuickBooks. Larry is a large-scale, highly available application and systems architect responsible for the Consumer Group applications and infrastructure.
While the original use for Splunk at Intuit was application management, Jeff and Larry covered three additional ways they have applied Splunk including reliable monitoring, improving user experience and large-scale reporting for compliance and business intelligence.
The problem facing Intuit’s Consumer Group is very common: several services, dozens of machines per service, dozens of log files per machine. Tracking down error logs took hours, and correlation across logs and services was nearly impossible. With Splunk the team finds answers in minutes, keeps developers off of production machines and can now correlate across the entire organization and environment – something that is providing them with incredible new insights.
Jeff and Larry summed up their legacy monitoring systems in this way, “Monitoring tells us WHAT, but Splunk tells us WHY.”
The Intuit Consumer Group team uses lots of other monitoring and alerting tools for networking, servers and applications, but Splunk tends to be more reliable and is the most powerful in terms of features and speed. But the biggest advantage Jeff and Larry see to integrating Splunk with their current monitoring systems is that they can create ad-hoc alerts with Splunk – getting smarter about their environment on the fly.
Improving User Experience
For Intuit’s Consumer Group, when it comes to tax and payroll offerings every transaction completion is critical. But, each transaction goes through several services and many different technologies. Splunk consolidates disparate pieces of the transaction environment so the team knows when something goes wrong and how to fix it.
As Jeff points out, “With Splunk we’re getting more intelligent about our users’ behavior so we can offer them a smarter and better experience.”
Consolidating the Consumer Group’s messages, events and logs has finally made reporting easier and faster for:
- Internal data and security audits.
- Financial audits.
- Operational metrics and statistics to plan future deployments and developments.
Thursday we headed up the 405 from San Diego to LA for the last stop of our Southwest tour. The W Hotel in Westwood was once again the location for our second annual Splunk Live LA. It was a lively scene around the hotel, which is just blocks from the Federal building where a police chase ended up in a day of traffic snarls, with helicopters hovering noisily overhead all day.
Fortunately we had Jon Hart, Manager of Production Engineering at Edmunds, and Jeremy Custenborder, Senior Performance Architect at MySpace, to share how they have deployed and are using Splunk.
We were fortunate enough to have Jeremy Custenborder, a Splunk fan and Senior Performance Architect at MySpace, drop by to share his experiences identifying and troubleshooting performance issues with Splunk. Jeremy is responsible for performance management across multiple datacenters and thousands of database, web, indicator, index and cache servers and switches, routers and load balancers for MySpace.com.
Lots of MySpace friends generate gigabits of traffic at a time and Jeremy makes serious use of Splunk to keep on top of overall site performance.
Jeremy says, “Unstructured data rocks!” I happen to agree with him. His advice is to get the data into Splunk, then figure out what to do with it.
His Splunk installation includes four indexers per datacenter on a 1 GB network with RAID 1+0 volumes, four cold storage servers on a 1 GB network with RAID 6 volumes, and two distributed search servers.
Data gets into Splunk in a variety of ways: Unix servers use syslog, Windows servers use a custom MySpace agent, and .Net applications make use of a Splunk log4net appender that Jeremy wrote and has published for others to use as well. The Splunk log4net appender provides both UDP and TCP based transport with failure detection and dynamic configuration via DNS. Why didn’t I think of that? DNS makes total sense for forwarder configuration.
You can download the Splunk log4net appender. It is available for use under an MIT license.
Today Jeremy has Splunk performing real-time alerting on error data and searching for patterns of suspicious behavior, and he uses data from Splunk to recreate errors in development environments. He plans to start building custom dashboards with data specific to each development team and is busy integrating the MySpace performance monitoring system with Splunk to get early detection of new trends and provide fast right-click investigation from the performance console.
Edmunds has been using Splunk for almost two years now, primarily in fraud and security operations. The company is an incredible resource for automotive consumers and enthusiasts. Jon is a self-professed Security Ninja and SysAdmin who enjoys racing cars and mountain bikes when he’s not Splunking security incidents. Data comes into Splunk via syslog, a custom agent for Windows event logs and .Net application data via a custom log4net appender Jeremy wrote and has published.
Edmunds has more than a thousand devices and servers powering their business with many different logging mechanisms and locations.
Like many enterprises they previously built their own log analysis tools but have replaced those efforts with Splunk. In Jon’s words, “we’ve got better things to be doing around here!”
The Edmunds Splunk environment consists of:
- 11x 8-core, 64-bit, 16-32G RAM, 300G 15k RPM local disk, 2T NFS (3.4)
- 6 indexers, 2 Splunkweb (1 corporate, 1 production)
- ~60-70G/day, increasing to ~100+G/day soon
- NFS, syslog-ng, Splunk forwarders
- Apache, WebLogic, F5, Oracle, Web Crossing, a metric ton of syslog
- 9 sources, 6 sourcetypes, ~1000 hosts
- distributed search (Splunkweb, CLI) across all indexers
- Centralized Splunk management FTW
- 10 classes outside of per-machine classes
- LDAP + AD integration, per-group authorization and
So what are Jon and Edmunds doing with this set-up?
Real-time Alerting and Historical Trending
Edmunds uses Splunk to monitor the good, the bad and the ugly. Good includes traffic trends, which are tracked and reported on to ensure revenue. Bad consists of port scans, aggressive spidering by search engines and other bots, and device failures. And ugly is of course anything that disrupts Edmunds’ revenue and makes IT look bad.
Developers, engineers, admins, analysts and even managers have visibility into everything. For every application, there are easy Splunk forms for things like errors by environment, host or time, including cross-application views (think web tier <-> app tier correlation).
For everything that logs data, Edmunds appends a few simple pieces of data that make everyone’s job a lot easier. I’ve never seen an organization so organized with their logs and events!
- Environment (PROD, TEST, QA, DEV, etc)
- Tier (App, Web, DB, Admin, etc) and
- Normalized source name (“apache” instead of /var/log/httpd/…)
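As a rough illustration of this kind of tagging, the Python sketch below prepends environment, tier and a normalized source name to a raw log line. It is not how Edmunds implements it; the `env=`/`tier=`/`source=` key names, the path patterns and the helper functions are all assumptions made up for this example.

```python
import re

# Hypothetical mapping from raw log paths to short, normalized source names.
SOURCE_PATTERNS = [
    (re.compile(r"/httpd/|/apache"), "apache"),
    (re.compile(r"weblogic", re.IGNORECASE), "weblogic"),
]

def normalize_source(path):
    """Map a log file path to a short source name, or 'unknown' if none matches."""
    for pattern, name in SOURCE_PATTERNS:
        if pattern.search(path):
            return name
    return "unknown"

def annotate(raw_line, path, environment, tier):
    """Prefix a raw log line with environment, tier and normalized source."""
    source = normalize_source(path)
    return f"env={environment} tier={tier} source={source} {raw_line}"

print(annotate("GET /index.html 200", "/var/log/httpd/access_log", "PROD", "Web"))
```

With every event carrying the same three keys, a search such as "all errors in PROD on the Web tier" no longer depends on knowing where each application writes its logs.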
Using this simple organization and a few Splunk search commands, Edmunds drives a series of trend reports, such as daily and weekly “Top X” error reports for the Web and Application tiers. These trends also keep an eye on the complete build process by monitoring error diffs between data and build numbers, allowing Edmunds to catch errors before production code rolls. Developers, not administrators, can now monitor and diagnose errors during the development process more effectively. Recently this type of diagnosis and trending has even been used to prioritize development tasks. For example, when someone complained that a particular feature didn’t work with a particular version of Microsoft Internet Explorer, the developer in charge used Splunk to become the voice of reason, discovering the issue impacted only 0.06% of traffic to Edmunds’ web sites.
Edmunds has taken a similarly simple approach to organizing their security logs and events, normalizing data from Cisco devices, Netscreens, Sourcefire and Access Control Systems. Normalized fields include src_ip, dst_ip, src_port, dst_port, and protocol. So searches like startdaysago=1 src_ip=22.214.171.124 dst_port=80 will work regardless of log format. Now Jon can easily answer the question of “Who done it?” Without a single source for all security data and cross-device correlation, this used to take a long time and was often impossible.
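Here is a minimal Python sketch of what field normalization buys you. The two log formats below are invented for illustration and do not match real Cisco, Netscreen or Sourcefire output; only the normalized field names (src_ip, dst_ip, src_port, dst_port, protocol) come from Jon's talk. Once both formats are mapped onto the same fields, one filter works across all of them.

```python
import re

# Two made-up log formats; real device formats differ.
FORMAT_A = re.compile(
    r"(?P<protocol>\w+) (?P<src_ip>[\d.]+)/(?P<src_port>\d+)"
    r" -> (?P<dst_ip>[\d.]+)/(?P<dst_port>\d+)"
)
FORMAT_B = re.compile(
    r"proto=(?P<protocol>\w+) src=(?P<src_ip>[\d.]+):(?P<src_port>\d+)"
    r" dst=(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)"
)

def normalize(line):
    """Extract the common fields from a line in either format, or return None."""
    for pattern in (FORMAT_A, FORMAT_B):
        match = pattern.search(line)
        if match:
            fields = match.groupdict()
            fields["protocol"] = fields["protocol"].lower()
            fields["src_port"] = int(fields["src_port"])
            fields["dst_port"] = int(fields["dst_port"])
            return fields
    return None

lines = [
    "Deny TCP 10.0.0.5/51000 -> 192.0.2.10/80",
    "action=drop proto=udp src=10.0.0.7:5353 dst=192.0.2.10:53",
    "Deny TCP 10.0.0.5/51001 -> 192.0.2.11/443",
]

events = [e for e in map(normalize, lines) if e is not None]
# The same filter now applies to every event, whatever device produced it.
hits = [e for e in events if e["src_ip"] == "10.0.0.5" and e["dst_port"] == 80]
print(hits)
```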
Before and After
Jon offered this comparison as an example of life before and after Splunk. Edmunds makes heavy use of HTTP logs for all kinds of work. Recently an HTTP log from 6/5/2009 (7G compressed, 60G uncompressed, 115M events) was used with the goal of finding the top 10 referrers generating 404 (not found) errors. Before Splunk, a gzip/grep/awk/sort pipeline took about 7 minutes. With Splunk he can index, search and sort in a mere 58 seconds. Summary indexing in Splunk reduces that to 13 seconds. And this is all on Splunk 3.10. When Jon migrates to Splunk 4 he will be 5 to 10 times faster still.
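For readers who want to see what the "top 10 referrers generating 404s" question looks like in code, here is a small Python sketch. It is neither Jon's pipeline nor a Splunk search; it assumes the Apache "combined" log format, and the sample lines are fabricated for illustration.

```python
import re
from collections import Counter

# Apache "combined" format: ... "request" status bytes "referrer" "user-agent"
LINE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')

def top_404_referrers(lines, n=10):
    """Count referrers on lines with status 404 and return the n most common."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("referrer")] += 1
    return counts.most_common(n)

# Fabricated sample lines in combined log format.
sample = [
    '192.0.2.1 - - [05/Jun/2009:10:00:00 -0700] "GET /missing HTTP/1.1" 404 512 "http://example.com/a" "Mozilla/5.0"',
    '192.0.2.2 - - [05/Jun/2009:10:00:01 -0700] "GET /gone HTTP/1.1" 404 512 "http://example.com/a" "Mozilla/5.0"',
    '192.0.2.3 - - [05/Jun/2009:10:00:02 -0700] "GET /old HTTP/1.1" 404 512 "http://example.org/b" "Mozilla/5.0"',
    '192.0.2.4 - - [05/Jun/2009:10:00:03 -0700] "GET / HTTP/1.1" 200 1024 "http://example.com/a" "Mozilla/5.0"',
]

print(top_404_referrers(sample))
```

A script like this rescans the whole file on every run, which is the cost that indexing the data once avoids.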
Summary indexing is a great way to calculate ongoing stats in Splunk and Edmunds makes use of it not just for referrers but for status, method, URI, and UserAgent. Then they combine summary indexes for status, method, URI and referrer across WebLogic, Oracle, Tomcat and Apache to baseline different types of transactions and monitor anomalies.
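The idea behind summary indexing can be sketched in a few lines of Python. This is a conceptual illustration only, not Splunk's implementation: raw events are rolled up once into small per-hour counts, and later questions are answered from the roll-ups instead of the raw data. The event fields (`ts`, `status`, `referrer`) and the one-hour bucket size are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def summarize(events, bucket_seconds=3600):
    """Roll raw events up into per-bucket counts keyed by (status, referrer)."""
    summary = defaultdict(Counter)
    for event in events:
        bucket = event["ts"] - event["ts"] % bucket_seconds
        summary[bucket][(event["status"], event["referrer"])] += 1
    return summary

def top_referrers(summary, status, n=10):
    """Answer a 'top referrers for this status' question from the summaries only."""
    totals = Counter()
    for counts in summary.values():
        for (event_status, referrer), count in counts.items():
            if event_status == status:
                totals[referrer] += count
    return totals.most_common(n)

# Fabricated events: timestamps are seconds since an arbitrary start.
raw_events = [
    {"ts": 10, "status": 404, "referrer": "http://example.com/a"},
    {"ts": 20, "status": 404, "referrer": "http://example.com/a"},
    {"ts": 3700, "status": 404, "referrer": "http://example.org/b"},
    {"ts": 3800, "status": 200, "referrer": "http://example.com/a"},
    {"ts": 7300, "status": 404, "referrer": "http://example.com/a"},
]

summary = summarize(raw_events)
print(sorted(summary))
print(top_referrers(summary, 404))
```

The summaries are far smaller than the raw events, which is why a query over them can return in a fraction of the time, and the same roll-ups can be reused for baselining and anomaly checks.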
The Bottom Line
Even though Jon is highly technical, he has been incredibly effective at translating the benefits Splunk brings to Edmunds into business terms. He’s learned this is the only way IT gets to make new investments. He justified the purchase of Splunk by demonstrating it has drastically reduced MTTR for revenue-impacting incidents and helped ensure a steady flow of online ad revenue from the four Edmunds Web sites. But the IT and Security teams at Edmunds know there are a number of other advantages. Continuous improvement through automated error reporting and trending, elimination of the “log god” bottleneck, much more productive cross-team debugging and investigations, and being able to satisfy that “I wonder if . . .” curiosity in the everyday course of doing their jobs all make their work a lot easier.
What’s Next at Edmunds?
The Splunk deployment continues to move forward at Edmunds. On Jon’s list of improvements for the next several months are
- Dedicated summary indexers.
- Longer retention periods.
- Double indexing volume by 2010 (more RAM, more storage).
- Windows event log.
- Splunk 4.0 migration.