I talk to many sysadmins every week who have started to use Splunk and know that it’s saving them time and helping their company avoid problems. Now it’s time to ask the boss for the money to buy it. In many companies, the boss then wants to “see the ROI.”
A lot of admins are stymied by this – while it’s intuitive to them that they and their colleagues spend a lot of their day looking at logs, and it’s also evident that Splunk makes it go faster and better, they don’t really know how to go about quantifying it.
Never fear – if the benefit is clear to you, it’ll be easy to document in an ROI analysis. This post will explain the basic model we’ve developed to calculate ROI here at Splunk (email us if you want to get access to the model itself), and offer some tips & tricks for getting solid data to plug into it.
The major benefits most admins see and want to articulate are:
- Reduced downtime: how much less downtime you’ll have because problems are solved faster, don’t recur, or are caught proactively – and how much each avoided hour of downtime is worth, based on revenue, SLA penalties, or productivity.
- Reduced labor: how much less time to investigate incidents because logs are more accessible and can be searched faster/more effectively.
The major inputs you need include:
- How many incidents per week, on average, do you get that require looking at logs and/or cause downtime? It’s a good idea to come up with 2-5 major profiles of incidents and a separate total for each. For example, a trading firm we’ve worked with distinguishes between routine “did this trade go through?” investigation requests, and “the system is not executing some trades – what happened?” troubleshooting requests.
- On average, how much does each type of incident impact availability? Total or partial outage?
- How long does each type of incident take to resolve? How many people are occupied what percent of their time while it is being worked on? What types of people are involved? How much do they cost?
- Is your app something internal users need and/or is it something that generates revenue from external customers? How much revenue does your app generate?
- How much labor time is being spent maintaining homegrown tools?
Once you have these basic inputs, you can calculate a pretty good baseline of how incidents are impacting availability and labor cost today – the “do nothing” scenario. This alone is good management info to have, regardless of whether you buy Splunk.
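To make the arithmetic concrete, here’s a minimal sketch of that “do nothing” baseline in Python. Every number in it – the incident profiles, rates, and downtime costs – is a hypothetical placeholder you’d replace with your own data, not a Splunk benchmark:

```python
# Hypothetical "do nothing" baseline: what incidents cost today.
# All figures below are illustrative placeholders, not real benchmarks.

incident_profiles = [
    # (name, incidents/week, hours to resolve, people involved,
    #  loaded hourly labor cost, downtime hours, cost per downtime hour)
    ("routine investigation",   10, 1.0, 1, 75, 0.0,    0),
    ("troubleshooting outage",   2, 4.0, 3, 90, 2.0, 5000),
]

WEEKS_PER_YEAR = 52

def annual_baseline_cost(profiles):
    """Sum annual labor and downtime cost across all incident profiles."""
    total = 0.0
    for name, per_week, hours, people, rate, dt_hours, dt_cost in profiles:
        labor = per_week * WEEKS_PER_YEAR * hours * people * rate
        downtime = per_week * WEEKS_PER_YEAR * dt_hours * dt_cost
        total += labor + downtime
    return total

print(f"Annual baseline cost: ${annual_baseline_cost(incident_profiles):,.0f}")
```

Even with made-up numbers like these, the structure shows why downtime cost usually dominates the total – which is worth pointing out to the boss.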
From there, quantifying Splunk’s benefit is a matter of estimating:
- How many fewer of each type of incident you will have because you are proactively looking at logs and also finding root cause the first time each problem happens.
- How much faster each type of incident will be resolved – which impacts both labor cost and availability.
- How many fewer people will be involved in each incident, and how their profile will change because individual admins can see and understand the whole picture, and possibly tier 1 / help desk can do more.
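One way to turn those three estimates into a dollar figure is to apply them as percentage reductions to the baseline. The reduction factors and baseline figures below are purely hypothetical assumptions you’d justify from your own post-mortems:

```python
# Hypothetical savings model: apply estimated reductions to a baseline.
# The percentages are illustrative assumptions, not measured results.

baseline_annual_labor = 150_000       # from your "do nothing" calculation
baseline_annual_downtime = 1_000_000

fewer_incidents = 0.20    # 20% fewer incidents (proactive review, root cause fixed)
faster_resolution = 0.40  # 40% faster resolution of remaining incidents
fewer_people = 0.25       # 25% fewer person-hours per incident

# Labor reductions compound: fewer incidents, each resolved faster by fewer people.
labor_after = (baseline_annual_labor
               * (1 - fewer_incidents)
               * (1 - faster_resolution)
               * (1 - fewer_people))

# Downtime shrinks with incident count and resolution time.
downtime_after = (baseline_annual_downtime
                  * (1 - fewer_incidents)
                  * (1 - faster_resolution))

annual_savings = ((baseline_annual_labor - labor_after)
                  + (baseline_annual_downtime - downtime_after))
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```

Note the design choice: the factors multiply rather than add, which keeps the estimate conservative and avoids double-counting the same saved hour.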
At this point you’re probably saying “Well, that’s great, Christina, but I don’t know where to get all that data!” Actually, it’s probably much easier than it seems. Here are some tricks I’ve learned over the years that work when trying to get these numbers out of different organizations.
Tips & Tricks for Getting ROI Data
- Focus on what you know best. Sure, Splunk is going to impact everyone who looks at logs in every context. Don’t try to measure it all up front. Identify those specific types of incidents you personally understand the best, where the data is most readily accessible, and you are most confident of the benefit. If you’re a DBA and the only routine log analysis tasks you know are 1) proactive review of database updates/inserts and 2) troubleshooting slow transaction queries, get numbers on just those two things. You’ll end up with a fairly conservative number but even that number will probably more than justify the purchase while showing that you’re not being speculative.
- Query your ticketing or IT workflow systems. If your company tracks escalations from the help desk to senior admins via a ticketing system like Remedy or uses workflow features in its systems management platform such as Tivoli or Unicenter, you can probably run a report for how many incidents of different kinds you’ve had and how long the tickets were open. These systems probably undercount unofficial escalations, and may not have particularly detailed notes about what kind of analysis was done, but can still be pretty useful.
- Use salary surveys. You probably don’t know (or shouldn’t circulate) what different people in your IT organization make. So how do you quantify labor cost savings? Just use industry averages from IT salary surveys, with geographic adjustments where appropriate. InfoWorld does the most comprehensive survey each year, and even included this handy calculator in this year’s survey! The number might vary a bit from your organization, but you are at least presenting a credible estimate from a credible source – your boss can do the +/-10% math in her head if need be.
- Do a top-down sanity check. The model I’m describing here is pretty bottom-up – calculating numbers of incidents, costs per incident, etc. It’s a good idea, especially when presenting to very senior management, to balance this with something coming down from the top level. IBM has done some useful research to determine that 30-70% of IT professionals’ time in medium to large companies is spent dealing with problems. Take the low side of that range – 30% – multiply it by the average IT salary per the InfoWorld study, and multiply that by the number of IT people in your company. Then take a conservative estimate of Splunk’s impact on that time. It’s a good high-level check that your bottom-up number isn’t too high. It can also help you make the case that this is a top priority.
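That top-down check is simple multiplication. In this sketch, the headcount, salary, and impact figures are hypothetical placeholders; only the 30% comes from the IBM range cited above:

```python
# Top-down sanity check (illustrative placeholder numbers).
it_headcount = 50
avg_loaded_salary = 100_000   # hypothetical average from a salary survey
problem_time_share = 0.30     # low end of the 30-70% range cited above
splunk_impact = 0.25          # hypothetical conservative estimate of time saved

annual_problem_labor = it_headcount * avg_loaded_salary * problem_time_share
top_down_savings = annual_problem_labor * splunk_impact
print(f"Problem-handling labor:      ${annual_problem_labor:,.0f}")
print(f"Top-down savings estimate:   ${top_down_savings:,.0f}")
```

If your bottom-up total lands well above this figure, revisit your per-incident assumptions before presenting them.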
- Model a few specific recent incidents. Do a post-mortem on the most recent representative incident for each incident category where you’re thinking of using Splunk. Interview everyone involved about exactly what they did and how much time they spent. How many minutes did it take to run the scp job to move the relevant logfiles to a central workstation for analysis? How many separate log requests did admins handle on behalf of developers investigating the incident? How long was the service impacted? What percentage of transactions failed? What lost revenue did those transactions represent? Then model how things would be different if you had Splunk. Document it and use it as backup for your averages and projections in your ROI analysis.
VP Product Management