There’s a big reason I haven’t blogged here for a while: Splunk 4. I’ve been so wrapped up in it for the last year that I haven’t really been interested in writing about anything else. Well, now it’s out, so I’m back! I’ll kick things off with some background on why 4 is the Splunk I’ve always wanted, and a little story about how my team and I have used Splunk ourselves in a new way over the past few days.
The aspect of Splunk 4 that I’m most excited about is all of the ways that it makes IT data accessible to everyone, regardless of their job.
I’ve been a data fanatic since I started my first software company job 17 years ago and worked on forecasting and order management systems. I wasn’t a developer, but I was able to build out quoting and forecasting systems and do in-depth analysis using FileMaker Pro and Excel.
Since then, I’ve been involved in building out systems that let users analyze IT data in one form or another for 10 of the last 12 years: first running a tools team for MSN at Microsoft, where my team spent millions of dollars developing a log-driven executive dashboard, then at a pioneering log management vendor that moved from web analytics into SIEM, and for the last 4 years at Splunk.
I’ve seen an unimaginable variety of functions and users that need some kind of information based on logs and other machine data. The further they are from software development or hands-on systems administration, the less aware they are that the information they’re seeking is in a logfile somewhere. And even technical people who know which log it would be in may not have permission to access it.
If such an access-deprived individual is lucky, they have the power or influence to get a sysadmin to pull the data for them. If they’re not just access-deprived but technically handicapped, they also need to prevail on that sysadmin to write some scripts to massage the data into information. Then they need to trust that the sysadmin understood the business logic well enough to do the analysis right. It’s like the old story of the hungry man being handed a 3-foot-long spoon.
Splunk 3 succeeded because it helped the access-deprived – which was huge in organizations hit hard by segregation-of-duties rules. But the non-technical user (or managers with technical chops but no time) still needed power users to run most analysis for them. Splunk 3 made it easier for technical users to fulfill the request, but sysadmins still resented the distraction and savvy managers still worried about what was lost in translation.
That was as true here at Splunk as anywhere else. When we shipped 1.0, our own sysadmin kicked the tires a bit but still grepped (yes, I admit it). Somewhere around 2.x, a real production setup continuously indexing all our website’s server, access, and error logs started to get frequent usage by our web developers and sysadmins to troubleshoot problems. Yet all the time I’d sit through executive, marketing, sales, product planning, and other meetings and listen to discussions where people were substituting guesses for facts – because the facts were buried in logs somewhere and our sysadmins were too busy to be burdened with one-off requests to run analysis.
As an example, I’d routinely ask Rachel, our Director of Documentation, for information about which docs topics were recently popular, trends in docs search engine referral terms, etc., as a guide to what we needed to fix in our product or processes. Sometimes I’d get the data, sometimes not, but it was always like pulling teeth. Even though Rachel and I are both technical enough to analyze a logfile the old way, we’d run into all kinds of roadblocks: switching docs platforms meant the logs stopped going to the system we were using; it was hard to set up a dashboard we could both see; the stats we needed required analyzing more data than Splunk 3 could handle on an ad hoc basis and we didn’t have permission to do any back-end config; we could do regexes but it took too much time to swap back into that way of thinking… Ultimately, we were both busy managers who would give up and go back to executing on our core jobs, without the information we really wanted. Exactly the same stories I’d hear from Splunk 3 customers about why there were still lots of groups that could benefit from Splunk but weren’t yet doing so.
The tide turned last Friday.
In prepping for the launch I googled “Splunk 4.0” to see if people were already talking about it online. Lo and behold! Our own beta documentation, which was supposed to be locked down to beta customers, was in the Google search results. It turned out that some special pages in the docs system let the Google crawler reach insecure versions of our beta docs at different URLs than you’d get by navigating our docs the regular way. A typical example of an unknown vulnerability in a web application’s security, just like the ones I hear about from our customers all the time.
As the business owner of this web app, the next thing I wanted to know was who had seen it that shouldn’t have, and what they’d seen, so I’d know whether it was a big deal and could decide on a course of action. Too bad our daily web stats wouldn’t give me any idea of traffic matching this very specific pattern – I’d need some custom analysis of the raw logs.
I started behaving like any hands-off manager would: writing email to our web producer and web developer asking them to pull the logs and do the analysis, and whipping up a storm with their bosses so they’d be given cycles to work on it. Then I stopped myself and logged into our live Splunk 4 instance instead.
I first searched for all referrals to the insecure URI pattern from google.com with search strings of “Splunk 4.0”. Almost nothing. Wait – that was my search phrase and the site I used, but other crawlers could have indexed these pages, and our pre-launch marketing used “Splunk 4”, not “4.0”. So I broadened my search to all referrals to these URIs from external domains to get a raw hit count – a few thousand. If I’d just gotten the total from our web guys and hadn’t been looking at the data myself, I probably would have accepted the wrong answer.
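In Splunk’s search language, the broadened search looked something like the sketch below. The sourcetype and URI pattern are illustrative stand-ins, not our actual config – the idea is simply to match hits on the exposed pages whose referrer was anywhere outside our own domain:

```
sourcetype=access_combined uri="/Documentation/beta/*" referer!="*splunk.com*"
```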
So where’d these hits come from? 4.0’s new search assistant told me a common next command was “stats”. I clicked to add it to my search and saw that an example of past usage by others on our Splunk instance was “| stats count by clientip”. OK. Click. Now it suggested “lookup” (new in 4.0). Click. Now it suggested “| lookup dnslookup clientip” – sounds promising. Click. As Splunk streamed in new client IPs to build my table on the fly, I saw familiar names pop up in the domain names – a lot of Splunk customers, and one competitor.
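Strung together, the search assistant’s suggestions amounted to a pipeline along these lines (the base search is again an illustrative stand-in; dnslookup is the IP-to-hostname lookup that ships with Splunk):

```
sourcetype=access_combined uri="/Documentation/beta/*" referer!="*splunk.com*"
| stats count by clientip
| lookup dnslookup clientip
```

Each stage streams into the next: the stats command collapses the raw events into one row per client IP, and the lookup adds a resolved hostname column to each row as results arrive.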
Now I wondered what they’d seen. I couldn’t tell from this simple statistic on the initial referred request whether they’d landed on one page and left, or navigated around to lots more pages. So I found (through the search assistant) examples of using stats to list URIs and added that to the stats command arguments.
I got my final result after just a few minutes – a table of results grouped by client IP and sorted by descending hit count, showing the first and last dates each had seen the special pages, their reverse-DNS hostname, the full sequence of URIs they’d viewed, and the referring domain and search query. The timeline at the top of the search view showed me that very few hits had happened before the launch webinar invitation went out. I shared an export of the results with impacted colleagues. We decided how to react based on complete information about the impact of the vulnerability. And I didn’t waste any of our web guys’ time while they were busy getting splunk.com ready for launch.
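The final search was along these lines – a sketch, with illustrative field and column names rather than the exact search I ran:

```
sourcetype=access_combined uri="/Documentation/beta/*" referer!="*splunk.com*"
| stats count, min(_time) AS first_seen, max(_time) AS last_seen,
        list(uri) AS pages_viewed, values(referer) AS referrers by clientip
| lookup dnslookup clientip
| sort - count
```

One stats command produces the whole table: per-IP hit counts, first/last times, the page sequence, and the referrers, with the DNS lookup and sort applied to the summarized rows rather than the raw events.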
But that was just the first of many uses over the next few days. Yesterday, the day of the actual launch, I was more interested in keeping watch over whether initial downloaders were having a good experience, if they were just downloading or were reading the docs, and, based on the docs usage and search terms, what features they were trying first and what features may have been giving them trouble.
Now, I’ve been asking for a dashboard with this information for a while. But, tired of asking, I just went ahead and built it. I was able to put all of this information on a new docs usage dashboard and share it with support, documentation and other colleagues – all through the UI using the new report and dashboard builders and Splunk Manager.
The dashboard helped us identify some confusion around the need to upgrade to 4.x licenses, which drove us to clarify the release notes and download page quickly. And now the whole docs team is enthusiastically using Splunk to better understand customers’ product and docs usage. They’re even planning to use examples of their own usage to illustrate topics in the manuals.
I’m looking forward to seeing this tide turn for all of our customers too as others realize they can now get their own answers to all sorts of questions they used to leave unanswered.
VP Product Management