Let me tell you a little story about something I learned (or re-learned!) today. If you’re impatient, you can skip ahead: read Jack’s previous article on building technology add-ons, and go learn CIM (the Common Information Model). I’ll put some other resources at the end as well.
The silly thing I have to admit first is that I thought I knew this stuff. I’ve been involved in making data models for the CIM app, for cryin’ out loud! Anyway, to the story…
In my prior role in business development as a solution architect, and now as a developer evangelist, I frequently work with ISVs, IHVs, SIs and others who want to integrate their stuff with Splunk. This week, I was reviewing the code that two partners had written. Both of the products in question happen to have syslog data that is being indexed in Splunk. No big deal, this is like Splunk 101, right? Well, it turns out that both of the apps had been written in such a way that they overrode the sourcetype in Splunk.
And both apps use a pretty broad regular expression (in transforms.conf) which captures most every syslog event…
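To make the problem concrete, here is a rough sketch of what an index-time sourcetype override typically looks like. The stanza names and the `producta:syslog` sourcetype are invented for illustration; this is not either partner's actual config.

```ini
# props.conf (hypothetical: applied to events arriving with sourcetype "syslog")
[syslog]
TRANSFORMS-force_sourcetype = producta_force_sourcetype

# transforms.conf
[producta_force_sourcetype]
# A deliberately broad pattern: it matches essentially every event
REGEX = .
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::producta:syslog
```

If a second app ships the same kind of stanza with an equally broad REGEX, both apps are competing to rewrite the same events, and each event can only end up with one sourcetype.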
And looking at the data in question, there really are no unique strings which could be used to reliably separate things into two distinct sourcetypes!
Ok, what to do? The data has to be split apart in some fashion in order for each app’s searches and dashboards to function as designed.
I will admit that I spent an embarrassing amount of time figuring out how to split the data. I considered changing the regexes, but the data just wasn’t unique enough. I considered asking one partner to use the other’s app as a base, but that had a few drawbacks. I then thought about suggesting that the partners use other fields for their dashboards, such as tag or eventtype. But my brain was still not in the right place, because I kept thinking of how to split the data into two pieces. Sure, you could have the customer (that’s you guys) tag all of Product A’s syslog events as “ProductA” by IP address or whatever, but that would introduce some significant install-time work, plus ongoing maintenance to keep the tags accurate.
That’s when it hit me. We don’t want to split the data at index-time by vendor/product at all. That would work, but it’s not looking at the data the right way. The better thing to do here is to put the data into meaningful categories by its type or function, and write searches based on that first, and as needed, filter by vendor or product second. This still means using tags, but not in a binary “you or me” way. In SPL, the old way is:
And the new way is
Or in some cases, maybe all you need is “tag=syslog”, especially if the data really is hard to distinguish from other types of data.
So this means no longer altering the data irrevocably at index-time (which is how sourcetype overriding works). Overriding sourcetype (or other indexed fields) will probably still be useful and needed in many cases. But tags can be layered, and they are search-time, not index-time. This gives you tons of extra flexibility and is really where Splunk shines.
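As a rough sketch of the search-time approach, an app could ship an event type and a couple of tags instead of a sourcetype override. Everything named here is hypothetical, and the host filter is only a placeholder for however Product A’s events can be identified in a given environment.

```ini
# eventtypes.conf (hypothetical event type; the host filter is a placeholder)
[producta_syslog]
search = sourcetype=syslog host=producta-*

# tags.conf
[eventtype=producta_syslog]
syslog = enabled
producta = enabled

# Searches can then lead with function and narrow by product only when needed:
#   tag=syslog
#   tag=syslog tag=producta
# instead of depending on an overridden sourcetype such as:
#   sourcetype=producta:syslog
```

Because event types and tags are evaluated at search time, they can be changed or removed later without re-indexing anything.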
As Jack Coates said recently,
It helps to think of putting the data in as a totally separate action from taking it out.
With that, the story ends. But what about next steps? I pointed out CIM at the top, and that’s the key. Pick a data model from the docs which fits the type of data you are working with. If your data really doesn’t fit, that’s ok, make your own using the same techniques which we document, such as renaming fields, adding tags, and creating calculated fields and lookup tables as needed. The cool thing about these CIM data models is that they are additive. If your app tags an event as ‘mysyslog’, that won’t prevent another app from tagging that same event as ‘syslog2’.
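Here is a small sketch of what “additive” means in practice: two apps tagging the very same events without stepping on each other. The app and event type names are made up, and this assumes both apps share their knowledge objects globally so that each tag is visible from wherever the search is run.

```ini
# App A: eventtypes.conf
[appa_syslog]
search = sourcetype=syslog

# App A: tags.conf
[eventtype=appa_syslog]
mysyslog = enabled

# App B: eventtypes.conf
[appb_syslog]
search = sourcetype=syslog

# App B: tags.conf
[eventtype=appb_syslog]
syslog2 = enabled
```

With both apps installed, a given syslog event matches both event types and carries both tags, so `tag=mysyslog` and `tag=syslog2` each return it.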
And best of all? Searches based on either tag will work exactly as the author expects! In fact, your data might get some extra utility and value by conforming to the CIM data models which we publish. Splunk’s own Enterprise Security app, for example, leans heavily on CIM to produce its dashboards and other knowledge objects. This means that if you have a product which, let’s say, has the concept of authentication, and you tag its authentication data with, you guessed it, the “authentication” tag, all of a sudden that data shows up in your ES dashboards!
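For the authentication example, a minimal sketch might look like the following. The event type name, the search string, and the original field names (`login_name`, `status`) are all invented; only the `authentication` tag and the `user` and `action` field names come from the CIM Authentication data model, and a real add-on would need to check the docs for the full set of expected fields.

```ini
# eventtypes.conf (hypothetical: the search string is a placeholder)
[producta_authentication]
search = sourcetype=syslog ("login succeeded" OR "login failed")

# tags.conf
[eventtype=producta_authentication]
authentication = enabled

# props.conf (search-time field work; the source field names are assumptions)
[syslog]
FIELDALIAS-producta_user = login_name AS user
EVAL-action = case(status=="ok", "success", status=="denied", "failure")
```

The tag is what gets the events considered by the data model; the alias and calculated field are what make them line up with the field names the dashboards expect.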
Ok, no more talking. Let me leave you with the slides and recording from .conf14, where Jack Coates and Brian Wooden presented the breakout session entitled, “Building a Common Information Model (CIM) Compliant Technical Add-on (TA)”.