Notes on Splunk CIM

So you want to work with the Splunk Common Information Model, and you’re not sure where to start… developers first working with the CIM and Add-ons are sometimes confused by its minimalist design, particularly if they’re familiar with the broadly used Desktop Management Task Force CIM. Here’s some notes on the CIM’s design that hopefully will help clear things up. First, we’ll look at how it’s used, and then we’ll talk about why the Splunk CIM is designed the way that it is.

The Splunk CIM describes concepts via tags rather than entities via database columns, and the first thing to understand when you’re trying to work with it is the event type. Events are the raw material that we work with, including metric measurements or inventory reports, which are just another type of event to Splunk. Events don’t necessarily have neat definitions of what it all means though, so we use the event type to recognize events and tags to label them. The CIM’s data models then read tagged events and collect recognizable fields to model so that an app can more easily use these events. It’s also important to realize that a given data source might involve lots of data models. For instance, a network firewall will of course produce events for Network Traffic, but it’s also got something to say to Authentication, Change Analysis, Inventory, and Performance models. And if it’s a modern firewall with deep packet inspection capabilities, it’s probably got awareness that’s useful to the Web and Malware models too. You can follow this idea as deeply or shallowly as you need… for instance the firewall may generate alerts, but that doesn’t mean you have to model them to the Alerts data model in CIM if you don’t want. The takeaway here is that the Splunk CIM doesn’t look at a data source and build a data model to describe that source. Instead, it describes a loose set of concepts and lets people model their data to those concepts where it makes sense.

Let’s look at this concept more closely as an app developer. If I want to see something in an app that I’m building, I start at a use case, which depends on a series of ideas:

  1. obviously, I need a use case — the report, correlation search, key indicator, or alert that I want my user to see. This can be done in a lot of ways, but the easiest and most maintainable option is to write a search that tests against structured data for an expected outcome.
  2. that means the next thing I’ll want is a data model, or a structured representation of the attributes and objects that the use case is going to test. As David Wheeler wrote, “All problems in computer science can be solved by another level of indirection.” The extra effort of abstracting raw data to a data model is worthwhile because it provides a stable interface, so that I can deal with changes in the raw data by having an Add-on or set of Add-ons apply Splunk’s late-binding schema to the data sources. It also means that supporting the app is easier, because I can separate problems of “getting data in” from problems of “looking at the data”.
  3. so, I also need an Add-on, or a set of searches and regular expressions that will tag and name the raw data for use in the data model. Luckily there are lots of these on, and they’re easy to build as well.
  4. Technically I’m all done solving my problem now, but in Splunk supported apps we also write an eventgen config and a unit test, using samples of data that represent the condition that the use case is looking for. That way we know immediately when a logic error or platform change causes something to break.

When one first looks at this stack of action items, it’s easy to think that a given data model, such as Network Traffic, must therefore include every attribute that the use case might ever want. For instance, if I am doing a Cisco Security panel, I will want to report on the dozens of complex decisions that Cisco gear has made, not the simplistic Network Traffic model. Expanding the Network Traffic model to cover all the attributes available from all sources of network traffic would be missing the design goal of the CIM, in my opinion. The Splunk CIM is a least common denominator of all network traffic sources — it’s simplistic, and therefore slim. It’s easily used by people who aren’t deeply familiar with the underlying technologies, and because it’s built on Splunk, this least common denominator approach doesn’t lose anything from the rich raw data sources. In other words, CIM’s Network Traffic model can be used to understand Stream captures, NetFlow data, Cisco logs, and Check Point logs at the same time, and any interesting indicators can be followed up by drilling into the full fidelity data.

Of course, there are tradeoffs in using a least common denominator model, and it might be instructive to review a different approach such as the Desktop Management Task Force Common Information Model mentioned at the beginning of this post. The DMTF CIM is different from ours in that it’s hierarchical in nature, like a traditional database. In grossly simplified terms, there’s a master entity node like “computer” or “asset”, and then sub-nodes which describe things like “network” or “cdrom”. At the top is the ultimate node that everything inherits from, and everything must fit into an a priori entity model before you can work with it.

This means that each section of the DMTF CIM model is very complex. Quoting from the DMTF CIM tutorial: “The Device Model will not be reviewed in its entirety – because its scope is too large, addressing all the various aspects of hardware functionality, configuration and state. In fact, the Device Model can be broken down to individual components (cooling and power, processors, storage, etc.) that are managed individually.” This is necessary because most possible configurations have to be defined in the structure, and it can easily lead to implementations that contain leftover concepts like columns that describe the type of floppy drive installed in your servers.

By describing the least common denominator of concepts in individual models containing loosely structured knowledge objects, the Splunk CIM can keep it very simple and stupid, and our equivalent of Device Model is described in a single web page. In other words, we shift work from the model architect to the application developer because it allows that developer greater flexibility. By keeping the complexity down and using datamodels as pointers to raw data, we keep performance up, which is a good thing for everyone.

Jack Coates

Posted by