In prior blogs, I talked about Splunk’s native Open Cybersecurity Schema Framework (OCSF) support for Amazon Security Lake, where the data was already in OCSF form. Now, Splunk Edge Processor — a feature of Splunk’s Data Management suite of solutions — tackles another aspect of OCSF support: translating raw non-OCSF data sources into Splunk OCSF sourcetypes.
But first, if you aren’t familiar with it, Splunk Edge Processor is a relatively new feature set of Splunk Cloud Platform that, among other things, provides a built-in way to process data streams using the SPL2 language and to create pipelines and routes to multiple destinations. Those destinations of course include Splunk indexers, but also external data lakes. Currently, Edge Processor supports Amazon S3, and upcoming releases will add support for Azure destinations, such as Azure Blob Storage (ABS) and Azure Data Lake Storage (ADLS), and other storage options.
With the latest releases of Edge Processor (EP) and Splunk Cloud Platform (9.2.2406 and higher), SPL2 gains two new OCSF capabilities: the ocsf command and the to_ocsf eval function. More on these later.
Lastly, I get a lot of questions about Splunk’s Common Information Model: how it relates to OCSF, when to use which, and so on. I’ll explain my view on this, but I also want to mention here that there is an OCSF to CIM technical add-on freely available on Splunkbase. What it does and how it fits into Splunk Enterprise Security will also be explained.
From Splunk documentation: “Edge Processor is a Splunk product that allows you to process data using SPL2 before you send that data out of your network to external destinations. You use a Splunk-managed cloud service to deploy and manage on-premises Edge Processors at the edge of your network.”
In short, EP allows you to collect, process, and route data via pipelines from your data sources to Splunk indexes in your network or cloud, to your Splunk Cloud service, or to external storage such as Amazon S3. The diagram below illustrates this.
Data sources can be sent to EP via any of the Splunk forwarding methods, e.g., universal forwarders and heavy forwarders, HTTP Event Collector (HEC), or syslog. If those sources are already in OCSF form, you can use EP to do additional processing and routing via one or more pipelines. If the data is not in OCSF form, EP’s new SPL2 OCSF capabilities can perform supported translations.
Note that Splunk uses the term “field” to refer to a name/value pair extracted from raw data (or added to it, like the Splunk default fields such as sourcetype). A field in Splunk, then, is a field name paired with a field value, and field values can be alphanumeric strings or integers.
OCSF uses the term “attribute” to refer to a name defined in its framework dictionary, whose values can be of many different OCSF data types, including OCSF object types. OCSF is agnostic to encodings, but with EP, attribute names become property names in a JSON event; their values can be scalars or arrays, or they can be OCSF objects whose values become nested JSON objects. For data lake storage or wire protocols, attribute names are column names in a Parquet file or field names in a Protocol Buffer (protobuf) message. OCSF attributes can become extracted Splunk fields when stored in a Splunk index.
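To make these terms concrete, here is a hypothetical, heavily abbreviated OCSF Network Activity event encoded as JSON; all values are invented for illustration. Scalar attributes, a nested object attribute (src_endpoint), and the array-valued observables attribute (discussed below) all appear as JSON properties:

{
  "class_uid": 4001,
  "class_name": "Network Activity",
  "activity_id": 1,
  "severity_id": 1,
  "time": 1718000000000,
  "src_endpoint": { "ip": "10.1.2.3", "port": 51334 },
  "observables": [
    { "name": "src_endpoint.ip", "type_id": 2, "value": "10.1.2.3" }
  ]
}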
Splunk SPL2 streaming commands operate on a stream of events resulting from a search or as they pass through a pipeline. SPL2 generative commands create the set of events that will be streamed and processed. search or from are examples of common generative commands, while eval is a streaming command.
ocsf is a streaming command while to_ocsf is an eval function, a function used inside an eval streaming command; they both process data in the stream on an event-by-event basis.
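As a minimal sketch of the difference (with $source and $destination as placeholders bound when the pipeline is applied, just like in the full examples later in this post):

// from is generative: it creates the stream of events to process.
// where and eval are streaming: they act on each event as it passes through.
$pipeline = | from $source
    | where isnotnull(_raw)
    | eval processed_by = "edge_processor"
    | into $destination;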
Splunk’s sourcetype is a default field added to the raw data that is a combination of the classic source and type classification of data. It is indexed at ingest time and identifies the origin of the data as well as its format. This is very important for the OCSF translations, as each distinct sourcetype has a specific translation mapping.
If your data sources didn’t already conform to the OCSF schema, until now Splunk didn’t provide a way to convert them to OCSF form. There are a number of tools available these days that do a good job of that, and customers already use them to translate data to OCSF. We have also open-sourced our own Java translation tools.
Splunk has developed a number of popular translations using this open-source tooling. With EP, we are now making those translations built in and available to Splunk customers, in case they don’t have other tools already doing the processing.
We added two OCSF capabilities to SPL2: the ocsf command and the to_ocsf eval function. Both translate raw data sources to OCSF form, but in slightly different ways, depending on your pipeline processing requirements. Documentation for both can be found here: “Convert data to OCSF format using an Edge Processor.” They produce OCSF-formatted events encoded as JSON from a number of built-in event sources, and more event sources will be added every month with EP updates.
Edge Processor pipeline creation starts by selecting a sourcetype for the original input data. From the sourcetype, EP can determine which OCSF event class the input events should be translated to, parse the data, and do the mapping in stream, outputting the translated events as JSON. Other SPL2 commands can further process the events and ultimately determine the destination to which the pipeline delivers them.
The simplest approach to translating raw event sources to OCSF in EP is the ocsf streaming command, as it does much of the work for you and should cover most use cases. The ocsf command translates the stream of input events in an EP pipeline in place: each event enters the command as raw data in the _raw field with its original sourcetype, and exits with the translated OCSF JSON in that same single Splunk field. The original sourcetype is modified by prepending ocsf: to distinguish the translated format from the original raw format; the source is the same, but the type of the event is different. One benefit of prepending to the original sourcetype is that searching for all data in OCSF format is more efficient, since Splunk’s partial string matching is much faster when the match starts at the beginning of a string.
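For example, a search like the following (the index name is hypothetical) matches every OCSF-translated event regardless of its original source, and the shared ocsf: prefix means the wildcard comparison can anchor at the start of the sourcetype string:

index=security sourcetype="ocsf:*"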
A few options to the command are notable. With include_raw, you can keep a copy of the original data in the OCSF event’s raw_data attribute; alternatively, you can use the thru command to save a copy of the original raw data to a specified pipeline destination. With add_observables, the ocsf command can automatically surface the OCSF observables attribute from the translated OCSF event, a flat array of the most interesting attribute values from anywhere in the event. And with add_enum_siblings, it can automatically populate OCSF enum sibling names per the schema definitions, so you can use string labels instead of integers in your searches.
Here is an example using the ocsf SPL2 command, modified from this documentation. The pipeline uses the thru command to split off and route a copy of the input data to S3, streams the same input data through the translation command based on the sourcetype of $source, and routes the result to a Splunk index. Explicitly setting the sourcetype is usually not necessary but is shown here for illustration. The resulting sourcetype will be “ocsf:cisco:asa.”
import ocsf from /splunk.ingest.commands

$pipeline = | from $source
    | thru [ | into $S3 ]
    | ocsf sourcetype="cisco:asa" include_raw=false add_enum_siblings=true add_observables=true
    | into $ocsf_index;
In case you don’t want to translate events in place (i.e., overwrite _raw), or the input event data is not in the _raw field, the to_ocsf eval function translates the source event data from a field you specify and produces a new event formatted as OCSF JSON. Neither the _raw field nor the raw data’s sourcetype is modified, and the function does not automatically set a new sourcetype for the resulting OCSF event data; you need to use an SPL2 eval command to prepend ocsf: as part of a new output sourcetype name. Most users won’t need this level of control, though.
Because to_ocsf is an eval function rather than a streaming command itself, it takes arguments that equate to the options of the ocsf streaming command:
| eval <output_field> = to_ocsf(<source_field>, <sourcetype>, <include_raw>, <add_enum_siblings>, <add_observables>)
The last three arguments are booleans, corresponding to the ocsf command options include_raw, add_enum_siblings, and add_observables.
Here is an example of using the to_ocsf SPL2 eval function. In this example, the pipeline isn’t split into two destinations; all data is sent to the configured S3 bucket, and the raw input event is included with each OCSF-translated event. The data is pulled from the internal _raw Splunk field, but since EP parses the data in stream, any field extracted from the original raw events could have been used, with that field name passed as the first parameter. A new conforming sourcetype value is constructed and becomes the sourcetype for the OCSF output events.
import to_ocsf from /splunk.ingest.commands

$pipeline = | from $source
    | eval ocsf_formatted_data = to_ocsf(_raw, "cisco:asa", true, true, true)
    | eval sourcetype = "ocsf:cisco:asa"
    | into $S3;
Note that eval functions also support named parameters:
import to_ocsf from /splunk.ingest.commands

$pipeline = | from $source
    | eval ocsf_formatted_data = to_ocsf(_raw, "cisco:asa", include_raw=true, add_enum_siblings=true, add_observables=true)
    | eval sourcetype = "ocsf:cisco:asa"
    | into $S3;
First, let’s cover some background on Splunk data models and the Common Information Model (CIM).
Data models are a type of knowledge object specific to Splunk. They aren’t schemas in the pure sense, but they have schema-like characteristics that are very beneficial to certain Splunk functions, including field normalization across event sources and detection rules. Datasets are hierarchical subsets of fields within a data model.
CIM is a set of data models geared toward detection rule logic that can be applied at search time, no matter what form of raw data is stored in Splunk indexes. CIM delivers data models such as Authentication, Alert, Network, Malware, Endpoint, and others via the CIM technical add-on available on Splunkbase.
To ingest data that supports CIM data models, there needs to be a technical add-on (TA) for each data source, or Splunk sourcetype, which not only identifies and collects the data, but also extracts the fields and translates and aliases them to the desired datasets within the appropriate CIM models. The extraction and transformation are usually done at search time, via workloads running on Splunk search heads; you can think of this loosely as the analog of the EP pipeline translation process. Third parties can write their own TAs with CIM mappings, as sketched below.
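As a minimal sketch of what such a search-time CIM mapping can look like inside a TA’s props.conf (the sourcetype, field names, and logic here are hypothetical, invented for illustration):

# props.conf in a hypothetical TA for a made-up sourcetype
[acme:firewall]
# Alias the vendor-specific user field to the CIM field "user"
FIELDALIAS-acme_user = acme_user_name AS user
# Derive the CIM "action" field from a vendor result code at search time
EVAL-action = if(result_code == 0, "success", "failure")

Because these are search-time operations, nothing in the index changes; the CIM-compliant fields simply appear whenever the data is searched.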
The data model subsystem of Splunk is aware of the data models that are available, and can automatically populate the .tsidx portion of the data model indexes repeatedly as new data arrives in an index, a process called data model acceleration (DMA).
In particular, when data models are accelerated, the extracted, translated fields are serialized to the Splunk index tsidx files to improve search performance, as if the data adhered to a schema when ingested. Special SPL search expressions can take advantage of the tsidx key-value storage of the extracted event fields. Note that the internal and default Splunk fields such as sourcetype, source, host, index, and timestamp are extracted at index-time.
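The tstats command is one such expression. As an illustrative sketch (the index contents and field values are hypothetical), this search counts failed authentication events by source directly from the accelerated CIM Authentication data model, without scanning the raw events:

| tstats count from datamodel=Authentication where Authentication.action="failure" by Authentication.src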
Splunk Enterprise Security includes the CIM add-on, as it relies on CIM for its dashboards and Asset & Identity framework. The Splunk ES Content Update app has detection rules that depend on CIM data models too, so both apps enhance and extend the capabilities of the Enterprise Security product.
OCSF events, whether directly ingested into Splunk or via EP translations, look like just another sourcetype. For backwards compatibility with Splunk products such as Splunk Enterprise Security, they can be translated to conform to CIM data models using the free OCSF-CIM add-on, and can be accelerated for better performance of Enterprise Security detection rules and built-in dashboards.
The OCSF-CIM Add-On is aware of the sourcetype convention where the sourcetype starts with ocsf:. This is important because that is how the configuration page of the app finds compatible sourcetypes to translate to CIM models. There is also an extra indexer stanza needed in your props.conf file that you can find in the documentation.
There are some limitations on the mappings, since OCSF events are high-fidelity, complex events, while each CIM data model covers only a narrow aspect of an event for the purposes of particular detection rules or dashboards. For example, you must choose a single data model even though an OCSF event might map to multiple CIM data models. Consider a Zeek alert, where network and alert information is sent via one of the OCSF Network Activity classes with the Security Control profile applied: either the Network or the Alert data model must be selected.
You can find more information about the OCSF-CIM Add-On for Splunk at “Working with OCSF formatted data in the Splunk platform and Splunk Enterprise Security,” and you can find it for download on Splunkbase.
Whether you are using external tools to do OCSF translations or EP’s new OCSF streaming command and eval function, you might still want to use EP to do additional processing and routing via pipelines: sending data to different Splunk indexes based on criteria, for example, or splitting destinations so that high-value data goes to Splunk indexes while other data is stored in external cloud storage for data retention, threat hunting, and investigation. If you route data to S3, you can use Federated Search for Amazon S3 to search that data directly from Splunk.
The built-in event sources are limited to those documented in “Supported source types and event types,” and more will be added each month with EP updates. The advantage of EP’s built-in translations is that they work out of the box, without you having to write any translations yourself. You can also write your own translations with the open-source Java tools mentioned above, or use popular third-party products and tools now on the market.