For someone who has done her fair share of shopping, I really had no idea. Matty Settipane, one of our superstar Professional Services Consultants, posed this question to me recently so I did some experimenting.
A while ago, I published a guide on taming sourcetypes. It turns out, Scenario 4 is not as rare as originally thought. It is the case where a single stream of data contains multiple types of events. Splitting the stream into multiple sourcetypes requires the use of an index-time transform to rewrite the sourcetype metadata field. How much does each transform affect indexing performance? Is the degradation linear or exponential? Let’s find out.
This isn’t scientific, but serves as a fuzzy test. In lieu of a live stream of data or set of mixed events, I used a 515 MB sample log of Apache web access events. I shut down all the running programs on my MacBook Pro (yes, even chat) and set up a series of transforms to assign events to sourcetypes based on status code. Of course, there is no practical value for this configuration, but it does exercise the core functionality of applying transforms.
Configuration and Test Parameters
The configuration is simple. In props.conf, I set up a series of transforms. Splunk will apply every transform in the order listed when indexing each event.
SHOULD_LINEMERGE = false
TRANSFORMS-test = setSourcetype200,setSourcetype400,setSourcetype404 ...
All transforms are defined in transforms.conf with similar patterns:
REGEX = HTTP/1\.\d"\s+200
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::access_common_200
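To see why each added transform costs something, here is a minimal Python sketch of the idea, not Splunk's actual implementation. Only the 200 pattern and its target sourcetype come from the configuration above. The 400 and 404 entries, the default sourcetype `access_common`, and the sample log line are assumptions made up for illustration.

```python
import re

# Hypothetical transform table. Only the first entry is taken from the post;
# the 400 and 404 patterns and sourcetype names are assumed by analogy.
TRANSFORMS = [
    ("setSourcetype200", re.compile(r'HTTP/1\.\d"\s+200'), "access_common_200"),
    ("setSourcetype400", re.compile(r'HTTP/1\.\d"\s+400'), "access_common_400"),
    ("setSourcetype404", re.compile(r'HTTP/1\.\d"\s+404'), "access_common_404"),
]


def assign_sourcetype(event, default="access_common"):
    """Test every transform in listed order; a match rewrites the sourcetype.

    Returns the final sourcetype and the number of regexes evaluated.
    The default sourcetype name is an assumption for this sketch.
    """
    sourcetype = default
    regex_evaluations = 0
    for _name, pattern, target in TRANSFORMS:
        regex_evaluations += 1  # each transform is tested whether it matches or not
        if pattern.search(event):
            sourcetype = target
    return sourcetype, regex_evaluations


# Made-up Apache access event with a 404 status code.
sample = '10.0.0.1 - - [01/Jan/2011:00:00:00 -0800] "GET /index.html HTTP/1.1" 404 512'
print(assign_sourcetype(sample))
```

For the sample event this prints `('access_common_404', 3)`: all three regexes are evaluated even though only one matches, so the per-event work grows with the length of the transform list.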
For each transform count tested, the same sample log file was indexed 5 times and the indexing times were averaged. The parameters of the test:
Sample Size: 539,898,802 bytes (~515 MB)
Event Count: 1,594,405 events
Avg Event Size: 338.6208661 bytes
Presented below is the data charted using Splunk. You didn’t think I was going to use Excel for this, did you?
Here’s my armchair analysis: there is a slight but noticeable downward trend with each added transform.
- There is no significant difference between automatic sourcetyping (no sourcetype set in inputs.conf) and manual sourcetyping with 0 transforms.
- The biggest difference is going from 0 to 1 transform.
- Each transform costs roughly 1% in performance.
- The data points don’t match the slope of a 1% decrease exactly, but they are somewhat close.
- The degradation is linear, not exponential.
- A 1% decrease in practical terms equates to several GBs less indexing capacity per day under a sustained EPS rate.
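The "several GBs" figure can be checked with a few lines of arithmetic. This is a sketch of the linear 1%-per-transform model only, not of the measured data. The baseline of 430.285968 GB/day is the constant used in the chart search at the end of this post; the function and variable names are mine.

```python
# Baseline GB/day with 0 transforms, taken from the constant in the chart search.
BASELINE_GB_PER_DAY = 430.285968


def linear_capacity(transforms, cost_per_transform=0.01):
    """Modeled daily capacity if each transform costs a fixed 1% of the baseline."""
    return BASELINE_GB_PER_DAY * (1 - transforms * cost_per_transform)


# Capacity given up per transform under this model, in GB/day.
gb_lost_per_transform = BASELINE_GB_PER_DAY * 0.01

for n in (0, 1, 5, 10):
    print(n, round(linear_capacity(n), 2))
print("GB/day lost per transform:", round(gb_lost_per_transform, 2))
```

Under this model each transform gives up roughly 4.3 GB of daily indexing capacity, which is where "several GBs" comes from.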
Considering Splunk 4.2 delivers 2x indexing capacity, a 1% throughput degradation is a nominal price to pay for sourcetypes whipped into submission. Keep in mind this analysis only applies to sourcetyping which requires transforms. Sourcetypes set in inputs.conf or props.conf are not included.
The nature of the regular expression in the transform does matter in this equation, but I wouldn’t expect real-world configurations to differ enough from the ones used in this test to affect the results significantly. But then again, this is an informal benchmarking exercise. I will defer more scientific testing to Splunk performance testing professionals. This is just something to chew on in case sourcetypes happen to appear on the Price Is Right.
P.S. If you are curious about the searches driving the table and graph above…
Effect of Transforms on Indexing Throughput
Each metric is computed using some simple conversion math on the test parameters.
> sourcetype=transformstest | eval eps=events/indexing_time | eval mbps=sample_size_bytes/indexing_time/1024/1024 | eval gbpd=eps*86400*(sample_size_bytes/events)/1024/1024/1024 | chart eval(avg(indexing_time)) as "Average Indexing Time" eval(avg(eps)) as "Average EPS" eval(avg(mbps)) as "MB/Second" eval(avg(gbpd)) as "GB/Day" eval(430.285968*(1-max(transforms)/100)) as "1% Decrease in GB/Day" by transforms | rename transforms as Transforms
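For anyone who wants to check the conversion math outside of Splunk, here is a small Python sketch that mirrors the three eval expressions in the search above. The sample size and event count are the test parameters from this post. The 100-second indexing time is a hypothetical placeholder, not a measured result.

```python
# Test parameters from the post.
SAMPLE_SIZE_BYTES = 539_898_802
EVENTS = 1_594_405


def throughput(indexing_time_s):
    """Mirror the eps, mbps and gbpd eval expressions for one indexing run."""
    eps = EVENTS / indexing_time_s
    mbps = SAMPLE_SIZE_BYTES / indexing_time_s / 1024 / 1024
    # Events per second * seconds per day * average event size, converted to GB.
    gbpd = eps * 86400 * (SAMPLE_SIZE_BYTES / EVENTS) / 1024 / 1024 / 1024
    return {"eps": eps, "mbps": mbps, "gbpd": gbpd}


# 100.0 seconds is a made-up indexing time used only to exercise the formulas.
print(throughput(100.0))
```

Note that `gbpd` reduces to `mbps * 86400 / 1024`, since multiplying EPS by the average event size gives bytes per second again.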
Trend of Average Indexing Time
The linear trend is charted using a macro which takes 2 input parameters, fields to plot the x (# of transforms) and y (indexing time) coordinates.
> sourcetype=transformstest | `lineartrend(transforms,indexing_time)` | chart eval(avg(indexing_time)) as "Average Indexing Time" avg(newY) as "Linear Trend" by transforms
For details and how to configure the lineartrend macro, see the Splunk Community post Plotting a linear trendline.
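The macro's definition is not reproduced here, so the following is only a sketch of what a linear trendline typically computes: an ordinary least-squares fit of y against x. I am assuming that is what `lineartrend` does and that `newY` holds the fitted values. The data points in the example are made up, not the measured indexing times.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit; returns slope, intercept and fitted y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [slope * x + intercept for x in xs]  # analogous to the newY field
    return slope, intercept, fitted


# Hypothetical data: number of transforms vs. indexing time in seconds.
transforms = [0, 1, 2, 3]
indexing_time = [100.0, 101.0, 102.0, 103.0]
print(linear_trend(transforms, indexing_time))
```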
Measuring the Indexing Time
First, find the first and last events in the file, then calculate the difference in the index times.
> sourcetype="access_common*" (18.104.22.168 AND 43986) OR (22.214.171.124 31454) | stats range(_indextime)