
This is a pretty common question in Splunkland. Maybe you’re an admin wondering how much license you’ll need to handle this new data source you have in mind for a great new use case. Or you’re a Splunker trying to answer this question for a customer. Or a partner doing the same. Given how often this comes up, I thought I’d put together an overview of all the ways you can approximate how big a license you need, based on a set of data sources. This post brings together the accumulated wisdom of many of my fellow sales engineers, so before I begin, I’d like to thank them all for the insights and code they shared so willingly. Thank you for your help, everyone!
- Ask someone how much data there is
- Measure the data
  - At the original source
  - Outside the original data source
- Estimate the size of the data
  - Bottom-up, based on “samples” of data from various sources
  - Top-down, based on total data volumes of similar organizations
Let’s take a look at each of these in turn.
Ask someone how much data there is
Measure the data
Measuring at the source
Measuring outside the source
Get a trial license
Don’t touch it on the way in
Make sure that retention on the “_internal” indexes is long enough for the estimation period.
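A quick way to sanity-check that retention (this search is my own sketch, not one of the searches from the appendix below) is to look at the earliest license-usage event still sitting in _internal:

index=_internal source=*license_usage.log
| stats min(_time) AS first_event
| eval first_event=strftime(first_event, "%Y-%m-%d %H:%M:%S")

If first_event is more recent than the start of your estimation window, any measurement against _internal will under-count.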
Throw it away afterwards
# Use a dummy index so you don't hose your main index in case you're doing
# this on an existing Splunk installation.
[devnull]
# Location of Hot/Warm Buckets
homePath = $SPLUNK_DB/devnull/db
# Location of Cold Buckets
coldPath = $SPLUNK_DB/devnull/colddb
# Location of Thawed Buckets
thawedPath = $SPLUNK_DB/devnull/thaweddb
# Maximum hot buckets that can exist per index.
maxHotBuckets = 1
# The maximum size in MB for a hot DB to reach before a roll to warm is triggered.
maxDataSize = 3
# The maximum number of warm buckets.
maxWarmDBCount = 1
# Roll the indexed data to frozen after 1 day. (86400 = 60 secs * 60 min * 24 hours)
frozenTimePeriodInSecs = 86400
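While you're at it, point the test input at that index. Here's a minimal inputs.conf sketch to go with the stanza above; the monitor path and sourcetype are placeholders I made up, so swap in whatever your data source actually looks like:

# inputs.conf -- hypothetical example; adjust the path and sourcetype to match your data source
[monitor:///var/log/sample_source.log]
# Send the trial data to the throwaway index defined above
index = devnull
sourcetype = my_sample_sourcetype
disabled = false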
Estimate the size of the data
Estimating based on data samples
- Logging severity levels. Say you’re talking about Cisco devices. There’s a big difference in volume between logging at the lowest level (0, emergency) and the highest (7, debugging; more realistically 6, informational). Are you logging simply which flows are allowed/denied, or going further up the chain to user activity?
- What the device is doing. Logs are a reflection of work performed. If it’s a firewall, how much traffic is it seeing? The more it does, the more it will log, even if the severity level stays the same. Simply knowing how many firewalls you have isn’t enough, though that’s often the first piece of information that comes back. How many flows are they seeing per second? Couple that with something like a NetFlow calculator if you’re looking at NetFlow data, and you start to get somewhere (see the back-of-the-envelope sketch after this list).
- Services enabled.
- Types of events that are being logged. (Side note: are you logging everything you should to help you in case of a security breach? Check out some excellent guidance from the NSA, and Malware Archaeology’s great talk at the 2015 Splunk .conf user conference.)
- How many things get logged. Say it’s Tripwire data. Change notifications depend on how many Tripwire agents there are in the environment and what rules are set to fire against them.
- What’s IN a thing that gets logged. Say it’s Tripwire again. If you’re doing a lot with Windows GPO changes or AD changes, your change notifications get bigger.
- Custom logs. You can add your own pieces to existing logs sometimes, such as with a BlueCoat proxy.
- How long you’re measuring for. Volumes can ebb and flow over the course of a week.
- Where the data is coming from. Some data sources are simply more talkative than others; Carbon Black, for example.
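To make this concrete, here’s a rough back-of-the-envelope calculation done as a runnable search (the 500 events/sec and 350 bytes/event figures are placeholders I picked for illustration, not measurements from any real device):

| makeresults
| eval events_per_sec=500, avg_event_bytes=350
| eval daily_GB=round(events_per_sec * avg_event_bytes * 86400 / 1024 / 1024 / 1024, 2)
| table events_per_sec, avg_event_bytes, daily_GB

With those placeholder numbers, a single device emitting 500 events per second at roughly 350 bytes per event works out to about 14 GB/day, before you multiply across all the devices in the environment.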
Estimating based on the usage of similar-sized organizations
Appendix: Searches for measuring data ingested
****************************************************************************
Using Raw Data Sizing and a Custom Search Base

These searches use the eval len() function to get the size of the raw event,
using a custom base search for a specific type of data.
****************************************************************************

NOTE: Just replace "EventCode" and "sourcetype" with values corresponding to
the type of data that you are looking to measure.

=====================================================
Simple Searches:
=====================================================

Indexed Raw Data Size by host By Day:
-------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host
| eval evt_bytes = len(_raw)
| timechart span=1d sum(eval(evt_bytes/1024/1024)) AS TotalMB by host

Indexed Raw Data Size by sourcetype By Day:
-------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, sourcetype
| eval evt_bytes = len(_raw)
| timechart span=1d sum(eval(evt_bytes/1024/1024)) AS TotalMB by sourcetype

Indexed Raw Data Size by Windows EventCode By Day:
--------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, EventCode
| eval evt_bytes = len(_raw)
| timechart span=1d limit=10 sum(eval(evt_bytes/1024/1024)) AS TotalMB by EventCode useother=false

Avg Event count/day, Avg bytes/day and Avg event size by sourcetype:
--------------------------------------------------------------------
index=_internal kb group="per_sourcetype_thruput"
| eval B = round((kb*1024),2)
| stats sum(ev) as count, sum(B) as B by series, date_mday
| eval aes = (B/count)
| stats avg(count) as AC, avg(B) as AB, avg(aes) as AES by series
| eval AB = round(AB,0)
| eval AC = round(AC,0)
| eval AES = round(AES,2)
| rename AB as "Avg bytes/day", AC as "Avg events/day", AES as "Avg event size"

Avg Event count/day, Avg bytes/day and Avg event size by source:
----------------------------------------------------------------
index=_internal kb group="per_source_thruput"
| eval B = round((kb*1024),2)
| stats sum(ev) as count, sum(B) as B by series, date_mday
| eval aes = (B/count)
| stats avg(count) as AC, avg(B) as AB, avg(aes) as AES by series
| eval AB = round(AB,0)
| eval AC = round(AC,0)
| eval AES = round(AES,2)
| rename AB as "Avg bytes/day", AC as "Avg events/day", AES as "Avg event size"

=====================================================
Combined Hosts and Sourcetypes:
=====================================================

Top 10 hosts and Top 5 sourcetypes for each host by Day:
--------------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host, sourcetype
| eval evt_bytes = len(_raw)
| eval day_period=strftime(_time, "%m/%d/%Y")
| stats sum(evt_bytes) AS TotalMB, count AS Total_Events by day_period, host, sourcetype
| sort day_period
| eval TotalMB=round(TotalMB/1024/1024,4)
| eval Total_Events_st=tostring(Total_Events,"commas")
| eval comb="| - (".round(TotalMB,2)." MB) for ".sourcetype." data"
| sort -TotalMB
| stats list(comb) AS subcomb, sum(TotalMB) AS TotalMB by host, day_period
| eval subcomb=mvindex(subcomb,0,4)
| mvcombine subcomb
| sort -TotalMB
| eval endcomb="|".host." (Total - ".round(TotalMB,2)."MB):".subcomb
| stats sum(TotalMB) AS Daily_Size_Total, list(endcomb) AS Details by day_period
| eval Daily_Size_Total=round(Daily_Size_Total,2)
| eval Details=mvindex(Details,0,9)
| makemv delim="|" Details
| sort -day_period

Top 10 Hosts and Top 5 Windows Event IDs by Day:
------------------------------------------------
sourcetype=WinEventLog:*
| fields _raw, _time, host, EventCode
| eval evt_bytes = len(_raw)
| eval day_period=strftime(_time, "%m/%d/%Y")
| stats sum(evt_bytes) AS TotalMB, count AS Total_Events by day_period, host, EventCode
| sort day_period
| eval TotalMB=round(TotalMB/1024/1024,4)
| eval Total_Events_st=tostring(Total_Events,"commas")
| eval comb="| - (".round(TotalMB,2)." MB) for EventID- ".EventCode." data"
| sort -TotalMB
| stats list(comb) AS subcomb, sum(TotalMB) AS TotalMB by host, day_period
| eval subcomb=mvindex(subcomb,0,4)
| mvcombine subcomb
| sort -TotalMB
| eval endcomb="|".host." (Total - ".round(TotalMB,2)."MB):".subcomb
| stats sum(TotalMB) AS Daily_Size_Total, list(endcomb) AS Details by day_period
| eval Daily_Size_Total=round(Daily_Size_Total,2)
| eval Details=mvindex(Details,0,9)
| makemv delim="|" Details
| sort -day_period

**************************************************************************
Licensing/Storage Metrics Source

The searches below look at the internally collected licensing/metrics logs and
the introspection index. License usage is recorded in license_usage.log, which
is indexed into the _internal index.
**************************************************************************

============================================
Splunk Index License Size Analysis
============================================

Percent used by each index:
---------------------------
index=_internal source=*license_usage.log type=Usage
| fields idx, b
| rename idx AS index_name
| stats sum(eval(b/1024/1024)) as Total_MB by index_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table index_name, Percent_Of_Total, Total_MB, Overall_Total_MB

Total MB by index, Day – Timechart:
-----------------------------------
index=_internal source=*license_usage.log type=Usage
| fields idx, b
| rename idx as index_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by index_name

============================================
Splunk Sourcetype License Size Analysis
============================================

Percent used by each sourcetype:
--------------------------------
index=_internal source=*license_usage.log type=Usage
| fields st, b
| rename st AS sourcetype_name
| stats sum(eval(b/1024/1024)) as Total_MB by sourcetype_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table sourcetype_name, Percent_Of_Total, Total_MB, Overall_Total_MB

Total MB by sourcetype, Day – Timechart:
----------------------------------------
index=_internal source=*license_usage.log type=Usage
| fields st, b
| rename st as sourcetype_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by sourcetype_name

============================================
Splunk Host License Size Analysis
============================================

Percent used by each host:
--------------------------
index=_internal source=*license_usage.log type=Usage
| fields h, b
| rename h AS host_name
| stats sum(eval(b/1024/1024)) as Total_MB by host_name
| eventstats sum(Total_MB) as Overall_Total_MB
| sort -Total_MB
| eval Percent_Of_Total=round(Total_MB/Overall_Total_MB*100,2)."%"
| eval Total_MB = tostring(round(Total_MB,2),"commas")
| eval Overall_Total_MB = tostring(round(Overall_Total_MB,2),"commas")
| table host_name, Percent_Of_Total, Total_MB, Overall_Total_MB

Total MB by host, Day – Timechart:
----------------------------------
index=_internal source=*license_usage.log type=Usage
| fields h, b
| rename h as host_name
| timechart span=1d limit=20 sum(eval(round(b/1024/1024,4))) AS Total_MB by host_name

============================================
Splunk Index Storage Size Analysis
============================================

Storage Size used by each non-internal index:
---------------------------------------------
index=_introspection component=Indexes NOT(data.name="Value*" OR data.name="summary" OR data.name="_*")
| eval "data.total_size" = 'data.total_size' / 1024
| timechart span=1d limit=10 max("data.total_size") by data.name

Storage Size used by each internal index:
-----------------------------------------
index=_introspection component=Indexes (data.name="Value*" OR data.name="summary" OR data.name="_*")
| eval "data.total_size" = 'data.total_size' / 1024
| timechart span=1d limit=10 max("data.total_size") by data.name
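One rollup I’d add on top of the license-usage searches above (my own sketch, built on the same license_usage.log data): the average and peak daily ingest, which are the two numbers you typically quote when sizing a license.

Average and Peak daily license usage:
-------------------------------------
index=_internal source=*license_usage.log type=Usage
| bin _time span=1d
| stats sum(b) AS bytes by _time
| eval Total_GB=round(bytes/1024/1024/1024,2)
| stats avg(Total_GB) AS Avg_Daily_GB, max(Total_GB) AS Peak_Daily_GB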
----------------------------------------------------
Thanks!
Vishal Nakra