
From an operating-system perspective, splunk is a system of programs that work together to provide the utility that users experience. Each of these programs has its own memory use patterns, and having some idea of them helps when investigating memory exhaustion or performance problems, as well as with resource planning.
The involved parties in the splunk memory picture are:
- the operating system
- splunkweb
- splunkd
Programs launched by splunkd:
- splunk-search
- python search processors
- splunk-optimize
- scripted inputs such as wmi, imap, regmon, admon, vmware, or your own custom-built agents
- scripted alerts
- index management scripts (warmtocold, coldtofrozen)
- scripted auth
Many of these (especially the scripts) are largely external to splunk, in that splunkd runs them as requested, but their resource consumption is up to third party authors, external designs, or external factors. The size of these tools will not be covered in great detail.
Operating system
The operating system is expected to provide an efficient data cache for splunk data files, including:
- splunk binaries
- web assets
- config files
- indexed data files
- input log files
- etc
Since memory access is several orders of magnitude faster than disk access, a healthy splunk system should have a significant amount of memory unallocated by any process at most times. A good ballpark ratio is half or more of the ram free for caching purposes. A corollary is that your operating system should be making use of all of your memory, with the “free” portion put to work as cache.
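If you want to check that ratio, a small sketch like the following can report how much memory the operating system is using as file cache. It relies on the third-party psutil module, and the buffers/cached fields it reads are Linux-specific, so treat it as an assumption-laden example rather than a universal recipe.

    # rough check of OS file cache vs. total memory (Linux; requires psutil)
    import psutil

    vm = psutil.virtual_memory()
    total_mb = vm.total / (1024.0 * 1024.0)
    # on Linux, psutil exposes page-cache and buffer figures directly
    cache_mb = (getattr(vm, "cached", 0) + getattr(vm, "buffers", 0)) / (1024.0 * 1024.0)

    print("total: %7.0f MB" % total_mb)
    print("cache: %7.0f MB (%.0f%% of total)" % (cache_mb, 100.0 * cache_mb / total_mb))
    # a healthy splunk box should show a large fraction of memory acting as cache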
General memory info
When measuring the memory use of actual programs, always remember to review the real memory usage, not the virtual. Real memory usage is sometimes called “RSS”, “RSIZE”, or “in core”. On Windows the closest approximation is “Private working set”. This can be a bit misleading, as a system under heavy memory pressure will page out more of the memory allocated to programs. Therefore it’s best to first get a sense of overall system memory pressure before reviewing process sizes.
(There are other misleading factors, too; it’s generally a bad idea to measure dissimilar programs simply by RSIZE to gauge their ‘bloat factor’. If you care about this sort of thing you might be interested in smem: http://www.selenic.com/smem/ on Linux.)
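For a concrete example, here is a small sketch that compares real (RSS) and virtual (VSZ) sizes for splunk-related processes. It uses the third-party psutil module, and the substring match on process names is only an assumption about how the processes appear on your system.

    # snapshot of real vs. virtual memory for splunk-related processes
    import psutil

    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if "splunk" not in name.lower() or mem is None:
            continue
        print("%-20s rss=%8.1f MB   vsz=%8.1f MB"
              % (name, mem.rss / 1048576.0, mem.vms / 1048576.0))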
Splunkweb
Splunkweb, the python process, does need to buffer the data being immediately fed to the browser. For the most part, the ram requirements are modest (tens to perhaps 100 MB), but there are patterns that can push it up.
If you are displaying 50 items a page, splunkweb will have to acquire 50 items in an xml document from splunkd and then render an html fragment with these 50 items. Normally this isn’t very large, and the default view trims events to a fixed number of lines (to avoid breaking the browser). However, for odd cases (events containing lines that are tens or hundreds of kilobytes long) this could become significant per client.
Another example would be a case where you request display of the top 10,000 hostnames based on event quantity. splunkd will need to generate an xml document with 10k stanzas, which python will have to load and parse, and then generate html from the same.
Thus large display cases, multiplied by user concurrency, can be expected to make splunkweb grow. For so-called ‘pathological’ situations I’ve seen splunkweb grow by 200-300MB for one user.
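As a back-of-envelope illustration of that growth, the sketch below multiplies the number of items displayed by an assumed per-item size and an assumed overhead factor. None of these figures are measurements; they are placeholders to show how quickly the numbers add up.

    # rough splunkweb buffering estimate for a large display request
    items_displayed = 10000      # e.g. the "top 10,000 hostnames" case above
    avg_item_bytes = 2 * 1024    # assumed size of one rendered item
    overhead_factor = 3          # assumed: xml from splunkd + parsed python objects + html

    per_user_mb = items_displayed * avg_item_bytes * overhead_factor / (1024.0 * 1024.0)
    concurrent_users = 5
    print("roughly %.0f MB per user, %.0f MB across %d concurrent users"
          % (per_user_mb, per_user_mb * concurrent_users, concurrent_users))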
splunkd
splunkd has a few tasks in parallel:
- reading in data from various inputs
- processing data prior to indexing
- building indexed datastructures
- launching search requests and providing results, both interactively and scheduled.
- authenticating users
- possibly sending data outbound to other systems.
While all these tasks use memory, there are a few that dominate.
program baseline
splunkd is a big program. The program text itself will use some 30MB or more.
pipeline data
All pre-indexed data flowing toward the index on disk or toward network outputs lives in memory. Typically, for both forwarders and indexers, this data amounts to some tens of megabytes. On an indexer, the data size is proportional to event size. Thus if you have a majority of very large events (java exception backtraces, web page documents) then this data will grow proportionally.
Pipeline data can grow sharply when the system is not able to keep up with the dataflow for some reason. An extremely underutilized system will have 1-2 events in each FIFO queue, while a system that is behind will fill up to 1000 events in each FIFO queue. Thus you can grow from ~1MB of pipeline data to more like 20-40MB of pipeline data quickly in situations like disk bandwidth exhaustion, or a blocked downstream splunk instance.
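The arithmetic behind that swing looks roughly like the sketch below. The queue count, queue depth, and event size are all illustrative assumptions, not defaults pulled from any configuration.

    # when the pipeline backs up, its memory is roughly
    #   number of FIFO queues x events per queue x average event size
    num_queues = 6                # assumed number of FIFO queues in the pipeline
    events_per_queue = 1000       # queues fill toward ~1000 events when the system falls behind
    avg_event_bytes = 4 * 1024    # assumed average event size (fairly large events)

    backed_up_mb = num_queues * events_per_queue * avg_event_bytes / (1024.0 * 1024.0)
    print("backed-up pipeline data: roughly %.0f MB" % backed_up_mb)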
index structures
As part of making the data searchable, an index is built for it. This is built in memory and then flushed out when the memory buffer is full. Each index has an independent buffer.
In Splunk 3, the default per-index buffer was 10MB, while the default buffer for the main index was 100MB. Indexes added for significant data volumes would typically be given similarly large buffers, so a high volume server with two user-data indexes might have around 200+MB for indexing buffers.
In Splunk 4.0, the default per-index buffer is 5MB, while the default for the main user-data index is 20MB. A similar example on Splunk 4 would be more like 30-40MB for indexing buffers.
In both 3 and 4, if the number of indexing threads goes up, additional buffers are allocated for these additional threads. We strongly recommend against adjusting the number of threads.
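For a worked version of the examples above, with the index counts chosen purely for illustration:

    # rough indexing-buffer totals for the examples above (index counts are assumptions)
    # splunk 3: two busy user-data indexes whose buffers have been raised toward 100MB each
    splunk3_mb = 2 * 100
    # splunk 4: 20MB for the main index plus two user-data indexes at the 5MB default
    splunk4_mb = 20 + 2 * 5
    print("splunk 3 example: roughly %d+ MB of indexing buffers" % splunk3_mb)
    print("splunk 4 example: roughly %d MB of indexing buffers" % splunk4_mb)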
ldap authentication data
In splunk 3.x and 4.0.x, the responses to the defined LDAP searches that gather user information and group information are buffered in ram. In some cases, this can be quite large. Ideally these searches should be tuned to narrow the results down to only the necessary data. Splunk 4.1 will not buffer significant LDAP data.
searches
In splunk 3, searches live in splunkd ram. Approximately 100k events will result in memory allocations on the order of 1GB, or roughly 10KB per event.
In splunk 4, the only significant memory use for search will be generating xml descriptions of events. For splunkweb and well-behaved REST clients, this will be very small. It’s possible for a poorly behaved REST client to request extremely large documents which will kick this up.
splunk-search
Splunk-search (4.x+) runs all the operations requested by the search expression, including pulling data off disk, adding fields, sorting, timecharts, and so on. Some operations, like deduping, can use significant memory for large numbers of events, while a simple search does not. Thus, searches will vary from some tens of megabytes to multiple gigabytes.
If you have memory concerns about your expensive searches, it is best to try them and measure using top, ps, etc.
Obviously, you also have to consider the configured search quota, and how likely your users’ patterns are to make expensive searches overlap.
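A small sketch of that kind of measurement follows. It polls with the third-party psutil module and records the peak RSS of each search process while your test search runs; the substring match on the process name is an assumption about how the search processes appear on your system.

    # sample the peak RSS of running search processes for ~30 seconds
    import time
    import psutil

    peak_mb = {}
    for _ in range(30):
        for proc in psutil.process_iter(["pid", "name", "memory_info"]):
            name = (proc.info["name"] or "").lower()
            mem = proc.info["memory_info"]
            if "splunk-search" not in name or mem is None:
                continue
            rss_mb = mem.rss / 1048576.0
            peak_mb[proc.info["pid"]] = max(peak_mb.get(proc.info["pid"], 0.0), rss_mb)
        time.sleep(1)

    for pid, mb in sorted(peak_mb.items()):
        print("search process %d peaked at roughly %.0f MB RSS" % (pid, mb))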
Search processors
In addition to search processors that run natively inside the splunk-search executable, some search processors are written in python and will be spawned as external processes. Typically these are quite small, but if you have added processors of your own design they may be significant. Ideally these do not buffer any significant amount of data, but just read and write records as they go.
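If you are writing your own, the memory-friendly shape is the streaming one sketched below: read a record, emit a record, and never hold the whole result set in memory. The sketch assumes records arrive and leave as CSV on stdin/stdout, and the field it adds is purely hypothetical.

    # streaming-style processor sketch: process one record at a time
    import csv
    import sys

    reader = csv.DictReader(sys.stdin)
    out_fields = (reader.fieldnames or []) + ["error_count"]   # hypothetical added field
    writer = csv.DictWriter(sys.stdout, fieldnames=out_fields)
    writer.writeheader()

    for row in reader:
        # per-event work happens here; nothing is accumulated across events
        row["error_count"] = row.get("_raw", "").count("error")
        writer.writerow(row)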
splunk-optimize
From a memory perspective, splunk-optimize is usually a red herring. It looks big but its real footprint is far below that.
Splunk-optimize has the task of combining small .tsidx files (bucket components) into large ones. Depending upon the files combined, the resources can vary from extremely little to significant.
splunk-optimize maps the index files into memory, so the virtual size of this program will appear to be quite large. It then walks the source files in essentially linear order, faulting all of the files into the process space. However, since the memory access patterns are so linear, there will be little effective memory pressure produced by splunk-optimize, so the footprint should decrease dramatically when memory is tighter.
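To illustrate the general pattern (this is not splunk-optimize’s actual code, just a Python sketch of memory-mapped, linear file access): the virtual size jumps by the size of the mapped file immediately, but pages are faulted in as they are touched and, because the access is linear, they are cheap for the operating system to evict.

    # sketch: map a file and walk it linearly, the access pattern described above
    import mmap

    def checksum_mapped(path, chunk=1024 * 1024):
        total = 0
        with open(path, "rb") as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # touch pages strictly in order; resident pressure stays modest
                for offset in range(0, len(m), chunk):
                    total += sum(m[offset:offset + chunk])
            finally:
                m.close()
        return total

    # usage, with a hypothetical path:
    # print(checksum_mapped("/opt/splunk/var/lib/splunk/defaultdb/db/some_bucket/example.tsidx"))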
The rest of the tasks, including the various scripts, data gathering programs, alerting programs, and archiving scripts, are generally not significant. There are some notable exceptions:
- The 3.x vmware app. Written in Java, it’s a bit large, over 1 GB of ram typically.
- flatfileexport.sh – this coldtofrozen archive script invokes ‘exporttool’, which can be fairly memory hungry for 64-bit buckets. It may take as much as 2.5GB of ram.
- splunk-wmi – largely as a result of the Windows WMI subsystem that this program uses, the memory use of this tool grows with the number of categories it is pulling and with the number of hosts. This growth can be a problem if you gather data from a very large number of hosts, if you have a large number of custom eventlog categories, or both.
----------------------------------------------------
Thanks!
Joshua Rodman