Have you ever installed a Splunk Universal Forwarder and seen one or more of your Active Directory domain controllers have high CPU utilization as a result? Have you ever wondered how the Splunk Universal Forwarder translates the Security ID effortlessly into a real name you can read? In this blog post, I’m going to tell you exactly how we do the things we do with Active Directory and how you can improve the performance or reduce the load on your domain controllers.
There are two Windows pieces on the Universal Forwarder that deal with Active Directory. The first is known as admon. It emits information about your Active Directory Domain Services objects, both as a “dump” of the entire tree and as a monitor for changes. Admon uses the common ldap_* API calls that Microsoft provides to get both a baseline of the Active Directory database and the ongoing changes to that database.
The most common side effect of admon is that a large amount of memory is used while it baselines the Active Directory domain. This is normally a problem in the context of the Active Directory app, because the app turns admon on for every single domain controller. With a large database, you could see the splunk-admon.exe process balloon to several gigabytes on every single domain controller. However, we don’t actually need a copy of the admon changes from every single domain controller. We really only need it from one. If you are turning on admon purely for the Active Directory app, then you only need the changes, not the baseline. For those larger sites, I recommend the following:
- Turn OFF the admon data input within the TA-DomainController-*
- Enable the admon data input on just ONE domain controller (you can even add a domain controller just for this task)
To turn off the admon data input, edit the TA-DomainController-*\local\inputs.conf (depending on which version of the addon you use) to include the following:
    [admon://NearestDC]
    disabled = true
Then, on the one domain controller where you wish to enable admon, add the following to a local\inputs.conf file on that domain controller:
    [admon://ADMonitoring]
    targetDc = MyDC
    monitorSubtree = 1
    baseline = 0
    index = msad
    disabled = false
The targetDc value should be the NetBIOS host name for the domain controller that you are running admon on. We set baseline = 0 to not collect the baseline since we don’t need it. This will prevent admon from ballooning in memory utilization because of the large data capture. I know that some sites like fault tolerance in data collection, so feel free to add another domain controller to collect the duplicate data.
The other, not so obvious, place that touches the domain controller is the Windows Event Logs. A Windows Event Log entry consists of two parts: a structured data piece and a localized template for showing the “friendly” version. You can see the structured data as XML in the Event Viewer: just click on an event to open it up, click on the Details tab and select XML View. Sometimes, particularly in the Security event log, you will see a Security ID (SID) in the metadata (not in the message block). It’s something that starts with S-1-5-21-…. and it is this that gets translated by the WinEventLog modular input on the universal forwarder.
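To make the shape of those Security IDs concrete, here is a minimal Python sketch that splits the string form of a SID into its parts. It is only an illustration and not what the forwarder does internally: the `Sid` class and `parse_sid` helper are hypothetical names, the domain identifier numbers in the example are made up, and the rarely seen hexadecimal authority form is not handled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sid:
    """Hypothetical container for the parts of a SID string."""
    revision: int
    authority: int
    sub_authorities: Tuple[int, ...]

    @property
    def is_domain_account(self) -> bool:
        # S-1-5-21-... means NT Authority (5) followed by sub-authority 21,
        # the prefix used for domain and local machine account SIDs.
        return self.authority == 5 and self.sub_authorities[:1] == (21,)

    @property
    def rid(self) -> int:
        # The last sub-authority is the relative identifier (RID).
        return self.sub_authorities[-1]


def parse_sid(text: str) -> Sid:
    """Parse the 'S-<revision>-<authority>-<sub>-...' string form."""
    parts = text.strip().split("-")
    if len(parts) < 4 or parts[0].upper() != "S":
        raise ValueError(f"not a SID string: {text!r}")
    revision, authority = int(parts[1]), int(parts[2])
    sub_authorities = tuple(int(p) for p in parts[3:])
    return Sid(revision, authority, sub_authorities)


# Made-up domain identifier; 512 is the well-known RID for Domain Admins.
example = parse_sid("S-1-5-21-1004336348-1177238915-682003330-512")
print(example.is_domain_account, example.rid)
```

A SID such as S-1-5-18 (Local System) parses the same way but is not a domain account, which is why it can be resolved without asking a domain controller.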
To do this, we call LookupAccountSidW(), an API call that looks up a record based on the SID and returns details such as the domain and the sAMAccountName of the object that it represents. This API call has an optional parameter, lpSystemName, that allows us to pass in a domain controller name. In the WinEventLog configuration, you can set the domain controller that this call queries using the evt_dc_name parameter. What is of interest is what happens when you don’t set the evt_dc_name parameter (which is the default). In this case, the SID resolution is first attempted by the local system. If the local system does not find the SID, then we move on to a domain controller trusted by the local system. Inevitably, this will first be the domain controller that the computer is logged on to and, if the SID is not found there, the closest domain controller (usually within the AD site) that holds a copy of the global catalog.
There are some other code paths that crack open internal portions of the event data to do translations. In these cases, we first check whether we are bound to a domain controller, so that the lookup can be directed there. If the initial GUID or SID lookup fails, we use DsCrackNames(). Again, if evt_dc_name points to a domain controller that holds a copy of the global catalog, we can translate anything in the forest; otherwise you are limited to the local domain.
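The fallback order described above (local system first, then a domain controller) can be modelled with a small Python sketch. This is a toy model and not the forwarder's code: the lookup tables stand in for the real LookupAccountSidW() calls, the names `make_resolver` and `resolve_sid` are hypothetical, and returning the raw SID when nothing matches is an assumption made for the example.

```python
from typing import Callable, Dict, List, Optional, Sequence

Resolver = Callable[[str], Optional[str]]


def make_resolver(label: str, table: Dict[str, str], calls: List[str]) -> Resolver:
    """Build a stand-in resolver that records every time it is queried."""
    def resolve(sid: str) -> Optional[str]:
        calls.append(label)      # lets us see which system took the load
        return table.get(sid)    # None means "SID not found here"
    return resolve


def resolve_sid(sid: str, chain: Sequence[Resolver]) -> str:
    """Try each resolver in order and stop at the first hit."""
    for resolve in chain:
        account = resolve(sid)
        if account is not None:
            return account
    return sid  # assumption for this sketch: leave the SID untranslated


calls: List[str] = []
local_system = make_resolver("local", {"S-1-5-18": "NT AUTHORITY\\SYSTEM"}, calls)
# Made-up domain SID and account name, purely for illustration.
domain_controller = make_resolver(
    "dc", {"S-1-5-21-1-2-3-1104": "EXAMPLE\\jsmith"}, calls
)

print(resolve_sid("S-1-5-21-1-2-3-1104", [local_system, domain_controller]))
print(calls)
```

The point of the model is the call log: a SID that the local system knows never reaches the domain controller, while every domain SID in a large backlog of events costs one query against it.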
When you first install the Splunk_TA_windows, there are already a large number of security events to ingest, usually numbering in the millions. That means that millions of SID translations are required before the Windows Event Log input catches up, causing a high load on the domain controller being queried for SID translations. You can alleviate this problem by not reading the backlog (using the current_only parameter). In addition, you can specify that the local DC is used for SID translations (using the evt_dc_name parameter). For instance, if I have a domain controller called DC1, I can use the following:
    [WinEventLog://Security]
    evt_dc_name = \\DC1
    current_only = 1
Since we are talking domain controllers here, what about the Active Directory app? The Active Directory app doesn’t actually use the SID translation. The username and domain of both the actor (who makes the changes) and the recipient (which object is changed) are embedded in the event, so no SID translation is needed. We can actually turn off the SID translation using the evt_resolve_ad_obj parameter like this:
    [WinEventLog://Security]
    evt_dc_name = \\DC1
    current_only = 1
    evt_resolve_ad_obj = 0
You can make these changes in the Splunk_TA_windows\local\inputs.conf (creating it if the file does not exist). Doing this allows you to keep all the critical events from your domain controller without adding significant load to that or other domain controllers.
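If you need to roll the same stanza out to several forwarders, you could generate it with a script rather than editing by hand. The following is a minimal sketch using Python's standard configparser module; the `render_inputs_conf` helper is a hypothetical name, and the settings are simply the ones from the example above. Check the result against the inputs.conf documentation for your version before deploying it.

```python
import configparser
import io
from typing import Dict, Mapping


def render_inputs_conf(stanzas: Mapping[str, Mapping[str, object]]) -> str:
    """Render stanza -> settings mappings as inputs.conf style text."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key case as written (configparser lowercases keys by default,
    # which would mangle a setting such as targetDc).
    parser.optionxform = str
    for name, settings in stanzas.items():
        parser[name] = {key: str(value) for key, value in settings.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


security_input: Dict[str, Dict[str, object]] = {
    "WinEventLog://Security": {
        "evt_dc_name": r"\\DC1",      # the local domain controller
        "current_only": 1,            # skip the backlog of old events
        "evt_resolve_ad_obj": 0,      # turn off SID translation
    }
}

conf_text = render_inputs_conf(security_input)
print(conf_text)
```

The generated text can then be saved as Splunk_TA_windows\local\inputs.conf on each domain controller, with evt_dc_name adjusted per host.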