In my 8+ years here at Splunk, some questions from customers and the Splunk professional community have been asked repeatedly, year after year, and questions about syslog data and how to onboard it properly are a prime example. A key question that “refuses to die” is:
As an Admin, how do I easily ingest syslog data, at scale, while removing the requirement of up-front design work and “syslog-fu”?
We’re happy to announce that we can now properly answer that question via Splunk Connect for Syslog! Splunk Connect for Syslog was developed to lift the burden of syslog data collection off administrators, and provide a turnkey, scalable, and repeatable approach for syslog data ingestion.
Specifically, Splunk Connect for Syslog (SC4S) was designed to:
- Transport syslog data into Splunk at extremely high scale (> 5 TB/day from a single instance to multiple indexers)
- Properly categorize (sourcetype) data from the most common data sources, with little to no custom configuration
- Provide enhanced data enrichment beyond the standard Splunk metadata of timestamp, host, source, and sourcetype
- Allow custom-designed “filters” for sourcetypes beyond those supported out of the box
In the rest of this blog, we will explore the reasoning behind the design of SC4S as a follow-up to the 2017 blog, "Syslog-ng and HEC: Scalable Aggregated Data Collection in Splunk," and the lessons we have learned since then. We will also provide a general overview of the process used to configure this new product, but not in a “cookbook” manner. For full details, see our documentation repository (which will be pointed out in the appropriate locations below).
But First, Some History...
Over two years ago, Ryan Faircloth and I began to explore a new architecture for syslog data ingestion, which utilizes the HTTP Event Collector (HEC) for data transport from a syslog server directly to Splunk. We were beginning to see the need for an alternative to the traditional approach (writing log messages to disk and from there to Splunk via a Universal Forwarder, or UF) due to scale and ease-of-use concerns. Since the original “Syslog and HEC” blog post was written, the trends we began to see then have only accelerated.
Specifically, we have seen the following:
- Syslog has NOT been “deprecated” or become “legacy” for most device vendors. In fact, a major security device vendor has returned to syslog after a long departure.
- Syslog data volume has increased significantly due to enterprise growth and device throughput, requiring formalization of the techniques explored in the previous blog to cope with the volume.
- Syslog continues to be a majority data type, by volume, for nearly 100% of Splunk’s customer base.
- The two major syslog server flavors (syslog-ng and rsyslog) have responded with critical enhancements to their destination support for both HTTP (HEC) and Kafka.
- Customer demand for a turnkey, scalable solution to the problem has increased significantly.
As a result, it became clear that a formal project to bring to market a scalable, consistent, and easy-to-configure solution was warranted – and SC4S was born.
Syslog GDI Challenges
Though syslog is likely Splunk’s first data source, the unique challenges of getting syslog data onboarded into Splunk have not changed much over time.
These challenges include:
- Lack of documentation and support for best practices
- Shortage of deep syslog expertise in the community
- Inconsistency between syslog server deployments creates a support challenge
- Events from many data sources are tagged with the catch-all “sourcetype=syslog”, which limits the usefulness of Splunk analytics
- Uneven data distribution between Splunk indexers, which impacts search performance
Most of these challenges stem from two basic issues with respect to syslog data in general:
- Syslog is a protocol, not a sourcetype. Customers and Splunkers alike have often lumped this kind of data into one sourcetype (e.g. sourcetype=syslog), making further analysis with SPL and “schema at read” difficult because the protocol often carries multiple data formats from the various device vendors.
- Effective parsing of the protocol and appropriate “preconditioning” of the data prior to ingestion into Splunk require the use of a syslog server – for which Splunk has not provided a prescription.
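To illustrate the first point, here is a minimal Python sketch showing that the syslog envelope is the same regardless of device: the leading `<PRI>` value encodes facility and severity (PRI = facility × 8 + severity, per the syslog RFCs), while the payload that follows differs by vendor. The function name and the sample messages are hypothetical and are not part of SC4S.

```python
import re

# Matches the leading "<PRI>" field defined by the syslog RFCs (3164 and 5424).
PRI_RE = re.compile(r"^<(\d{1,3})>")


def split_pri(message: str):
    """Return (facility, severity, rest) for a syslog line, or None if no PRI."""
    match = PRI_RE.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    # PRI = facility * 8 + severity
    return pri // 8, pri % 8, message[match.end():]


# Illustrative messages: same protocol envelope, two different payload formats.
samples = [
    "<134>Oct 11 22:14:15 fw01 %ASA-6-302013: Built outbound TCP connection",
    "<14>Oct 11 22:14:16 proxy01 CEF:0|Vendor|Product|1.0|100|Allowed|1|src=10.0.0.1",
]
for line in samples:
    facility, severity, rest = split_pri(line)
    print(facility, severity, rest[:40])
```

Both samples parse cleanly at the protocol level, yet the remaining text needs entirely different handling, which is why a single "syslog" sourcetype is not enough.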
SC4S Design Goals
It was clear from our research that the “status quo” of syslog ingest into Splunk could not continue due to complexity and scale issues, and customer feedback showed continued challenges with the problem. Thus Splunk Connect for Syslog was born, and as the formal product gelled over the course of the summer, two overarching questions were continually asked during the development process:
- What solution would significantly improve on current practices for syslog?
- How can we meet the needs of more than 80% of our customers out of the box?
We felt that if we brought the benefits below to the Splunk user community, then the solution would be viable. These benefits include:
- Lowering the burden, both on customers and Splunkers, of getting syslog data into the Splunk platform
- Providing a consistent, documented, and repeatable syslog collection infrastructure
- Providing turnkey data ingestion for 15 top sourcetypes at first release
- Improving the “data hygiene” of incoming syslog data with proper sourcetyping and enriched metadata
- Reducing Splunk overhead in processing syslog data
- Significantly enhancing scale and data distribution
Let’s examine each one of these benefits in turn.
Lowering the “GDI” Burden
First and foremost, we wanted to lower the burden on the entire community when bringing in syslog data. Unfortunately, Splunk has never offered an effective way to simply send syslog data directly to Splunk; doing so raises a myriad of issues that will not be discussed here. Suffice it to say that the practice is strongly discouraged, and rightly so. So, where does that leave the customer? The recommended practice of using a syslog server (one of two flavors – syslog-ng or rsyslog) meant that expertise in configuring it was needed. It was clear we needed to remove, or at least significantly minimize, customer exposure to the inner workings of syslog servers.
Documented, Consistent, and Repeatable
Over the years, there have seemed to be as many syslog collection architectures as snowflakes in December! A key goal was to make the configuration flexible enough for most, but consistent and repeatable for those who just want to “copy” what was there before. For this reason, we chose syslog-ng as the syslog server that serves as the foundation of SC4S due to its robust and straightforward (though not necessarily simple) syntax.
Turnkey for Top Data Sources
One of the key challenges in crafting syslog server configurations across different enterprises is the creation of “filters”, or the unique parts of the configuration that parse for specific device types (which in turn get categorized into one or more related sourcetypes). A primary goal of SC4S was the creation of filters for the top devices we see in most enterprises. This helped us meet the primary goal of “turnkey” for most (but certainly not all) customers.
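Conceptually, a filter pairs a test on the incoming message with the sourcetype to assign when that test matches. SC4S expresses filters in syslog-ng configuration; the Python below is only an analogy to show the idea. The patterns, the sourcetype names, and the fallback label are illustrative assumptions, not the shipped SC4S filters.

```python
import re

# Hypothetical filter table: (compiled pattern, sourcetype to assign).
# Real SC4S filters are written in syslog-ng configuration, not Python.
FILTERS = [
    (re.compile(r"%ASA-\d-\d+"), "cisco:asa"),
    (re.compile(r"CEF:\d\|"), "cef"),
]
FALLBACK = "example:fallback"  # illustrative name for unmatched events


def classify(message: str) -> str:
    """Return the sourcetype of the first filter whose pattern matches."""
    for pattern, sourcetype in FILTERS:
        if pattern.search(message):
            return sourcetype
    return FALLBACK


print(classify("<134>Oct 11 22:14:15 fw01 %ASA-6-302013: Built outbound TCP connection"))
print(classify("<13>Oct 11 22:14:17 host01 some unrecognized message"))
```

Supporting a new device in this model means adding one more entry to the table, which mirrors how custom filters extend the out-of-the-box set.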
Improved Data Hygiene
In the process of developing filters for the various devices, the primary goal was to properly identify data from the device and assign it a sourcetype that would work with the (existing) TA on Splunkbase. This means it should be properly “sourcetyped,” with appropriate metadata (timestamp, host, source, and destination index) sent along with the message to Splunk. In addition to the basic metadata above, it was discovered early in the design process that far deeper data enrichment could be made available in the form of indexed fields. These fields could articulate whether the device is in PCI scope, is a part of a particular BU or geography, and many other categories depending on the use case. This capability far exceeds that which was previously available with the traditional UF transport.
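As a rough illustration of what such an enriched event could look like, the sketch below builds a single JSON payload in the shape accepted by the HEC event endpoint, where the standard metadata keys sit alongside a `fields` object holding indexed fields. The helper name, the index name, and the field names (`pci_scope`, `bu`, `geo`) are assumptions made up for this example; SC4S generates its payloads inside syslog-ng, not with this code.

```python
import json


def build_hec_event(message, *, time, host, source, sourcetype, index, fields=None):
    """Build one JSON payload for the HEC "event" endpoint (hypothetical helper)."""
    payload = {
        "time": time,              # epoch seconds
        "host": host,
        "source": source,
        "sourcetype": sourcetype,
        "index": index,            # destination index
        "event": message,
    }
    if fields:
        payload["fields"] = fields  # custom indexed fields (enrichment)
    return json.dumps(payload)


# All values below are made up for illustration.
print(build_hec_event(
    "%ASA-6-302013: Built outbound TCP connection",
    time=1570832055,
    host="fw01",
    source="sc4s",
    sourcetype="cisco:asa",
    index="netfw",
    fields={"pci_scope": "true", "bu": "retail", "geo": "emea"},
))
```

Because the enrichment travels with each event as indexed fields, it is available at search time without any lookup against the original device inventory.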
Reduced Splunk Overhead
Reduced Splunk overhead was a major side effect of the transport work early in the project. When HEC was first explored as a data transport, only the “raw” endpoint was made available. Later, as significant enhancements were made to syslog-ng server, the “event” endpoint was made available, as well as the “batch” mode for bulk data transport. The combination of the event endpoint coupled with batch mode transport resulted in very low-overhead, scalable data ingest.
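To make the batch idea concrete: with the event endpoint, each message is a JSON object, and batching means stacking several such objects (not a JSON array) into the body of one HTTP POST. The Python sketch below only builds the request bodies and sends nothing; the helper name, batch size, and sample events are assumptions for illustration, and in SC4S this work is done by syslog-ng itself.

```python
import json
from itertools import islice


def batches(events, batch_size):
    """Yield request bodies, each holding up to batch_size events."""
    it = iter(events)
    while True:
        group = list(islice(it, batch_size))
        if not group:
            return
        # HEC expects stacked JSON objects in the body, not a JSON array.
        yield "\n".join(json.dumps(e) for e in group).encode("utf-8")


# Hypothetical events; each body would be sent as one POST to
# /services/collector/event with an "Authorization: Splunk <token>" header.
events = [{"event": f"message {i}", "sourcetype": "example:test"} for i in range(5)]
for body in batches(events, batch_size=2):
    print(len(body.splitlines()), "events in one POST body")
```

Fewer, larger requests amortize the per-request HTTP cost, which is one reason batching lowers overhead at high event rates.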
Scale and Data Distribution
Scale and data distribution were the historical early drivers of SC4S, as customers utilizing the traditional UF approach were running into difficulty with data distribution (where data was deposited unequally among a larger number of indexers) and scale in general. It was with this background that alternatives to the UF as a transport mechanism were explored. Earlier blogs and .conf talks highlight work done (and validated by current customers) that shows HEC as a viable transport alternative that, when properly configured, meets the needs of extremely high scale. In testing with SC4S, performance of >5 TB/day on a single SC4S instance can be realized with 5 or more indexers. The earlier .conf talk also highlights the exemplary data distribution to the indexers, which results in far faster search, particularly with the data model acceleration that is heavily used in ES and other settings.
Splunk Connect for Syslog Architecture
To realize the above benefits, SC4S is offered in two flavors: an OCI-compliant container for ease of deployment and OOTB functionality, and a “Bring Your Own Environment” (BYOE) option for maximum flexibility at the expense of some turnkey features. Each environment is based on syslog-ng server software and encapsulates the same set of syslog-ng server configurations, while differing on the details of how syslog-ng itself is instantiated. Each utilizes the following high-level architecture:
Here are the highlights of each distribution:
Container:
- The SC4S “easy button”
- Includes the latest stable branch of syslog-ng server
- All templating software included and automatically invoked at startup
- Local disk buffer for data resiliency
- Housed on the Red Hat Linux Universal Base Image (UBI) – super lightweight Linux designed for use in containers (the standard across all Splunk container projects)
- OCI-compliant and compatible with the runtime of your choice (Docker, Podman, k8s, etc.)
BYOE:
- For customers with bespoke requirements
- For customers unable to use containers for any reason
- Provides full access to underlying syslog-ng config files
- Requires:
  - CentOS or RHEL 7.0 or later Linux distro
  - Latest syslog-ng; 3.24.1 as of this writing
  - Latest gomplate templating software; version 3.5 or later (this is not necessary at runtime)
In each distribution, the underlying syslog-ng configuration is identical; the differences between container/BYOE lie solely in the customer’s preference (and limitations) as to runtime environments. Moreover, a customer utilizing a “BYOE” environment can build a custom container that includes local config elements, which would then be suitable for edge distribution by automated orchestration.
In part 2 of this blog, we will explore the high-level configuration of Splunk Connect for Syslog, which is fully documented in the resources below.
Splunk Connect for Syslog Resources
- Part 2: Splunk Connect for Syslog Configuration Overview
- Part 3: Custom Data Source Configuration Walkthrough (coming soon!)
- Support & Community Discussions: https://splunk-usergroups.slack.com #splunk-connect-for-syslog
- Issues or Enhancements: https://github.com/splunk/splunk-connect-for-syslog/issues
We wish you the best of success with SC4S. Get involved, try it out, ask questions, contribute new data sources, and make new friends!