In my last blog entry, I wrote about asking vendors to make their log event formats follow industry best practices. Now, if the log events reside in files or can be broadcast on network ports, it is quite easy to access them with technologies such as Splunk Universal Forwarders. However, if the log events are buried deep within the application, device, or system that created them, then there is one more issue to address: having an accessible transport mechanism, with examples of its usage.
By transport, I obviously am not referring to some futuristic vehicle transportation.
What I am talking about is a way for one computer process to get to the data using an acceptable protocol and an easy-to-use API, so that the transport mechanism becomes a standard part of log event accessibility within networks.
This means that for data that is not necessarily in files or available on listening TCP or UDP ports (e.g., syslog), connectivity to the machines or devices holding the data, APIs to gather it, and examples of how to do this should be part of every vendor’s playbook for accessing log events.
In my last conversation with the Splunk Ninja, Michael Wilde, he mentioned that it would be great if all vendors could make their log event data available via HTTP (or HTTPS) GET or POST calls. Since Michael likes to ingest Twitter feeds via their REST API, which returns events in JSON format, it naturally follows that he would like this to be a more general practice, as it is easy and follows a standard that everyone can understand. (What’s nice about Twitter is that they return an ID number for a request, and in the next request, provided you give that ID number, you will get only the events that are new since the last request.)
The point of this article is that we should be asking vendors whose log events are not in files or streamed on network ports to make their event data easily accessible via a transport, protocol, API, and examples. After all, what good are log events that follow best practices if they are not accessible? I’ll list some common examples I’ve seen for accessing log events over a transport.
As alluded to before, this is one of the more popular ways to gather Big Data. My radio station, YouTube, Twitter, and general RSS feed apps and add-ons on Splunkbase all rely on this mechanism. If a vendor can simply provide standard HTTP GET or HTTP POST access to their log events, ingesting them within a Big Data engine becomes easy. Furthermore, if an ID can be returned so that only newer events are returned upon each request, as in the Twitter example, that makes it even better.
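The ID-based polling pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual API: the endpoint URL, the `since_id` parameter, and the `events`/`id` field names in the JSON are all assumptions for the example.

```python
import json
import urllib.parse

# Hypothetical endpoint -- a real vendor's API docs would define this.
BASE_URL = "https://logs.example.com/api/events"

def build_poll_url(since_id=None):
    """Build a GET URL that asks only for events newer than since_id."""
    params = {"format": "json"}
    if since_id is not None:
        params["since_id"] = since_id  # assumed parameter name
    return BASE_URL + "?" + urllib.parse.urlencode(params)

def parse_events(body):
    """Parse a JSON response; return (events, max_id) for the next poll."""
    payload = json.loads(body)
    events = payload.get("events", [])
    # Remember the highest ID seen so the next request skips old events.
    max_id = max((e["id"] for e in events), default=None)
    return events, max_id
```

Each poll feeds the `max_id` from the previous response back into `build_poll_url`, so the collector never re-ingests events it has already seen.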
If the vendor is pushing data over HTTP (rather than our having to pull it), I’ve also provided a reference implementation in the past using a Servlet to listen for this data and Log4J to output individual events.
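The same listen-and-log idea can be sketched outside of Java as well. The snippet below is a minimal Python analogue of that Servlet-plus-Log4J approach, assuming the vendor POSTs newline-delimited events; the logger name and one-event-per-line convention are choices made for this example.

```python
import http.server
import logging

# Each line of the POST body becomes one log event, which a forwarder
# could then pick up from wherever this logger is configured to write.
logger = logging.getLogger("pushed_events")

class EventReceiver(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        for line in body.splitlines():
            if line.strip():
                logger.info(line)  # emit one event per non-empty line
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request access log

def serve(port=0):
    """Create the receiver; port=0 lets the OS pick a free port."""
    return http.server.HTTPServer(("127.0.0.1", port), EventReceiver)
```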
Queues are meant to have producers and consumers. If a vendor can publish log events to accessible queues, we can all create clients to consume these events for further processing, such as indexing them within Splunk. Examples of well-known queuing mechanisms are JMS and MQ Series.
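The producer/consumer shape is the same regardless of the queuing product. As a language-agnostic sketch, the snippet below uses Python's in-process `queue.Queue` as a stand-in for a real broker such as JMS or MQ Series; the sentinel-based shutdown is a convention for this example, not part of any vendor API.

```python
import queue
import threading

def consume(q, sink, sentinel=None):
    """Drain log events from q into sink until the sentinel appears."""
    while True:
        event = q.get()
        if event is sentinel:
            break
        sink.append(event)  # in practice: forward to the indexer

def publish(q, events, sentinel=None):
    """Publish a batch of events, then signal the end of the stream."""
    for e in events:
        q.put(e)
    q.put(sentinel)
```

With a real broker, `q.get()` becomes a blocking receive on the vendor's queue, and the consumer runs continuously rather than stopping at a sentinel.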
Here is a transport that is not often discussed for collecting log events, but it should be a very important way to collect heartbeats from vendor products. The idea is that you would listen on a multicast port to ingest events for further processing. The lack of events for a threshold of time may indicate that something is wrong with the infrastructure, which may lead you to set up an alert within Splunk to take corrective action.
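A heartbeat listener has two parts: joining the multicast group, and deciding when silence has gone on too long. The sketch below shows both; the group address, port, and threshold are placeholders, since a real vendor would publish its own.

```python
import socket
import struct
import time

# Illustrative group/port -- use whatever the vendor documents.
MCAST_GROUP = "224.1.1.1"
MCAST_PORT = 51000

def open_multicast_listener(group=MCAST_GROUP, port=MCAST_PORT):
    """Join a multicast group; the returned socket recv()s heartbeat events."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s",
                       socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def heartbeat_stale(last_seen, threshold_secs, now=None):
    """True if no heartbeat arrived within the threshold -- time to alert."""
    now = time.time() if now is None else now
    return (now - last_seen) > threshold_secs
```

A monitoring loop would record the timestamp of each received datagram as `last_seen` and raise an alert (or let Splunk do so) whenever `heartbeat_stale` turns true.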
A vendor may expose their log events or time series data via web services. As long as the service is accessible and provides a WSDL and examples of how to use it, this is a standard mechanism for gathering the data. A few years ago, in my first few months at Splunk, I proved this out with a couple of examples that are still on Splunkbase.
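At the wire level, calling such a service is just a SOAP envelope POSTed over HTTP. The sketch below builds such a request; the service URL, namespace, `GetLogEvents` operation, and `sinceId` element are all invented for illustration — a real vendor's WSDL defines the actual names.

```python
import urllib.request

# Hypothetical envelope; the operation and namespace come from the WSDL.
SOAP_ENVELOPE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetLogEvents xmlns="http://example.com/logservice">
      <sinceId>{since_id}</sinceId>
    </GetLogEvents>
  </soap:Body>
</soap:Envelope>"""

def build_request(url, since_id):
    """Build a SOAP-over-HTTP POST asking for events newer than since_id."""
    body = SOAP_ENVELOPE.format(since_id=since_id).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "http://example.com/logservice/GetLogEvents"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and parsing the XML response would complete the collection loop; in practice a SOAP client generated from the WSDL does both steps for you.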
A database is not a transport, but many vendors keep their event data locked up within databases. This by itself is not bad, as long as the vendor can somehow provide a transport and a credential-based API to get to the data. The most bare-bones example of this is having the data within an RDBMS and then using something like JDBC to access it. We would still need to know the name of the table and the minimum parameters for JDBC access.
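Once you know the table and how to connect, collection reduces to repeatedly selecting rows newer than the last one seen. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for a vendor RDBMS reached via JDBC; the `log_events` table and its `id`/`message` columns are assumptions for the example, not a vendor standard.

```python
import sqlite3

def fetch_new_events(conn, since_id):
    """Return (rows, max_id) for events with id greater than since_id."""
    cur = conn.execute(
        "SELECT id, message FROM log_events WHERE id > ? ORDER BY id",
        (since_id,))
    rows = cur.fetchall()
    # Carry the highest ID forward so the next poll skips what we have.
    max_id = rows[-1][0] if rows else since_id
    return rows, max_id
```

Exactly as with the HTTP ID example earlier, feeding `max_id` back into the next call gives incremental, duplicate-free collection from the database.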
I could go on and list other means of transport, such as SNMP or FTP, but I think I’ve made my point with these examples. If we can ask vendors to make their log events easily accessible via a known transport with examples, along with having them follow best practices for creating those events, an interconnected world of actionable machine data becomes even more possible, benefiting all interested parties.