TIPS & TRICKS

The Naughty Bits – How to Splunk Binary Logfiles

You’ve whipped the winevent streams, deftly dealt with daunting data inputs, and perhaps even concocted clever code for scripts.  But while most logs are encoded in ASCII (or at least half-ASCII), there are still devices and applications out there which produce binary-encoded logfiles.  What if you have to Splunk such a beast?  Getting this data into Splunk requires a little extra work, but it is a straightforward process.  It will require some scripting skills (in your favorite language, such as Perl, Python or Java), access to vendor reference manuals, some hexadecimal conversions, perseverance, and a ready supply of your favorite code-slinging beverage.

Bit Wise and Byte Foolish

Binary-encoded logfiles seemed like a good idea when they were introduced during the Eisenhower administration: they are compact, taking up minimal space on your punchcard-based storage; they are very orderly, which appealed to the pre-open-source mindset of the day; and they usually require special utilities just to read the data, which appealed to vendors wanting to sell more software.  Many voice switch systems, from vendors such as Huawei, Ericsson and Lucent, still produce binary-encoded logfiles.

The log data will be output in fixed-length records, composed of various fields of data.  Just to keep it interesting, the fields will probably be encoded in a variety of different formats (a decoding sketch follows this list), such as:

  • Binary Coded Decimal – Each decimal digit occupies four bits, and only hex values 0-9 are used.  Thus, a two-byte value which looks like this:

      1st byte:  Bit 7  Bit 6  Bit 5  Bit 4  Bit 3  Bit 2  Bit 1  Bit 0
      Value:       0      0      0      1      0      1      0      0

      2nd byte:  Bit 7  Bit 6  Bit 5  Bit 4  Bit 3  Bit 2  Bit 1  Bit 0
      Value:       0      0      0      0      1      0      0      1

    … and translates to a decimal value of "1409" (0001 0100 holds the digits "14", and 0000 1001 holds "09").  Get it?

  • Little Endian – This means that the least-significant byte comes first – backwards from the conventional view of how things should be laid out.  A four-byte little-endian value looks like this:

      1st byte  2nd byte  3rd byte  4th byte
        04        00        00        00

    … and translates to a hex value of "00000004", or decimal "4".

  • Big Endian – No, not Tonto.  This means that the most-significant byte is first.  A four-byte big-endian value looks like this:

      1st byte  2nd byte  3rd byte  4th byte
        04        00        00        00

    … and translates to a hex value of "04000000", or decimal "67108864".

  • ASCII – Occasionally the vendor will slip up and encode fields in plain-ole-ASCII.  But not often.
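
To make these formats concrete, here is a minimal Perl sketch (using the unpack function, which we will lean on heavily later) that decodes each of the example values from the list above:

#!/usr/bin/perl
# Decode the example values shown in the list above.
use strict;
use warnings;

my $bcd = "\x14\x09";                # Binary Coded Decimal
print unpack("H4", $bcd), "\n";      # prints "1409" - the hex digits are the decimal digits

my $little = "\x04\x00\x00\x00";     # little-endian 32-bit integer
print unpack("V", $little), "\n";    # prints "4"

my $big = "\x04\x00\x00\x00";        # big-endian 32-bit integer
print unpack("N", $big), "\n";       # prints "67108864"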

The Plan of Attack

Since we will be creating a script to convert binary log records into Splunk-ready ASCII, we might as well make it as Splunk-friendly as possible.  Thus, the script should:

  • Take input data from STDIN, and output converted data directly to STDOUT.  This will allow Splunk to stream the data inputs, in much the same way that it handles compressed (i.e., gzip’d) files.
  • Output the data in key-value pairs, so that Splunk will auto-magically extract the field names and corresponding values.  Use a format like this (a skeleton of the full flow follows this list):

    FIELDNAME1="Field value 1",FIELDNAME2="Field value 2", …
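
For instance, the bare-bones skeleton of such a script might look like this (the record length and field names are placeholders for your own format):

#!/usr/bin/perl
# Skeleton: read fixed-length binary records from STDIN, write
# key-value text to STDOUT.  Placeholders only - adapt to your format.
use strict;
use warnings;

my $record_length = 17;    # whatever your record size is
binmode(STDIN);
my $record;
while (read(STDIN, $record, $record_length) == $record_length) {
    # ... unpack and translate the fields of $record here ...
    print qq{FIELDNAME1="Field value 1",FIELDNAME2="Field value 2"\n};
}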

After creating a conversion script, make the appropriate entries in inputs.conf and props.conf, then start Splunking!

Gotchas

Here are some potential problem areas and recommendations:

  • Build in a debug option for your script to output in a more human-friendly format, for initial development and later troubleshooting.  Consider adding extra linefeeds, record counters and similar niceties.
  • Make sure to build in a way to sanity-check whether your utility gets “off” on record boundaries.  Since we are processing a sea of bits, it is not intuitively obvious if our script derails and starts processing the wrong data from the wrong places in a record.  Choose a field which has a predictable value, such as sequential record numbers or the year portion of a date stamp.  Add a “valid_record” field or some other means of detecting that the conversion has derailed and that the field values may be bogus for this record (see the sketch after this list).
  • Whenever possible, go ahead and perform the lookup for enumerated values.  Thus, instead of showing User_Type=”1″, show User_Type=”PREPAID_SUBSCRIBER”. This will simplify the Splunk configuration, and improve the overall speed of processing records.
  • I prefer code which requires a minimum of external modules and dependencies.  This will make it more portable, though it may require some extra coding.  And the extra suffering will build character.
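
Here is a minimal sketch of that record-boundary sanity check, assuming the year from the record's timestamp is the predictable field being tested:

# Sketch of a "valid_record" check: the year from the record's timestamp
# should fall in a plausible range; anything else suggests the script
# has drifted off a record boundary.
sub valid_record {
    my ($year) = @_;
    return ($year >= 1990 && $year <= 2037) ? "TRUE" : "FALSE";
}

# ... and emit it alongside the other fields, e.g.:
# printf qq{Valid_Record="%s",...}, valid_record($year);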

Example

Let’s look at an example, based on binary-encoded Call Data Records from a Well Known Switch Vendor, and a Perl script to convert them.  For the sake of brevity, we will assume a record which contains only four fields, rather than the actual record which contains 100 fields (you’re welcome).  I have written the conversion script in Perl, my hack-ware of choice.  I have done just enough software development to appreciate Real Software Developers, in the same way that I have sweated just enough copper pipes to appreciate competent plumbers.  Thus, I make no claims that my code is the most elegant or the most efficient.

Let us assume that each binary record is 17 bytes long, and contains the following fields:

  • Serial_Number – Sequential record serial number, 4 bytes in little-endian format.  Example:

      1st byte  2nd byte  3rd byte  4th byte
        01        02        03        04

    This translates to a Serial_Number of 04030201 (hex), or 67305985 (decimal).

  • CDR_type – Enumerated value, 1 byte, mapped as follows:

      Decimal Value   CDR Type
      0               IWFQNC
      1               PDSN_BILL
      2               ROAM
      anything else   Unknown

  • Charge_start_time – Indicates year, month, day, hour, minute, and second, in sequence.  The year occupies two bytes in little-endian format (lower byte first); the month, day, hour, minute, and second each occupy one byte.  Example:

      1st byte  2nd byte  3rd byte  4th byte  5th byte  6th byte  7th byte
        CE        07        0A        0B        08        16        1A

    This translates to a Charge_start_time of “1998/10/11 08:22:26” (0x07CE = 1998, 0x0A = 10, and so on).

  • Caller_party_number – 10-digit phone number, in Binary Coded Decimal (BCD).  5 bytes, with 2 decimal digits encoded in each byte, in big-endian order.  Example:

      1st byte  2nd byte  3rd byte  4th byte  5th byte
        30        32        92        37        76

    This translates to a Caller_party_number of “3032923776”.

My Perl script processes the data in three steps:

  1. Grab data in record-sized 17-byte chunks, assigning the data into “raw” variables.
  2. Rearrange/tweak the raw variables as necessary to derive the correct, translated, ready-to-output values.
  3. Output the translated records in pretty, Splunk-ready format.

The script below makes extensive use of the ‘unpack’ function; we leave it as an Exercise For The Student to learn the nuances of Perl’s ‘pack’ and ‘unpack’.

Here is a sample Perl script to process the binary-coded records described above.  Take it as a minimal sketch, built strictly from the record layout just described; a real vendor format would need its own offsets, enumerations and sanity checks.
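
#!/usr/bin/perl
# cdr2text.pl - convert 17-byte binary CDR records to Splunk-ready text.
# A sketch for the record layout described above; adjust the offsets and
# enumerations for your vendor's actual format.
use strict;
use warnings;

my %cdr_types = ( 0 => "IWFQNC", 1 => "PDSN_BILL", 2 => "ROAM" );

binmode(STDIN);
my $record;
while (read(STDIN, $record, 17) == 17) {

    # Step 1: carve the 17-byte record into raw fields.
    my ($raw_serial, $raw_type, $raw_time, $raw_caller) =
        unpack("a4 a1 a7 a5", $record);

    # Step 2: translate the raw fields.
    my $serial = unpack("V", $raw_serial);         # 4 bytes, little-endian

    my $type_code = unpack("C", $raw_type);
    my $type = exists $cdr_types{$type_code} ? $cdr_types{$type_code} : "Unknown";

    my ($year, $mon, $day, $hour, $min, $sec) =
        unpack("v C C C C C", $raw_time);          # year is 2 bytes, little-endian
    my $start = sprintf("%04d/%02d/%02d %02d:%02d:%02d",
        $year, $mon, $day, $hour, $min, $sec);

    my $caller = unpack("H10", $raw_caller);       # BCD: hex digits are the phone digits

    # Step 3: output the translated record in Splunk-ready key-value format.
    print qq{$start Serial_Number="$serial",CDR_type="$type",}
        . qq{Charge_start_time="$start",Caller_party_number="$caller",\n};
}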

Test It

From the command line, test the script in this way:

# cat binary_logfile | cdr2text.pl

It should produce output like this:

2010/12/02 04:07:39 Serial_Number="2125080384",CDR_type="PDSN_BILL",Charge_start_time="2010/12/02 04:07:39",Caller_party_number="4145559199",
2010/12/02 04:07:53 Serial_Number="2125080385",CDR_type="PDSN_BILL",Charge_start_time="2010/12/02 04:07:53",Caller_party_number="2195556490",
2010/12/02 03:59:18 Serial_Number="2125080386",CDR_type="IWFQNC",Charge_start_time="2010/12/02 03:59:18",Caller_party_number="3315552717",
. . .

Looking good!

Splunk It

Add a stanza to inputs.conf, similar to this:

[monitor:///var/log/cdr_data/*]
disabled = 0
followTail = 0
host = voice_switch
index = cdr
sourcetype = cdr_binary

Add a stanza to props.conf to pre-process the binary logs. Make sure to use the same sourcetype as the inputs.conf entry:

[cdr_binary]
invalid_cause = archive
unarchive_cmd = /usr/local/bin/cdr2text.pl

And the rest is history.

Conclusion

Although the majority of logged event data is text/ASCII, there are still systems which generate binary-encoded log data, including a number of voice switch products.  Pulling this data into Splunk can yield extremely valuable insights, such as call volumes, per-user trends and fraud/abuse analysis. With a little scripting, such data can be readily converted and streamed into Splunk. Amaze your coworkers, and possibly even your boss!

Posted by David Millis