Hadoop 2.0 rant

By Splunk

Here we go, time for another rant about Hadoop, this time about Hadoop 2.0. You can read the first rant here.

The rant this time is about Yarn and the way it stores the application logs.

Let’s start with looking at what the main uses of log files are and try to come up with a set of best practices for them. A log file can be used:

by humans, to troubleshoot application problems
by humans, to analyze/understand application behavior
by other applications, to monitor/alert/react to certain application behaviors
as a method of auditing/recording activity within an application (e.g. user action auditing)
etc

From these use cases a few high level best practices fall out almost immediately, the log files should:

be (easily) accessible by developers, sys admins, etc
be in a format that is easily readable, e.g. text file, timestamped etc
contain meaningful/useful information that can be easily extracted

Are we in agreement so far?

Yarn offers a feature called “log aggregation”, basically collect application logs into a location in HDFS – which is great as this improves accessibility – the logs can be accessed from one central (alas distributed) location.

Now, back to the rant … Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”. Hmmm okay … I wonder how one can generally store application logs into a key-value container!? Maybe key=time, value=log-message? Maybe key=log-level, value=log-message?

Let’s not speculate and dig into the Yarn code and see how it uses the TFile – you can find the Java code here and it’s uses here. The logic from the NodeManager goes something like this:

  wait for an application to finish
  for each container in the application (executed on this node)
     collect all the logs and write them in a TFile in HDFS
     using the following format:

KEY	VALUE
VERSION	log file format version
APPLICATION_OWNER	application owner
APPLICATION_ACLS	application access control
[container-id]	encoded list of container log files

So much for readability and accessibility of these logs!! Anyone who’s interested in reading these logs (human or application) would need to know how to read a TFile, know how to decode the list of container log files and then and only then be able to access these logs – I don’t think that’s much better than digging into the machines where these logs resided in the first place!

I just don’t see what the point of using a TFile is? Application owner and acls are redundantly stored on every single file, why? They should just be stored separately as a metadata file into the log directory and used to control access from there.

Encoding the list of log files is flat out a bad idea! In order to read one of the files you’d have to decode all the previous log files encoded before it in the value list.

So, if the key-value pairs stored in the TFile are not needed, if storing and encoding all log files in a single byte stream is “less than ideal”, why use the TFile at all?

Sounds like over-engineering to me, let me know what you think in the comments section below …

----------------------------------------------------
Thanks!
Ledion Bitincka

Splunk

The world’s leading organizations trust Splunk to help keep their digital systems secure and reliable. Our software solutions and services help to prevent major issues, absorb shocks and accelerate transformation. Learn what Splunk does and why customers choose Splunk.

Tips & Tricks 2 Min Read

Integrating Active Directory into Splunk with SA-ldapsearch

Tips & Tricks 2 Min Read

Announcing the Splunk Add-on for Check Point OPSEC LEA 2.1.0

New Splunk Add-on for Check Point has streamlined interface, connection name filtering & pagination help manage connections, workflow fixes CA problem.

Tips & Tricks 3 Min Read

Look at all the pretty colors!

Splunk Power User Bootcamp at .conf2014 decides to assign colors to event types in standard web-based GUI; used users, events & ‘risklevel’ & assigned colors.

About Splunk

The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.

Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.

Learn more about Splunk

Hadoop 2.0 rant

Related Articles

Integrating Active Directory into Splunk with SA-ldapsearch

Announcing the Splunk Add-on for Check Point OPSEC LEA 2.1.0

Look at all the pretty colors!

About Splunk

Subscribe to our blog

Connect with Splunk on X

Connect with Splunk on Instagram