
Many Splunk customers have very diverse Java application server environments: WebSphere, WebLogic, GlassFish, JBoss, your favorite app server, all running side by side and supporting tens or hundreds of corporate and customer-facing applications. We find that Splunk is used very often in such environments, precisely because of its ability to work with ANYTHING. So when we get the question “does Splunk work with WebSphere?” or “will Splunk work with my custom, unique, extremely critical application environment?”, we feel the need to emphasize again: Splunk works with ANY and ALL text-based, machine-generated data (yes, even multi-line data such as stack traces).
What data should you get into Splunk?
Have Splunk index your log files, monitor your configuration files, and pull metrics. Let’s use WebSphere Application Server (WAS) as an example. Say you’ve recently been assigned to support a WebSphere-based application server environment and want to use Splunk for monitoring and troubleshooting. WAS generates a standard set of log files, such as the Java virtual machine (JVM) logs and process logs, and each provides a different kind of insight into your environment. Typically you want to index the JVM log files, which contain the application server-related messages. You would install a lightweight Splunk forwarder on the machine running the application server and use the Splunk option to “monitor files and directories”, pointing it at the directory containing your JVM logs. The JVM logs contain multi-line events such as stack traces and certain messages generated by WebSphere, which makes the lightweight Splunk forwarder the best option for sending the events to a remote Splunk server. If your application writes its own logs through Commons Logging, Log4j, etc., those can be monitored with the same forwarder.
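To keep those multi-line events intact, you can tell Splunk where event boundaries fall. Here is a minimal props.conf sketch; the sourcetype name websphere_jvm_log is made up for this example, and the regex assumes events begin with the default SystemOut.log timestamp prefix (e.g. [10/13/11 12:34:56:789 PDT]), so adjust both for your environment and locale.

    # props.conf (indexer or heavy forwarder) -- hypothetical sourcetype name;
    # the regex assumes events start with the default WAS timestamp, e.g.
    # [10/13/11 12:34:56:789 PDT]
    [websphere_jvm_log]
    SHOULD_LINEMERGE = true
    BREAK_ONLY_BEFORE = ^\[\d{1,2}/\d{1,2}/\d{2}
    TIME_PREFIX = ^\[
    MAX_TIMESTAMP_LOOKAHEAD = 30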
By default, WebSphere writes its logs to $WAS_PROFILE/logs, where $WAS_PROFILE is the directory in which your WebSphere profile was created. In the logs directory you will find a subdirectory for each application server defined on that node, as well as an “ffdc” directory. Be sure to monitor the files in each of those directories.
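One recursive [monitor] stanza on the forwarder can cover the whole logs tree, including the per-server directories and ffdc. This is a rough sketch: the profile path is just an example, and assigning a single sourcetype to everything under logs is a simplification (the ffdc files use a different layout than SystemOut.log).

    # inputs.conf on the forwarder -- example profile path; adjust to your
    # actual $WAS_PROFILE location
    [monitor:///opt/IBM/WebSphere/AppServer/profiles/AppSrv01/logs]
    recursive = true
    sourcetype = websphere_jvm_log
    disabled = false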
What types of issues would you look for in a WAS environment?
Well, hung threads are a very common problem with J2EE applications: they tend to hog the CPU and memory of the machine and usually cause application outages later on. IBM WebSphere lets you set a policy for when a thread should be declared “hung”, and when it detects that a thread has exceeded this time it logs a warning that looks like “WSVR0605W: Thread threadname has been active for hangtime and may be hung. There are totalthreads threads in total in the server that may be hung.”
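A simple way to keep an eye on these is a search over the JVM logs. This is a sketch: the sourcetype is the hypothetical one from the earlier examples, and the rex assumes the thread name appears in quotes in the actual message, so check it against your own logs.

    sourcetype=websphere_jvm_log "WSVR0605W"
    | rex "Thread \"(?<thread_name>[^\"]+)\" has been active"
    | stats count, latest(_time) AS last_seen BY host, thread_name
    | convert ctime(last_seen)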
Another example is an unexpected application server restart. When this occurs, the node agent will be unable to reach the application server process. In this case, you can see messages like ADML0063W and ADML0064I in the SystemOut.log file of the node agent, and a drilldown search with Splunk will let you see if there was any corresponding “OutOfMemoryError” or “Hang” condition in the application server that caused the failure.
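For that drilldown, a single time-sorted search across the node agent and application server logs puts the restart warnings next to any OutOfMemoryError or hung-thread messages around the same time; the sourcetype is the hypothetical one used above.

    sourcetype=websphere_jvm_log ("ADML0063W" OR "ADML0064I" OR "OutOfMemoryError" OR "WSVR0605W")
    | sort 0 _time
    | table _time, host, source, _raw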
The best thing to do is to create alerts in Splunk for messages such as the above; a sample alert configuration is sketched after the link below. That way, you can identify suspicious out-of-control threads and fix them before the user even notices any issues. For more info on setting hang detection policies, see this handy link:
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.multiplatform.doc/info/ae/ae/ctrb_hangdetection.html
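As a sketch of what such an alert could look like in savedsearches.conf, the stanza below runs the hung-thread search every 15 minutes and sends an email when anything matches; the stanza name, schedule, and recipient are placeholders.

    # savedsearches.conf -- hypothetical alert definition; adjust the search,
    # schedule, and recipient for your environment
    [WAS hung thread warning]
    search = sourcetype=websphere_jvm_log "WSVR0605W"
    dispatch.earliest_time = -15m
    dispatch.latest_time = now
    cron_schedule = */15 * * * *
    enableSched = 1
    alert_type = number of events
    alert_comparator = greater than
    alert_threshold = 0
    action.email = 1
    action.email.to = oncall@example.com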
JVM log files provide a great deal of intelligence about application execution. Using Splunk, you can extract fields associated with event types in WebSphere JVM logs and easily drill down to find the methods or classes that are throwing too many errors, taking too long to respond, and so on.
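For instance, the short logger name and severity letter in the default SystemOut.log layout can be pulled out with a rex and ranked. The field names here are illustrative (not built in), and the regex is an assumption about the basic log format, so adjust it to match your logs.

    sourcetype=websphere_jvm_log
    | rex "^\[[^\]]+\]\s+\S+\s+(?<short_name>\S+)\s+(?<severity>[A-Z])\s"
    | search severity="E" OR severity="W"
    | top limit=10 short_name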
A very common use of Splunk is to measure response times. Since logs contain timestamps for everything, and Splunk captures and interprets those timestamps, it becomes really easy to measure how long it took for method “Coffee” to complete its “grind beans”, “filter” and “steam milk” tasks. We have recently put together an awesome document that tells you how to do many of these things with Splunk.
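As a sketch, the transaction command can stitch the related log lines together and compute the elapsed time between the first and last event. The orderId field and the start/end strings are made up for the coffee example; substitute whatever ties your own events together.

    sourcetype=my_app_log ("grind beans" OR "steam milk")
    | transaction orderId startswith="grind beans" endswith="steam milk"
    | stats avg(duration) AS avg_response_secs, max(duration) AS max_response_secs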
Monitoring configurations.
Why should you monitor config files? Application performance issues or outages are often caused simply because someone changed something in the environment. Code changes aside, configuration is the other major source of change. Splunk can directly monitor WebSphere config files, usually located under $WAS_PROFILE/config/cells. fschange monitoring of the entire tree of files under that directory will notify you when a file is changed through normal methods AND when someone manually edits one of the files on disk (e.g. setting enabled=”false” in security.xml).
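A minimal fschange stanza on the forwarder might look like the following; the profile path is an example and the poll interval is just a placeholder.

    # inputs.conf -- fschange monitoring of the WebSphere configuration tree;
    # example path, adjust to your $WAS_PROFILE
    [fschange:/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/config/cells]
    recurse = true
    pollPeriod = 600
    fullEvent = true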
In addition, you can also search the SystemOut.log files for the WebSphere message code ADMR0016I; this will list each XML configuration file changed through normal methods, such as the administration console, wsadmin, etc. It will also tell you which user performed the change (although not what changed within the file).
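A quick search along these lines turns that message into a change report; the rex is a guess at the message layout (user, action, document), so verify it against the actual ADMR0016I text in your logs.

    sourcetype=websphere_jvm_log "ADMR0016I"
    | rex "ADMR0016I:\s+User (?<who>\S+) (?<action>\w+) document (?<document>\S+)"
    | table _time, host, who, action, document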
You may also want to monitor files under the profile’s “etc” and “properties” directories for changes.
And for metrics: stay tuned, we have an add-on coming that will help you pull JVM metrics (and logs and configs) very easily. If you’re interested in trying it out, drop me an email at ljoshi AT splunk DOT com so we can provide you with an upcoming beta and you can give us some feedback! Happy Splunking!
----------------------------------------------------
Thanks!
Leena Joshi