Enterprises are adopting microservices architectures and continuous delivery practices, and embracing DevOps culture. This is the foundation of a modern, “cloud-native” business. At Pivotal, we help companies make this transformation with our Pivotal Cloud Foundry product.
Our customers want to extend the utility of Splunk to include their new cloud-native apps running on Cloud Foundry. To this end, we’ve been working on an integration between these two products. This post reviews our progress so far, and concludes with an invitation to our private beta program.
What is Pivotal Cloud Foundry?
Pivotal Cloud Foundry is a platform, based on open source software, for deploying and operating applications. These apps can be deployed both on-premises and in the top public clouds. The product supports deploying a wide variety of languages (Java, .NET, Node.js, Python, and many others) in a uniform way.
Want to iterate on custom code quickly, but don’t want to re-solve all the problems of building a platform, container orchestration, and elasticity? Then Pivotal Cloud Foundry is worth a look! To learn more, visit the Pivotal Cloud Foundry platform site.
Metrics and logging are a big part of the platform, so let’s jump right into the integration with Splunk.
Cloud Foundry Logging Overview
Loggregator is Cloud Foundry’s logging system. It aggregates and streams logs from all user applications and platform components. The key concepts are:
- Individual application developers connect to Loggregator to examine the logs of their app
- Cloud Foundry operators use this same system to monitor the platform itself
- The Firehose is a Loggregator feature that combines the stream of logs from all apps with metrics data from CF components
- A nozzle connects to the Firehose via WebSocket to receive these events
I ran a Splunk nozzle locally and captured all events in one of my team’s test environments. This resulted in ~170 events per second (EPS). The average event size varies with the mix of metrics and logs, and with the size of each application’s custom logs. Assuming a conservative average event size of 350 bytes, this translates to roughly 5 GB/day of valuable data. This was a small environment (31 VMs), configured for high availability (i.e. nothing was scaled out, but redundancy was configured across availability zones).
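For anyone who wants to redo that estimate with their own numbers, here is a minimal Python sketch of the arithmetic. The function name is purely illustrative, and the calculation assumes a steady event rate, which a real platform won't have.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_bytes(events_per_second, avg_event_bytes):
    """Estimate bytes per day, assuming a constant event rate."""
    return events_per_second * avg_event_bytes * SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from the test environment described above.
    total = daily_volume_bytes(170, 350)
    print(f"{total / 1e9:.2f} GB/day (decimal units)")    # 5.14
    print(f"{total / 2**30:.2f} GiB/day (binary units)")  # 4.79
```

Depending on whether you count in decimal or binary units, the result lands just above or just below 5 GB/day.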
At the other end of the spectrum, we recently did some Cloud Foundry scale testing, running applications in 250,000 containers. In a larger environment like that, which is common within Pivotal’s customer base, the underlying platform is over 1,500 VMs (roughly 50 times bigger than my test example). Imagine the amount of data that would generate!
With that many platform events, a solution like Splunk Enterprise is really useful for understanding what’s going on, which is where the new Splunk nozzle for Cloud Foundry Firehose proves helpful.
Splunk + Cloud Foundry
Now for a concrete example: let’s take a look at a message sent by the Gorouter service in Cloud Foundry. This component routes incoming traffic for both applications and the platform itself. The router periodically reports the total number of requests. Here’s a single message from the nozzle.
The “job” is router; this is the component that’s reporting. Components are scaled out, so “job_index” identifies the individual VM that’s reporting. “CounterEvents” are strictly increasing until that instance of a component is restarted. The name is what’s getting counted, and each component reports several values.
After tracking this metric over time, we can run a Splunk search to translate all this data into an interesting graph for this part of the platform:
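Because each counter belongs to one instance and starts over when that instance restarts, anything that consumes these events has to track them per (job, job_index, name). The Python sketch below shows one way to do that. The field names job, job_index, name and total follow the message described above. The sample values are invented, and treating any drop in the total as a restart is my assumption, not documented platform behavior.

```python
def counter_increments(events):
    """Yield (job, job_index, name, increment) for successive readings.

    `events` is an iterable of dicts in arrival order. The first reading
    for each counter only sets a baseline, so it yields nothing.
    """
    last_total = {}
    for event in events:
        key = (event["job"], event["job_index"], event["name"])
        total = event["total"]
        previous = last_total.get(key)
        if previous is not None:
            if total >= previous:
                increment = total - previous
            else:
                # Assumption: a lower total means the instance restarted
                # and began counting again from zero.
                increment = total
            yield (*key, increment)
        last_total[key] = total


if __name__ == "__main__":
    # Invented sample data: two router instances, one of which restarts.
    sample = [
        {"job": "router", "job_index": "0", "name": "total_requests", "total": 100},
        {"job": "router", "job_index": "1", "name": "total_requests", "total": 40},
        {"job": "router", "job_index": "0", "name": "total_requests", "total": 130},
        {"job": "router", "job_index": "0", "name": "total_requests", "total": 12},
        {"job": "router", "job_index": "1", "name": "total_requests", "total": 45},
    ]
    for row in counter_increments(sample):
        print(row)
```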
From the graph, it’s obvious this is a test environment: there are only a handful of requests per minute. To demonstrate a situation that might warrant investigation, I deployed an app and started making continual requests against it.
The chart shows nearly an order of magnitude more incoming requests: the sort of event an operator might want to examine further (perhaps to scale out components).
Here’s the underlying Splunk query:
sourcetype="cf:counterevent" source="router*" name=total_requests
| timechart span=2m max(total) as t
| streamstats last(t) as t_prev current=f window=2
| eval delta=t-t_prev | rename delta as requests
| fields _time, requests
| where _time<now()-120
This query takes advantage of several Splunk features to generate a visualization. It uses timechart and streamstats to build a delta across two-minute increments (span=2m). Summing the delta from the payload would make a simpler query, but a missed message would really throw off the graphs in that case. The subtraction at the end (now()-120 in the where clause) drops the last time bucket, as there’s nothing to calculate a difference against. After building several visualizations like this, I’ve really become a fan of Splunk’s search language.
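To show why the query differences running totals instead of summing per-message deltas, here is a simplified Python sketch of both approaches. It is not a port of the Splunk search: it assumes a single reporting instance, an invented time field in seconds, and made-up sample data with one message every 30 seconds.

```python
def per_bucket_from_totals(events, span=120):
    """Difference of per-bucket maxima of the running total."""
    bucket_max = {}
    for e in events:
        bucket = e["time"] - e["time"] % span
        bucket_max[bucket] = max(bucket_max.get(bucket, 0), e["total"])
    buckets = sorted(bucket_max)
    # The first bucket has no predecessor, so it gets no value.
    return {b: bucket_max[b] - bucket_max[prev]
            for prev, b in zip(buckets, buckets[1:])}


def per_bucket_from_deltas(events, span=120):
    """Sum of the per-message delta within each bucket."""
    sums = {}
    for e in events:
        bucket = e["time"] - e["time"] % span
        sums[bucket] = sums.get(bucket, 0) + e["delta"]
    return sums


if __name__ == "__main__":
    # Invented data: 10 requests counted in every 30-second message.
    sent = [{"time": t, "delta": 10, "total": 10 * (i + 1)}
            for i, t in enumerate(range(0, 360, 30))]
    # Simulate losing the message sent at t=150.
    received = [e for e in sent if e["time"] != 150]
    print(per_bucket_from_totals(received))  # {120: 40, 240: 40}
    print(per_bucket_from_deltas(received))  # {0: 40, 120: 30, 240: 40}
```

With the lost message, the delta-based sum under-reports the second bucket, while the total-based difference still recovers the right count because the next message carries the cumulative value.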
The full solution looks like this:
Pivotal Cloud Foundry Splunk Enterprise
+------------------+ BOSH Managed VMs +---------------+
| | +------------------+ | |
| | | +-------------+ | | |
| | | |Splunk heavy +--------> |
| | | | forwarder | | | |
| | | +-----^-------+ | | |
| Loggregator | | | | | |
| +-----+ | | | | +---------------+
| | +--+ | | +----+-----+ |
| | | +---------> Nozzle | |
| +-----+ | | | +----------+ |
| |------+ | | |
Pivotal Cloud Foundry aggregates logs and events, and ships them via the firehose, as described in the previous section.
To harvest events, the solution uses BOSH to deploy and manage a nozzle and a Splunk heavy forwarder, co-located on the same VMs. A full description of BOSH is outside the scope of this post, but in short, it’s a tool that:
- Provisions VMs on multiple IaaS providers
- Installs component software on those VMs
- Monitors component health
BOSH can scale out the nozzle/forwarder VM as needed, based on the size of the platform.
Co-locating a nozzle with the heavy forwarder enables several features:
- The forwarder buffers data during events like a network partition or a temporarily downed indexer
- The forwarder can securely forward data to an external Splunk deployment using SSL client authentication
- The nozzle parses and sends JSON to the local forwarder, so events can be richer than they might otherwise be with a solution like text file parsing. Metadata like application info can also be added.
- The nozzle only forwards locally, so we don’t have to add complexity for features around acknowledgement (as this is handled by the forwarder already)
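As a rough illustration of the JSON enrichment mentioned in the list above, the Python sketch below adds application metadata to an event before serializing it. This is not the nozzle's actual code: the function, the app_id, app_name and msg fields, and the lookup table are all hypothetical stand-ins.

```python
import json


def to_forwarder_event(envelope, app_names):
    """Return one JSON line for the local forwarder.

    `envelope` is a parsed Firehose event (a dict) and `app_names` maps a
    hypothetical app_id to a human-readable application name.
    """
    event = dict(envelope)  # copy so the caller's dict is not modified
    app_id = envelope.get("app_id")
    if app_id in app_names:
        # Enrich the event with metadata the raw message lacks.
        event["app_name"] = app_names[app_id]
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    names = {"abc-123": "billing-api"}  # invented lookup table
    print(to_forwarder_event({"app_id": "abc-123", "msg": "started"}, names))
    print(to_forwarder_event({"job": "router", "name": "total_requests"}, names))
```

Platform events that carry no application identifier pass through unchanged, so metrics and app logs can share the same path.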
Do you use the open source version of Cloud Foundry? The Splunk nozzle is easy to run locally: check out the open source nozzle code and test it out. The full, BOSH-managed solution is also available as open source.
We’ve also built a Splunk Add-on for Cloud Foundry which includes pre-built panels that operators can use as a starting point to build dashboards for their installation, in addition to the sample operational dashboard shown above.
For Pivotal Cloud Foundry operators, the tile is currently in closed beta. Contact your account manager if you’re interested in trying out the MVP.
For technical questions or feedback, feel free to contact me or my Splunk counterpart (Roy Arsan).