Software monitoring, how does it work?
“We paid for a bunch of tools but we don’t know what we should be looking at. There are tons of charts that don’t seem to mean anything!”
If you talk to people about software monitoring, you’ve inevitably heard something like this. With so many possible metrics, it can feel like searching for a needle in a haystack. Even with curated dashboards, there is inherent confusion about what is important. A great way to get started is to apply the four “Golden Signals” of Latency, Errors, Traffic, and Saturation (L.E.T.S.). These four concerns provide a fairly generic framework you can use to understand your software and infrastructure.
But they can also be applied to non-software related scenarios! Interested? Read on!
Let’s Talk About L.E.T.S., Baby!
Let’s create a hypothetical non-software example to illustrate the power of the Golden Signals! Imagine you run a busy restaurant. The restaurant seems to be doing really well, but you don’t quite know where to look to make improvements or cut costs, so you decide to start measuring. How do you decide on what to measure? Applying L.E.T.S. you might be concerned about:
- Latency: How long does it take to get food to a customer?
- Errors: How often are we unable to make a meal or have to comp one?
- Traffic: How many customers are we taking in (and when)?
- Saturation: How many meals can employees actually complete and serve at the same time?
Monitoring these concerns would allow you to make informed decisions on scaling aspects of your business and the impact of any changes.
Latency metrics will help you decide if you need to hire more cooks, servers, or upgrade equipment.
Errors will help you measure improvements from better training, staffing, and equipment.
Traffic helps you understand how much staff you need, when you need them most, and when you can schedule fewer. Measuring customer traffic may even help you decide when it is time to expand!
Saturation can help uncover scheduling deficiencies, issues preparing certain popular dishes in parallel, and other unknown efficiency gaps.
These are all things you may have been able to guess at as a restaurant owner, but without measuring them, how would you know for sure?
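To make the restaurant analogy concrete, here is a minimal sketch that computes all four signals from one shift’s data. The order records, in-flight count, and kitchen capacity are entirely made-up numbers for illustration:

```python
from statistics import median

# Hypothetical order records: (minutes_to_serve, was_comped)
orders = [
    (12, False), (18, False), (25, True), (9, False),
    (31, True), (14, False), (22, False), (11, False),
]

# Latency: how long it takes to get food to a customer
latencies = [minutes for minutes, _ in orders]
print("Median serve time:", median(latencies), "minutes")

# Errors: fraction of meals we had to comp
error_rate = sum(1 for _, comped in orders if comped) / len(orders)
print(f"Comp rate: {error_rate:.0%}")

# Traffic: customers served in the measurement window
print("Orders this shift:", len(orders))

# Saturation: open orders vs. what the kitchen can cook at once
KITCHEN_CAPACITY = 10   # assumed max parallel meals
in_flight = 7           # assumed current open orders
print(f"Kitchen saturation: {in_flight / KITCHEN_CAPACITY:.0%}")
```

The same four calculations apply unchanged once “orders” become requests and “comps” become failed responses.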
These basic concepts provide a basis for understanding complex systems in general, like our imaginary restaurant. But where they really shine is in monitoring complex software architectures!
In the age of microservices, specific domain knowledge of every element of a software system may be impractical. Applying the concept of L.E.T.S. can provide the foundation for basic troubleshooting of where issues arise in a complex system.
An IT analyst who isn’t an expert on a given service can use Latency, Errors, Traffic, and Saturation to more readily identify issues in connected systems:
- “Latency appears to be much higher than normal to the database. Is that DB on-prem or in us-east-1?”
- “Errors are spiking after that last deployment. We should roll back.”
- “Traffic has totally dropped off at the load balancer! Did our cert expire?”
- “Saturation seems to be increasing more quickly than usual and we’ll run out of storage soon.”
This sort of foundational knowledge allows us to quickly check known points of failure before diving down rabbit holes. Not sure if you’re already measuring these sorts of things? Keep reading!
Figure 1-1. Splunk APM highlighting the L.E.T.S. metrics produced from Checkout to Payment in Hipster Shop. That 15% error rate is something we should look into!
L.E.T.S. Get It Together!
Now that you have a conceptual framework for a minimum set of four metrics, where do you get them? Distributed Tracing, at its very core, is about the latency, errors, and traffic of requests traversing a system. When you feed your tracing data (sometimes called APM data) into a solution like Splunk APM, you’ll start to get those metrics right away! Easy peasy. But that still leaves saturation.
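Before we get to saturation, here’s a rough illustration of how latency, errors, and traffic fall out of trace data. The span shape and field names below are made up for the sketch, not Splunk’s actual data model — real tracing backends do this aggregation server-side:

```python
# Hypothetical spans: each records a duration and an error flag.
spans = [
    {"service": "checkout", "duration_ms": 120, "error": False},
    {"service": "checkout", "duration_ms": 340, "error": True},
    {"service": "checkout", "duration_ms": 95,  "error": False},
    {"service": "payment",  "duration_ms": 610, "error": False},
]

def lets_from_spans(spans, window_seconds=60):
    """Aggregate Latency, Errors, and Traffic from a batch of spans."""
    durations = sorted(s["duration_ms"] for s in spans)
    return {
        "latency_p50_ms": durations[len(durations) // 2],  # crude median
        "error_rate": sum(s["error"] for s in spans) / len(spans),
        "requests_per_sec": len(spans) / window_seconds,
    }

print(lets_from_spans(spans))
```

Because every request already flows through the tracer, these three signals come “for free” — no extra instrumentation per service.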
Saturation is a bit more up to your software and design decisions. Consider a couple of examples of saturation:
- Is your software CPU bound? Does it rely on a certain amount of CPU power being available at a given time?
- What about memory? Would increasing memory usage cause your software to crash due to an Out Of Memory (OOM) kill and cause failures?
- Is storage your concern? Maybe a DB, network disk, or even local disk is at risk of filling up?
- Are you running enough hosts (containers/VMs/etc) to service all of your traffic?
The answers to some of the above are likely “no” for any given application in your environment. But taking the time to think about that and map out where resource constraints and saturation may cause failures will help reduce chart clutter and increase troubleshooting speed. Knowing the “known knowns” will help you start to focus on the issues at hand and reduce side tracking.
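As a sketch of what a couple of these saturation checks might look like using only the Python standard library — the 85% threshold is an illustrative assumption, and load average is Unix-only:

```python
import os
import shutil

# Storage saturation: how full is the disk holding our data?
usage = shutil.disk_usage("/")
disk_pct = usage.used / usage.total
print(f"Disk saturation: {disk_pct:.0%}")

# CPU saturation (Unix-only): 1-minute load average vs. core count.
# A sustained ratio above 1.0 means work is queuing for CPU.
load1, _, _ = os.getloadavg()
cores = os.cpu_count() or 1
print(f"CPU saturation: {load1 / cores:.0%}")

# Illustrative alert threshold -- tune per service.
THRESHOLD = 0.85
if disk_pct > THRESHOLD:
    print("WARNING: disk nearing capacity")
```

In practice an agent collects and ships these on an interval; the point is that each check maps directly to one of the saturation questions above.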
If You Don’t Know, Now You Know!
So the answer to “what should we be monitoring?” is simple: L.E.T.S.! Look at the points of Latency, Errors, and Traffic between microservices, data centers, and even individual software components. Applying these methods across microservices that share common infrastructure patterns (e.g., JVMs running on EC2 and using DynamoDB, Python-based Cloud Functions with a Cloud SQL datastore, or any other repeatable combination) will also allow you to minimize things like dashboard and alert bloat. Imagine a single dashboard containing L.E.T.S. charts for each piece of commonly used infrastructure. By including a dimension like `servicename` across all of those metrics, that single dashboard can be easily filtered to quickly view a large swath of your microservices footprint. Alerts can be minimized similarly by focusing on the L.E.T.S. fundamentals and repeatable infrastructure patterns. But let’s save that story for another time.
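A quick sketch of that dimension idea — the metric names and `emit_metric` function below are hypothetical stand-ins, not any specific vendor’s API. The point is that every service emits the same metric names, and only the `servicename` dimension varies:

```python
def emit_metric(name, value, dimensions):
    """Stand-in for a real metrics client's send call (hypothetical API)."""
    tags = ",".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    line = f"{name}{{{tags}}} = {value}"
    print(line)
    return line

# Same metric names everywhere; only `servicename` changes, so one
# dashboard filtered on that dimension covers the whole fleet.
for service, latency_ms, error_count in [("checkout", 120, 2), ("payment", 610, 0)]:
    dims = {"servicename": service, "env": "prod"}
    emit_metric("request.latency_ms", latency_ms, dims)
    emit_metric("request.error_count", error_count, dims)
```

Adding a new service then means emitting the same metrics with a new dimension value — no new dashboards or alerts required.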
Whether you’re a seasoned Splunk Observability user, just starting a trial, or just thinking about getting your feet wet, keep these principles in mind and you’ll quickly be on your way to greater observability into your software and infrastructure!
You can sign up to start a free trial of the Splunk Observability Cloud suite of products today!