AWS is an expansive platform, with over 175 different types of cloud services available, according to Amazon. Monitoring the performance of so many different types of services may seem like a daunting task.
But the reality is that the vast majority of AWS workloads rely on a core set of services: EC2 (the compute service), S3 (object storage), EBS (block storage), and ELB (load balancing). For the typical organization, these services are the foundational building blocks of an AWS environment.
Thus, monitoring AWS starts with monitoring these core AWS services. This article offers a primer on which metrics to collect for these services.
EC2 Metrics to Monitor
The EC2 compute service lets you run virtual machines in the AWS cloud. (There are a few bare-metal EC2 instance types available, too.) If you host any type of application in AWS, it’s likely that it runs on EC2. Even if you host it in a service like EKS (the AWS Kubernetes platform), it’s still running on an EC2 instance in most cases.
There are three key metrics to track for each EC2 instance:
- CPUUtilization The percentage of the instance’s allocated compute capacity that is currently in use. If this metric exceeds about 80 percent for more than a brief period, you’ll want to investigate whether you need to increase the CPU capacity allocated to your workload, or whether a problem with your application is causing excessive CPU usage.
- DiskReadOps The total completed read operations by the EC2 instance in a given period of time. Note that this metric covers instance store volumes only; for EBS-backed storage, use the EBS metrics described later in this article. When this metric deviates from the historical baseline average, it could be a sign that something is wrong with the application running inside the instance.
- DiskWriteOps The total completed write operations by the EC2 instance in a given period of time. Like spikes in DiskReadOps, DiskWriteOps data that deviates from the norm could signal an application problem.
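To make the "more than a brief period" rule for CPUUtilization concrete, here is a minimal sketch in Python. It assumes you have already pulled a series of CPUUtilization averages (for example, one value per five-minute period) from your monitoring tool. The function names, the sample values, and the choice of three consecutive periods are illustrative assumptions, not AWS recommendations.

```python
from typing import Sequence


def longest_run_above(values: Sequence[float], threshold: float) -> int:
    """Return the length of the longest run of consecutive values above threshold."""
    longest = current = 0
    for value in values:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def cpu_needs_attention(cpu_percent: Sequence[float],
                        threshold: float = 80.0,
                        min_consecutive: int = 3) -> bool:
    """Flag an instance only when CPU stays above the threshold for several
    consecutive periods, so that a single short spike is ignored."""
    return longest_run_above(cpu_percent, threshold) >= min_consecutive


# Hypothetical five-minute CPUUtilization averages for one instance.
samples = [42.0, 85.5, 61.0, 83.0, 88.2, 91.7, 79.9]
print(cpu_needs_attention(samples))
```

With these sample values, the single reading of 85.5 is ignored, but the three consecutive readings above 80 trigger the flag. The same run-length idea could be applied to DiskReadOps or DiskWriteOps against a baseline of your choosing.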
S3 Metrics to Monitor
Because S3 is simply an object storage service, there is not much that can go wrong with the service itself. Still, monitoring metrics from S3 can help identify potential problems with applications that read data from or write data to S3. The most important S3 metrics include:
- AllRequests The total requests made to an S3 bucket. For most types of workloads, request totals should be relatively consistent over time. If monitoring data reveals a change in request frequency that can’t be explained by a change in your workload, investigate to make sure that an application is not issuing unnecessary or redundant requests. Not only could excessive requests signal an application problem, but they will also increase your bill, because AWS charges for each request.
- ReplicationLatency If you replicate S3 data between multiple storage buckets in order to increase resiliency, this metric tells you how far, in seconds, the destination bucket lags behind the source bucket. High replication latency could leave you at risk of data that is out of sync between mirrored buckets, which could lead to issues in an application that is trying to access multiple copies of the data at the same time.
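As a rough illustration of how you might collect AllRequests data, the sketch below builds the parameters for a CloudWatch query in Python. It assumes that request metrics have been enabled on the bucket (they are opt-in) and that the metrics configuration uses a filter ID, shown here as the placeholder "EntireBucket". The helper name and the bucket name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def s3_all_requests_query(bucket: str, filter_id: str,
                          end: datetime, hours: int = 24) -> dict:
    """Build parameters for hourly AllRequests totals over the last `hours` hours."""
    return {
        "Namespace": "AWS/S3",
        "MetricName": "AllRequests",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,          # one datapoint per hour
        "Statistics": ["Sum"],   # total requests in each hour
    }


# Hypothetical bucket and filter names.
params = s3_all_requests_query(
    "example-bucket", "EntireBucket",
    end=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(params["StartTime"], params["EndTime"])

# With boto3 installed and credentials configured, the dictionary could be
# passed to CloudWatch like this:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

Keeping the query as a plain dictionary makes it easy to review or log before any API call is made. Comparing the hourly sums against previous days is one simple way to spot the unexplained changes in request frequency described above.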
EBS Metrics to Monitor
EBS is Amazon’s solution for workloads that require block-level storage instead of object storage. EBS volumes tend to be especially important as storage for EC2 instances.
To ensure the health and performance of your EBS volumes, monitor these metrics:
- Volume status AWS performs health checks on EBS volumes at five-minute intervals and returns a status in the form of one of the following: OK, warning, impaired, or insufficient-data. If volume status is something other than OK, you should investigate.
- VolumeReadOps The total read operations in a set period of time. Like EC2 DiskReadOps, this metric can signal a problem with your application if it deviates from the baseline without a clear reason.
- VolumeWriteOps Total write operations in a set period of time. Use alongside VolumeReadOps to detect anomalous behavior that requires further research.
- VolumeTotalReadTime The total time (in seconds) that read operations required to complete in a set period of time. Slow read time could reflect inefficient requests by your application, or a problem with the EBS service itself. Or, your EBS volume type may simply lack the necessary I/O capacity; moving to a different type (like an SSD-backed volume) will likely provide better performance.
- VolumeTotalWriteTime Like VolumeTotalReadTime, but for write operations.
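The total-time metrics are easier to interpret when combined with the operation counts. Dividing VolumeTotalReadTime by VolumeReadOps for the same period gives an average latency per read, and the same applies to writes. Here is a minimal sketch in Python; the function name and the sample figures are illustrative assumptions.

```python
from typing import Optional


def average_latency_ms(total_time_seconds: float,
                       operations: float) -> Optional[float]:
    """Average time per operation in milliseconds for one period.

    Returns None when no operations completed, because an average is
    undefined for an idle volume.
    """
    if operations <= 0:
        return None
    return total_time_seconds * 1000.0 / operations


# Hypothetical values for a single five-minute period:
# VolumeTotalReadTime = 1.5 seconds, VolumeReadOps = 3000 operations.
read_latency = average_latency_ms(1.5, 3000)
print(read_latency)  # 0.5 (milliseconds per read)
```

Tracking this derived value over time can help distinguish a volume that is simply busy (more operations, steady latency) from one that is struggling (latency rising for each operation).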
ELB Metrics to Monitor
ELB, AWS’s load balancing service, offers several types of load balancers that distribute application traffic across different EC2 instances. To ensure that ELB is properly allocating requests between the various instances in your environment, monitor the following metrics:
- RequestCount The total requests that ELB handles in a set period. While it’s natural for request counts to vary as demand for your application ebbs and flows, sudden spikes or decreases that are inconsistent with historical traffic patterns at a certain time of day or day of the week could signal a problem like the inability of users to reach your application.
- Latency A measure of the time it takes for one of your instances to start the response to a request from ELB. High latency could be a sign of problems such as an issue with the network or an under-provisioned EC2 instance that is struggling to handle all of its requests.
- UnHealthyHostCount ELB performs health checks on instances and uses this metric to count those that it deems to be unhealthy, meaning that they are not ready to handle requests. Monitor this metric to ensure you always have enough healthy instances to handle application demand.
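Comparing RequestCount with historical traffic for the same time of day and day of the week can be sketched as follows. This Python example assumes you already have hourly RequestCount totals with timestamps. The function names, the 50 percent tolerance, and the sample data are illustrative assumptions rather than recommended values.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hour_of_week_baseline(history):
    """Average request count for each (weekday, hour) slot in the history."""
    buckets = defaultdict(list)
    for timestamp, count in history:
        buckets[(timestamp.weekday(), timestamp.hour)].append(count)
    return {slot: mean(counts) for slot, counts in buckets.items()}


def is_anomalous(timestamp, count, baseline, tolerance=0.5):
    """True when count differs from the slot's average by more than tolerance."""
    expected = baseline.get((timestamp.weekday(), timestamp.hour))
    if not expected:
        return False  # no usable history for this slot, so nothing to compare
    return abs(count - expected) / expected > tolerance


# Hypothetical hourly totals for three consecutive Mondays at 09:00.
history = [
    (datetime(2024, 1, 1, 9), 1000),
    (datetime(2024, 1, 8, 9), 1100),
    (datetime(2024, 1, 15, 9), 900),
]
baseline = hour_of_week_baseline(history)

print(is_anomalous(datetime(2024, 1, 22, 9), 300, baseline))
print(is_anomalous(datetime(2024, 1, 22, 9), 1200, baseline))
```

In this sample, a Monday 09:00 total of 300 is 70 percent below the slot average of 1,000 and is flagged, while 1,200 falls within the tolerance. A sudden drop like the first case is the kind of change that could indicate users are unable to reach your application.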
The metrics and services described above are only the tip of the iceberg for AWS service monitoring. There are a number of other metrics for each of these four services that you may wish to track, depending on your needs. There are also a variety of other AWS services that you may need to monitor, depending on which services you use to run your workloads.
As noted above, however, the four services we’ve covered here tend to account for the majority of AWS workloads, and the metrics we’ve discussed are generally the most important for those services. That makes them a good place to start as you build an AWS monitoring strategy.
Find out more about how Splunk Observability Cloud works with AWS to get you meaningful insights. Watch this video on monitoring AWS workloads with Splunk Infrastructure Monitoring and sign up for your free trial today.
This is a guest blog post from Chris Tozzi, Senior Editor of content and a DevOps Analyst at Fixate IO. Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure, and networking. This posting does not necessarily represent Splunk's position, strategies, or opinion.