How to Monitor Your AWS Workloads

Observability August 25, 2021 Splunk

AWS is a comprehensive platform with more than 200 cloud services available globally. As organizations adopt these services, monitoring their performance can seem overwhelming.

Behind the scenes, most AWS workloads depend on a core set of services: EC2 (the compute service), EBS (block storage), and ELB (load balancing). For most organizations, these services form the foundation of their AWS deployments, so understanding how to monitor them is central to keeping workloads healthy.

This post breaks down the key steps for monitoring your AWS services with Splunk Infrastructure Monitoring and discusses a few key infrastructure metrics for the major AWS services.

Want to skip the examples and see for yourself? Start a free trial of Splunk Observability Cloud instantly, no credit card required.

To get started, there are a few prerequisites to be aware of. When connecting AWS to Splunk Observability Cloud, you must have an access token for the organization you want to get data into. With a free trial account, an access token named Default has already been created for you. Otherwise, see our docs page on creating and managing organization access tokens.

Once your prerequisites are in order, log into Splunk Observability Cloud and navigate to Data Setup. On the AWS Setup page, select New integration to open the AWS integration wizard. Click + Add Connection to configure an integration for one of your AWS accounts, then follow the four steps to create your connection. Although the wizard walks you through every step in detail, you can always check out the docs page on connecting to AWS for more information.

Once connected, Splunk Infrastructure Monitoring will enumerate all of your AWS services. Navigate to the Infrastructure page and select Amazon Web Services to see a list of all your AWS resources in a single pane of glass. Below is an example of my deployed services.

From here, we can quickly dive into each of these services and inspect their metrics to better understand how they are performing. Let’s drill down into some of these metrics.

EC2 Metrics

The EC2 compute service lets you run virtual machines in the AWS cloud. (There are a few bare-metal EC2 instance types available, too.) If you host any kind of application or service in AWS, it likely runs on EC2. Even if you host it in a service like EKS (the AWS Kubernetes platform), in most cases, it’s still running on an EC2 instance. Splunk Infrastructure Monitoring gives you an overview of all your EC2 instances, color-coded by key metrics, and covers your Kubernetes deployments with Kubernetes Navigator. Below is an example of how your EC2 instances are shown and the color-coded filter options available for you to choose from.

You can also group EC2 instances by common attributes such as region, state, OS type, and more. Here is an example that surfaces an instance with high CPU utilization (instance ID omitted), showing how quickly you can identify problematic instances.

While there are many metrics to choose from, there are three key metrics to track for each EC2 instance.

CPU Utilization: The total number of CPU units used, expressed as a percentage of the total available. If this metric exceeds about 80 percent for more than a brief period, you’ll want to investigate whether you need to increase the CPU capacity allocated to your workload. Alternatively, there may be a problem with your application that is causing excessive CPU usage.

Image showing current EC2 CPU percentage used.
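The "about 80 percent for more than a brief period" guideline can be sketched as a check for a run of consecutive high readings. This is only a minimal illustration, not how Splunk Infrastructure Monitoring evaluates detectors: the function name, the one-reading-per-minute assumption, the five-reading window, and the sample data are all hypothetical.

```python
from typing import Sequence


def sustained_breach(
    samples: Sequence[float],
    threshold: float = 80.0,   # the ~80 percent guideline from the text
    min_consecutive: int = 5,  # assumption: "more than a brief period" = 5 readings
) -> bool:
    """Return True if at least `min_consecutive` readings in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a high reading, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up CPU utilization readings, one per minute.
cpu_percent = [42.0, 85.1, 91.3, 88.7, 93.0, 86.4, 60.2]
print(sustained_breach(cpu_percent))
```

Requiring a run of readings rather than a single spike keeps short bursts, such as a deployment or a cron job, from being treated as a capacity problem.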

DiskReadOps: The total number of read operations completed by the EC2 instance in a given period of time. When this metric deviates from its historical baseline, it could signify that something is wrong with the application running inside the instance.
DiskWriteOps: The total number of write operations completed by the EC2 instance in a given period of time. As with DiskReadOps, DiskWriteOps data that deviates from the norm could signal an application problem.

Image showing current EC2 disk ops.
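One simple way to reason about "deviates from the historical baseline" for DiskReadOps or DiskWriteOps is to measure how many standard deviations the latest value sits from the historical mean. The sketch below assumes that approach; the function name, the three-sigma cutoff, and the sample numbers are illustrative and not taken from Splunk Infrastructure Monitoring.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(
    history: Sequence[float],
    current: float,
    max_sigma: float = 3.0,  # assumption: flag values more than 3 standard deviations out
) -> bool:
    """Return True if `current` is unusually far from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > max_sigma


# Made-up DiskReadOps totals for previous periods.
read_ops_history = [100, 110, 95, 105, 90]
print(deviates_from_baseline(read_ops_history, 104))
print(deviates_from_baseline(read_ops_history, 400))
```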

EBS Metrics

EBS is Amazon’s solution for workloads that require block-level storage. EBS volumes tend to be especially important as storage for EC2 instances. EBS monitoring with Splunk Infrastructure Monitoring follows the same workflow as EC2 monitoring. It starts with an overview map that color-codes all of your EBS volumes and lets you group them by common characteristics. If a problematic volume is identified in the overview map, you can quickly select it to drill down and gather specific information about it. Here is an example of the color-coded key metrics shown within the console.

This second example shows how we can easily find a problematic volume and drill down into the details.

To ensure the health and performance of your EBS volumes, be sure to stay aware of these metrics:

Volume State (aws_state): AWS performs health checks on EBS volumes and returns a status in the form of one of the following: creating, available, in-use, deleting, deleted, or error. If the volume state is error, investigate. Depending on the scenario, the available, deleting, and deleted states may also be worth investigating.

Image showing current aws_state of an EBS volume.
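The triage of volume states described above can be written down as a small lookup. This is a hedged sketch: the state names come from the list in the text, while the function name, the `expect_attached` flag, and the three outcome labels are hypothetical choices made for illustration.

```python
# The six volume states listed in the text.
KNOWN_STATES = {"creating", "available", "in-use", "deleting", "deleted", "error"}

# States the text says may deserve a look "depending on the scenario".
SCENARIO_DEPENDENT = {"available", "deleting", "deleted"}


def triage_volume_state(state: str, expect_attached: bool = True) -> str:
    """Map an aws_state value to 'investigate', 'review', or 'ok'.

    `expect_attached` is an assumption standing in for "the scenario":
    set it to True for volumes that should currently be serving an instance.
    """
    if state not in KNOWN_STATES:
        raise ValueError(f"unknown volume state: {state!r}")
    if state == "error":
        return "investigate"
    if expect_attached and state in SCENARIO_DEPENDENT:
        return "review"
    return "ok"


print(triage_volume_state("error"))
print(triage_volume_state("available"))
print(triage_volume_state("available", expect_attached=False))
```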

Total IOPS: This is the total number of read and write operations in a set period of time. Values well above your normal baseline can indicate application bottlenecks or a poor choice of storage. Below is an example of a chart showing the total IOPS, separated into read and write operations and updated every minute, for an EBS volume attached to an EC2 instance hosting a front-end microservice. This particular release of the microservice contains new static content, so a higher VolumeReadOps was expected. The earlier, higher VolumeWriteOps was caused by an excessive logging level, which resulted in heavy log writes to the volume. The lower VolumeWriteOps after the release suggests that the release fixed this problem.
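As a worked example of the arithmetic, per-period operation counts such as VolumeReadOps and VolumeWriteOps can be turned into an average operations-per-second figure by dividing by the length of the period. The sketch below assumes one-minute periods, matching the chart described above; the function name and the sample counts are made up.

```python
def total_iops(read_ops: float, write_ops: float, period_seconds: float = 60.0) -> float:
    """Average I/O operations per second over one reporting period.

    `read_ops` and `write_ops` are the operation counts for the period
    (for example, VolumeReadOps and VolumeWriteOps summed over one minute).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops + write_ops) / period_seconds


# Made-up counts for a single one-minute period.
print(total_iops(read_ops=4200, write_ops=1800))
```

Keeping reads and writes as separate inputs mirrors the chart, where a change in one side (static content reads, or log writes) can be told apart from a change in the other.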

Average Queue Length: Volume queue length is the number of I/O requests waiting to be serviced by the volume. A queue that stays long usually shows up as higher latency, which is the time elapsed between sending an I/O to EBS and receiving an acknowledgment from EBS that the read or write is complete. Sustained high latency on an EBS volume may indicate that a better-suited volume type, such as an SSD-backed volume, is needed.

Image showing current average queue length of an EBS volume.

ELB Metrics

ELB, AWS’s load balancing service, offers several types of load balancers that distribute application traffic across different EC2 instances. ELB monitoring with Splunk Infrastructure Monitoring follows the same workflow as EC2 and EBS monitoring. An overview map color-codes all of your Elastic Load Balancers by key metrics and lets you group them by common characteristics. If a problematic load balancer is identified in the overview map, you can quickly select it to drill down and gather specific information about it, just as you can with EC2 instances and EBS volumes. Below is an example.

This second example shows how you can quickly drill down into a specific load balancer for detailed information (ELB ID omitted).

To ensure that ELB is properly allocating requests between the various instances in your environment, be sure to monitor the following metrics:

Request Count: This is the total number of requests that ELB handles in a set period. While it’s natural for request counts to vary as demand for your application ebbs and flows, sudden spikes or drops that are inconsistent with historical traffic patterns for a specific time of day or day of the week could signal a problem, such as users being unable to reach your application.

Image showing total routed requests per minute for a given ELB.
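Comparing request counts against "historical traffic patterns at a specific time of day or day of the week" can be sketched by bucketing past counts by hour of the week and comparing a new count with the median for its bucket. This is only one possible approach under stated assumptions: the helper names, the 50 percent tolerance, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, Iterable, Tuple


def hour_of_week(ts: datetime) -> int:
    """Slot index from 0 (Monday 00:00) to 167 (Sunday 23:00)."""
    return ts.weekday() * 24 + ts.hour


def build_baseline(history: Iterable[Tuple[datetime, float]]) -> Dict[int, float]:
    """Median request count for each hour-of-week slot seen in `history`."""
    buckets = defaultdict(list)
    for ts, count in history:
        buckets[hour_of_week(ts)].append(count)
    return {slot: median(counts) for slot, counts in buckets.items()}


def is_unusual(ts: datetime, count: float, baseline: Dict[int, float],
               tolerance: float = 0.5) -> bool:
    """True if `count` differs from its slot's median by more than `tolerance` (as a fraction)."""
    expected = baseline.get(hour_of_week(ts))
    if not expected:
        return False  # no usable history for this slot, so nothing to compare against
    return abs(count - expected) / expected > tolerance


# Made-up hourly request counts for three consecutive Mondays at 09:00.
history = [
    (datetime(2021, 8, 2, 9), 1000),
    (datetime(2021, 8, 9, 9), 1100),
    (datetime(2021, 8, 16, 9), 900),
]
baseline = build_baseline(history)
print(is_unusual(datetime(2021, 8, 23, 9), 1050, baseline))
print(is_unusual(datetime(2021, 8, 23, 9), 200, baseline))
```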

Latency: A measure of the time it takes for one of your instances to start the response to a request from ELB. High latency could be a sign of problems such as an issue with the network or an under-provisioned EC2 instance struggling to handle all of its requests.

Image of the average latency of a given ELB.

Unhealthy Host Count: ELB performs health checks on instances and uses this metric to count those that it deems unhealthy, meaning that they are not ready to handle requests and may be down. Monitor this metric to ensure you always have enough healthy instances to handle application demand.

Image showing the number of healthy and unhealthy hosts on a given ELB.
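The idea of not running short of healthy instances can be sketched as a check on both the absolute number of healthy hosts and their share of the total. The thresholds below (at least two healthy hosts and at least half the fleet healthy) are assumptions chosen for illustration, as is the function name.

```python
def healthy_capacity_ok(
    healthy: int,
    unhealthy: int,
    min_healthy: int = 2,       # assumption: keep at least two hosts in service
    min_fraction: float = 0.5,  # assumption: at least half of the hosts healthy
) -> bool:
    """Return True if the load balancer still has enough healthy hosts."""
    if healthy < 0 or unhealthy < 0:
        raise ValueError("host counts cannot be negative")
    total = healthy + unhealthy
    if total == 0:
        return False  # no registered hosts means no capacity at all
    return healthy >= min_healthy and healthy / total >= min_fraction


print(healthy_capacity_ok(healthy=4, unhealthy=1))
print(healthy_capacity_ok(healthy=1, unhealthy=3))
```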

The metrics and services described are just the tip of the iceberg for AWS service monitoring. Depending on your deployment, you may wish to track several other metrics for each service, such as cloud spend.

Cloud Spend

If you are not sure how much you are spending on AWS, Splunk Infrastructure Monitoring can also help. With the AWS Optimizer, you can quickly identify cost-saving opportunities. I recommend checking out our docs page on the AWS Optimizer and this excellent blog by Greg on How to Optimize Your Cloud Spend Using Observability, where you can find examples of the AWS Optimizer in action.

So, wherever you are within your cloud journey, Splunk Infrastructure Monitoring can help. Be sure to get a clear understanding of what’s going on with your infrastructure. Want to learn more about how Splunk Observability Cloud works with AWS to bring you meaningful insights? Watch this video on monitoring AWS workloads with Splunk Infrastructure Monitoring and sign up for your free trial today.

----------------------------------------------------
Thanks!
Johnathan Campos
