Amazon’s Elastic Compute Cloud (EC2) is one of the most popular products on Amazon Web Services (AWS), used by 84% of companies on AWS according to 2nd Watch’s AWS Scorecard. Amazon EC2 provides the foundation for many organizations’ cloud strategies, enabling teams to allocate compute resources rapidly and easily meet demand at both high and low points for truly web-scale performance. Companies like Stormpath and Tapjoy rely on EC2 to run their production systems in an efficient and reliable way.
However, monitoring all the metrics for a production compute cluster remains a significant challenge. Despite EC2’s resilience and elasticity, you still need to closely track capacity, predictability, and interdependence with other services and infrastructure. Similar to our review of Amazon RDS top metrics, here are the top indicators you should monitor for insights into availability, performance, and cost.
You need to know quickly when there’s an outage in your production servers. Outages can cause degraded user experience and, potentially, lost revenue.
Instance State and Health
If you have several instances in your production cluster, you should be aware of whether each instance is healthy or not. Amazon has the ability to track if each instance is in the running state, as shown in the screenshot below. A better health indicator will say if instances are responding to requests in an expected time and without errors.
Amazon performs status checks on all EC2 servers by default. They come in two flavors: system status checks and instance status checks. System status checks monitor conditions that require AWS’s involvement to fix, including loss of network connectivity and hardware issues. Instance status checks monitor conditions that you need to fix yourself, including exhausted memory and a corrupt file system. The best practice is to set a status check alarm to notify you when a status check fails.
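As a rough illustration of such an alarm, the sketch below builds the settings for a CloudWatch alarm on the StatusCheckFailed metric, which reports 1 when either the system or the instance check fails. The function name, the alarm naming scheme, the 60-second period, and the two-period evaluation window are assumptions chosen for the example, not recommendations from AWS.

```python
def status_check_alarm_params(instance_id, sns_topic_arn):
    """Build keyword arguments for a CloudWatch alarm on failed status checks.

    The alarm name, period, and evaluation window are illustrative assumptions.
    """
    return {
        "AlarmName": f"status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        # StatusCheckFailed is 1 if either the system or instance check failed.
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,  # assumed: two failing windows in a row
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that notifies your team
    }
```

If you use boto3, a dictionary like this could be passed as keyword arguments to the CloudWatch client's put_metric_alarm call. Building the parameters separately keeps them easy to review and test before anything is created in your account.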
Servers can become unavailable when the resources that they need to support clients are exhausted. For example, web servers can become unresponsive when they lack sufficient CPU or memory to respond before timing out. If a server does not even have enough memory to support an incoming SSH connection, you will not be able to access it through a remote terminal. In this case, you’ll need to do a hard reboot, which risks losing the system state. A good monitoring system will store metrics from the instance and can show you an increase in its resource usage until eventually hitting a ceiling and becoming unavailable.
System errors can also cause your instance to become unavailable or fail a status check. You can find these errors listed inside your system log file, which is often in /var/log/syslog or /var/log/messages. You might see problems with boot up or kernel errors. You can aggregate these logs to Amazon CloudWatch Logs by installing their agent, or you can use syslog to forward the logs to some other central location.
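Before you set up central aggregation, a quick scan of the log file can already surface the obvious failures. The sketch below filters log lines against a small set of error patterns. The patterns and the function name are assumptions for illustration, and you would want to tune them for your distribution and kernel.

```python
import re

# Assumed patterns for common failure messages; adjust for your systems.
ERROR_PATTERN = re.compile(
    r"kernel panic|out of memory|oom-killer|i/o error|segfault",
    re.IGNORECASE,
)


def find_error_lines(lines):
    """Return the log lines that match any of the assumed error patterns."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]
```

You could call it as find_error_lines(open("/var/log/syslog")) on a box where that file exists, or run the same filter over lines pulled from your central log store.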
Sometimes, especially in large teams, people can make changes that can impact service availability. For example, a misconfigured security group could make your instance unreachable, or an auto-scaling script could accidentally remove too many instances and make your service unavailable. You can find a record of all the changes made using the AWS API or console through the CloudTrail audit log. If you suspect an outage is due to changes made by a person or automated tool, this is the first place to look.
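To make that first look faster, you can filter the audit log down to the API calls most likely to affect availability. The sketch below reads a CloudTrail log document, which stores events in a top-level Records list, and keeps only events on a watch list. The watch list and the function name are assumptions for this example, so extend them to match the changes you care about.

```python
import json

# Assumed watch list of EC2 API calls that can affect availability.
RISKY_EVENTS = {
    "TerminateInstances",
    "StopInstances",
    "RevokeSecurityGroupIngress",
    "AuthorizeSecurityGroupIngress",
    "ModifyInstanceAttribute",
}


def risky_changes(cloudtrail_json):
    """Return (time, event name, caller ARN) for watched events, oldest first."""
    records = json.loads(cloudtrail_json).get("Records", [])
    hits = []
    for record in records:
        if record.get("eventName") in RISKY_EVENTS:
            who = record.get("userIdentity", {}).get("arn", "unknown")
            hits.append((record.get("eventTime", ""), record["eventName"], who))
    # eventTime is an ISO 8601 string, so sorting as text sorts by time.
    return sorted(hits)
```

Comparing the timestamps in the result against the start of an outage is usually enough to confirm or rule out a change made by a person or tool.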
Performance impacts user experience and, therefore, impacts earnings. For web applications, slow page load times will lead users to abandon the page, some never to return. The performance of the server you’re running on underlies and determines application performance. It’s important to continuously monitor system performance because it is often most affected during bursts of activity or periods of peak demand. You’ll also want to drill down into performance data across several dimensions to determine the root cause of problems and address bottlenecks.
Turning on CloudWatch metrics monitoring for your instances gives you access to the data you need to optimize performance, as shown in the screenshot below:
CPU usage is shown as a percentage for each instance. If you are hitting a CPU limit on a burstable instance, such as one in the T2 family, check your CPU credit balance as shown in the CloudWatch dashboard above. Additionally, you should check the CPU steal time to see whether other VMs on the same physical host are using up the CPU. CloudWatch doesn’t store this metric, so you’ll need to check it on the box, either directly or using an agent such as collectd. If you find consistently high steal time, you might benefit from moving to a different instance whose host might be more lightly loaded. If that doesn’t produce appreciable results, consider paying for a dedicated instance.
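If you want to check steal time directly on a Linux box, one way is to compare two readings of /proc/stat. The sketch below computes the share of CPU time reported as steal between two samples. The function names are made up for this example, and the sampling interval is left to you.

```python
def cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before_text, after_text):
    """Percentage of CPU time spent in steal between two /proc/stat samples."""
    old, new = cpu_times(before_text), cpu_times(after_text)
    deltas = [n - o for o, n in zip(old, new)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already included in user and nice, so skip them.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0
```

In practice you would read /proc/stat, wait a few seconds, read it again, and pass both strings in. A single reading only gives totals since boot, which hides recent spikes.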
CloudWatch makes your disk throughput and I/O operations per second (IOPS) data available, but you’ll have to log directly into the box or use collectd to check your disk usage. Even if you are using less than the full disk space, you might be hitting a ceiling on IOPS or volume throughput.
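For a quick check of disk usage on the box itself, Python's standard library is enough. This is a minimal sketch with an assumed function name. Note that it divides used space by total space, so the figure can differ slightly from tools that exclude blocks reserved for root.

```python
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total
```

An agent would report this value on a schedule so you can alert well before the volume fills up.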
It’s easy to keep track of your network usage, including throughput and total number of packets. Instances have a limit on the throughput based on the instance type, so upgrading to a larger instance could improve your network performance. It’s also easy to overlook dependencies between resources. For example, using Elastic Block Store (EBS) volumes can also increase your network usage.
Monitoring system performance on EC2 is a good start, but the most important metrics often come from applications because these are most closely tied to user experience. Examining your system and application performance together and finding any correlation between the two can give you clues to the root cause of a problem and how to optimize your resource usage.
CloudWatch does not offer a way to monitor application performance by default. You have to instrument your application to output metrics, and then use an agent or collector to receive and process them. CloudWatch does offer a way to forward custom events, but it does not offer specific integrations for popular servers or applications. A better approach is to use a standard agent such as collectd, which offers the ability to monitor widely used technologies like Apache, Nginx, Java, and many more.
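To give a sense of what instrumenting an application can look like, here is a minimal sketch that times a block of code and reports latency percentiles. The LatencyRecorder class is hypothetical and keeps samples in memory. A real setup would hand these values to an agent or collector instead.

```python
import math
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Hypothetical in-memory recorder for request latencies."""

    def __init__(self):
        self.samples_ms = []

    @contextmanager
    def timed(self):
        """Time the enclosed block and store its duration in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def percentile(self, pct):
        """Nearest-rank percentile of the recorded latencies, in milliseconds."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]
```

You would wrap each request handler in "with recorder.timed():" and periodically report values such as recorder.percentile(95). Percentiles are used here rather than averages because a few slow requests can hide behind a healthy-looking mean.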
Moving to larger instances or adding more instances is an easy short-term solution to performance and availability issues. However, this can create cost problems over the long run. Even a small increase in cost can compound and add up over time. Moreover, changes in the business or application can change the need for those instances, resulting in them being forgotten as projects and employees move on, or being left as technical debt that directly affects the bottom line.
When you hit a performance bottleneck, one of the easiest things to do is switch to a larger instance. The reality is that you can only do this so many times before reaching the largest instance size on offer, or spending so much it affects your operating margin.
Thankfully, you can save money by using lower-cost purchasing options, such as reserved instances or spot instances, instead of paying on-demand rates. Additionally, you may be able to use smaller instances by scaling horizontally instead of vertically. Keep an eye on how many instances you have of each type, and make sure that those expensive on-demand instances don’t stick around longer than they have to.
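One way to keep that count is to tally running instances by type and purchase option. The sketch below works on a dictionary shaped like the response of the EC2 DescribeInstances API. The function name is an assumption, and the input is expected to be data you have already fetched.

```python
from collections import Counter


def count_by_type_and_lifecycle(describe_instances_response):
    """Count running instances per (instance type, lifecycle) pair."""
    counts = Counter()
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("State", {}).get("Name") != "running":
                continue
            # EC2 sets InstanceLifecycle only for spot and scheduled instances.
            # Everything else is labeled "on-demand" here, including instances
            # whose bill is covered by a reservation, which is a billing
            # discount rather than a property of the instance.
            lifecycle = instance.get("InstanceLifecycle", "on-demand")
            counts[(instance["InstanceType"], lifecycle)] += 1
    return counts
```

Tracking these counts over time makes it easier to spot on-demand instances that have outlived the project they were launched for.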
Scaling systems horizontally is another good way to grow. However, large clusters also take more operational effort to maintain because distributed systems are more complex. A common approach is to take advantage of Elastic Load Balancing (ELB) or Amazon Route 53 to distribute load across multiple instances and set up auto-scaling rules that add capacity as needed. [Note: ELB uses a proxy approach to load balancing, and Route 53 uses a DNS round robin approach.]
You can use scripts that automatically scale your infrastructure on-demand to match requirements. For example, if you hit a CPU threshold as monitored in CloudWatch, you can run the script, add instances, and update your proxies with the new node.
On the other hand, auto-scaling opens you up to serious risk if some kind of bug or error causes infrastructure to spin up continuously. For example, if you tie auto-scaling to your CPU usage, and a bug causes your services to spin at 100% CPU, you may wake up the next morning to find a big AWS bill. That’s why a reliable monitoring solution is so important: it can alert you to problems like this before it’s too late.
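A simple guard against runaway scaling is a hard cap on the number of instances. The sketch below shows a threshold-based scaling decision with minimum and maximum bounds. The function name, the thresholds, and the limits are all assumed values for illustration, and the logic is a simplification of what a real auto-scaling policy does.

```python
def desired_instance_count(
    current,
    avg_cpu_percent,
    scale_up_at=70.0,    # assumed CPU threshold for adding an instance
    scale_down_at=30.0,  # assumed CPU threshold for removing an instance
    min_instances=2,     # assumed floor that protects availability
    max_instances=10,    # assumed ceiling that protects the bill
):
    """Return the target instance count for one scaling step."""
    if avg_cpu_percent >= scale_up_at:
        target = current + 1
    elif avg_cpu_percent <= scale_down_at:
        target = current - 1
    else:
        target = current
    # Clamp so that a bug pinning CPU at 100% cannot scale out without limit.
    return max(min_instances, min(max_instances, target))
```

With these defaults, a cluster stuck at 100% CPU stops growing at ten instances, and the floor keeps a faulty scale-down from removing your last servers. Alerting when the count sits at the ceiling for a long time tells you whether you have a bug or a genuine need for more capacity.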
Network and Storage Costs
Network and storage add to the price you pay for an EC2 instance. While you can easily see your network usage for a specific instance in CloudWatch metrics, it’s not so easy to see this quantity aggregated across all your instances. It’s even more difficult to see storage usage because it’s not tracked in CloudWatch. This includes your usage based on the volumes for each instance, as well as your snapshots, and even volumes that are not attached to instances. Remember that using EBS-backed volumes will use network capacity, as well.
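If you export your volume and snapshot listings, you can total the storage yourself. The sketch below sums sizes from lists shaped like the Volumes and Snapshots entries returned by the EC2 DescribeVolumes and DescribeSnapshots APIs. The function name and result keys are assumptions for this example.

```python
def storage_summary(volumes, snapshots):
    """Total provisioned storage in GiB, split by attachment state."""
    attached = sum(v["Size"] for v in volumes if v.get("State") == "in-use")
    # Volumes in the "available" state are not attached to any instance,
    # but you still pay for them.
    unattached = sum(v["Size"] for v in volumes if v.get("State") == "available")
    # VolumeSize is the size of the source volume. Snapshots are incremental,
    # so treat this as an upper bound rather than the billed amount.
    snapshot_source = sum(s["VolumeSize"] for s in snapshots)
    return {
        "attached_gib": attached,
        "unattached_gib": unattached,
        "snapshot_source_gib": snapshot_source,
    }
```

The unattached total is often the quickest saving to find, since those volumes are easy to forget after an instance is terminated.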
Challenges of Monitoring
Amazon CloudWatch makes monitoring your EC2 instances much easier, but it’s not a complete solution. There are a variety of other data sources about your EC2 instances, including CloudWatch Logs, CloudTrail, and the Cost Explorer. You probably also want to track additional information from your instances, including from collectd, system logs, on-prem systems, and applications. Your ideal monitoring solution will offer integrations to aggregate data from many different sources.
Furthermore, many of the solutions that Amazon provides offer limited analytics support. In CloudWatch metrics, you can only plot the metrics it offers. It lacks the ability to derive or calculate new fields, to look for specific patterns of repeating behavior, or predict problems before they become critical. For CloudWatch Logs and CloudTrail, there is no visualization or analytics support.
Just last year, Amazon launched support for the Elasticsearch service with Kibana, which can help you analyze your log data. For real-time metrics monitoring, you’ll want to consider something more powerful but still easy to use like Splunk Infrastructure Monitoring.
Using Splunk Infrastructure Monitoring
Splunk Infrastructure Monitoring provides real-time cloud monitoring for CloudWatch metrics, your applications, and on-prem systems. It offers native integrations with metrics to monitor all of your microservices so you don’t have to build your own. It also offers analytics, allowing you to predict problems before they become critical, and provides meaningful alerts that you can act on right away.
In part two of this blog series, you’ll learn how Splunk Infrastructure Monitoring gives you insight into the current state of your Amazon EC2 instances so you can easily monitor availability and performance, as well as manage costs.