At SignalFx, we are known for our real-time streaming analytics. However, not everyone has moved to high-frequency monitoring of their infrastructure and applications; it tends to be a gradual process. AWS and CloudWatch are great examples. At reporting frequencies of one minute at best, and often five minutes or more, CloudWatch data is not suited to real-time analysis. However, that does not diminish the importance of CloudWatch metrics, because they provide a valuable multi-dimensional dataset that is well documented, of critical importance to many users, and sometimes the only source of metrics for many AWS services.
Many of our customers have expressed a desire for better visualization and analytics on CloudWatch metrics than what is available natively in CloudWatch itself. That got us thinking about how to provide a better solution.
While it is easy enough to sync all CloudWatch data as well as metadata (tags) to SignalFx simply by entering credentials, could we dramatically improve the CloudWatch experience for our customers? How could we use the rich dimensionality of the dataset as well as the customer’s own AWS tags to improve the experience even further? How could we make CloudWatch easy to use for common operational monitoring workflows with a minimal burden on users and minimal need for configuration?
The Ingredients Necessary for a Solution
SignalFx was built from the ground up to address modern monitoring challenges. As a result, many of its features are a natural match for the task:
- Multi-dimensional timeseries: In SignalFx, any timeseries can have an arbitrary set of dimensions. This fits well with the CloudWatch model, which also uses dimensions.
- Flexible metadata: An important part of our platform, along with our timeseries store, is our metadata store. The metadata store allows users to apply tags and properties to their metrics easily and programmatically. In the case of CloudWatch, this fits perfectly with the AWS tags that customers apply to their service instances.
- Powerful analytics: The SignalFlow™ analytics engine, designed for streaming real-time data, works equally well with non-real-time data. In the case of CloudWatch, it allows us to build useful and interesting views of the data, such as compound metrics built by combining multiple individual ones (e.g. cache hit ratio from cache hits and cache misses), comparisons of a metric against the previous day or week using the TimeShift feature, or percentile distributions of a metric (e.g. p10, p50, p90 of latency) across a population of instances.
- Customizable visualizations: Not only does SignalFx support a large number of interesting chart and visualization types, but they are also highly customizable. This ability to finely customize the display of metrics is key to making the dataset easily consumable by users.
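To make the analytics item above more concrete, here is a minimal sketch in plain Python (not SignalFlow, and not how SignalFx implements it) of two of the computations mentioned: a compound cache hit ratio and a time-shifted copy of a series for comparison against past values. The function names and sample values are hypothetical.

```python
from typing import List, Optional


def hit_ratio(hits: List[float], misses: List[float]) -> List[Optional[float]]:
    """Combine two series point by point into hits / (hits + misses)."""
    ratios: List[Optional[float]] = []
    for h, m in zip(hits, misses):
        total = h + m
        # No requests in the interval: the ratio is undefined, not zero.
        ratios.append(h / total if total else None)
    return ratios


def timeshift(series: List[float], offset: int) -> List[Optional[float]]:
    """Delay a series by `offset` points so position i holds the value from i - offset."""
    if offset <= 0:
        return list(series)
    # Pad the front with None where no earlier data exists.
    return [None] * min(offset, len(series)) + list(series[:-offset])


if __name__ == "__main__":
    print(hit_ratio([90, 80, 0], [10, 20, 0]))
    print(timeshift([1, 2, 3, 4], 1))
```

With five-minute CloudWatch data, an offset of 288 points would correspond to one day, assuming no gaps in the series.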
The eventual technical implementation was simple: enhancing the SignalFx catalog, which provides an easy interface for finding data and content, to show pre-curated dashboards. Let’s go over the main aspects of our new CloudWatch experience and the underlying principles we used to build it.
Coverage: Curated Dashboards for AWS Services
Hundreds of engineering hours went into building our expert CloudWatch dashboards for the most popular AWS services. We consider it time well spent: every hour we spent building these dashboards translates into many hours saved for our users.
Usability: Content Easily Accessible From the Catalog
SignalFx customers already go to the catalog to search for anything in the system‒be it hosts, metrics, services or even dashboards. It made sense to make our new CloudWatch dashboards just show up there. Click on an AWS service namespace or instance, and you can view a customized, pre-populated dashboard for that service or instance right there. Easy to use and nothing new to learn.
Optimized for Monitoring Both Populations as Well as Individual Systems
One size does not fit all, and that applies to dashboards as well. A dashboard optimized for viewing a single instance will not work well (or at all) when viewing a cluster of instances. Because of this, we’ve built multiple specialized dashboards (two, three, or even up to five) per AWS service so users have the right one for their context. For example, ElastiCache has clusters and instances with Redis or Memcached backends, and OpsWorks has stacks, layers, and instances. A truly effective monitoring system must reflect this complex reality. We use population analytics to make multi-instance, cluster-level dashboards more effective: instead of showing a line for each member, which can be noisy, we use techniques like aggregates, percentiles, and TopN.
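As a rough illustration of the population techniques named above, the following plain-Python sketch summarizes one value per instance with percentiles and a TopN list rather than one line per member. It only illustrates the idea; the instance names and latency values are made up, and the function names are hypothetical.

```python
import heapq
import statistics
from typing import Dict, List, Tuple


def population_percentiles(values: Dict[str, float]) -> Dict[str, float]:
    """Summarize a population with p10, p50, and p90."""
    # quantiles(n=10) returns the nine cut points p10..p90 and needs at least two values.
    cuts = statistics.quantiles(values.values(), n=10, method="inclusive")
    return {"p10": cuts[0], "p50": cuts[4], "p90": cuts[8]}


def top_n(values: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n members with the largest values, largest first."""
    return heapq.nlargest(n, values.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical latest latency (ms) for eleven instances.
    latency = {f"i-{k}": k * 10 for k in range(11)}
    print(population_percentiles(latency))
    print(top_n(latency, 2))
```

The percentile summary stays readable at any population size, while TopN keeps the individual outliers visible.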
Smart Drilldowns: Using Dimensions and Tags/Properties
The catalog’s main purpose is to show all the dimensions and tags/properties available in the matching metrics. For example, select an AWS service and it shows you all individual instances of that service, the regions and AZs it exists in, and the tags and properties applied to those instances. From there, adding the ability to navigate and filter our AWS dashboards was a no-brainer. Now you can drill down into a particular AZ or tag, and the dashboard updates itself to show only matching instances. Drilldowns are also smart in a second way: you can use the service’s own hierarchy to drill down, e.g. ElastiCache clusters, OpsWorks stacks, and CloudFront distributions.
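Conceptually, a drilldown is a filter over the dimensions and tags attached to each instance. The sketch below shows that idea in plain Python. The record layout, the key names (`InstanceId`, `AvailabilityZone`, `team`), and the sample values are assumptions for illustration, not the SignalFx data model.

```python
from typing import Dict, List


def matches(record: Dict[str, str], filters: Dict[str, str]) -> bool:
    """True when every filter key is present on the record with the same value."""
    return all(record.get(key) == value for key, value in filters.items())


def drill_down(records: List[Dict[str, str]], filters: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only the instances whose dimensions and tags match all filters."""
    return [record for record in records if matches(record, filters)]


if __name__ == "__main__":
    # Hypothetical instances: dimensions and customer tags merged into one dict each.
    instances = [
        {"InstanceId": "i-1", "AvailabilityZone": "us-east-1a", "team": "search"},
        {"InstanceId": "i-2", "AvailabilityZone": "us-east-1b", "team": "search"},
        {"InstanceId": "i-3", "AvailabilityZone": "us-east-1a", "team": "billing"},
    ]
    selected = drill_down(instances, {"AvailabilityZone": "us-east-1a", "team": "search"})
    print([record["InstanceId"] for record in selected])
```

Applying the same filter set to every chart on a dashboard is what lets the whole view narrow to the matching instances at once.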
Consistency: Uniform Look and Feel Across All Services
We spent a huge amount of time making all our AWS dashboards share a common look and feel. That extends to minute details like chart naming, axis labelling, and the chart types used (e.g. percentile distribution charts look and feel the same across all services). We chose specific colors for specific metric types so users will quickly get used to them and be able to grok the information with a quick visual inspection.
Charts Optimized for Common Monitoring Use Cases
Last, but certainly not least, we used principles like redundancy (show the same data in multiple different ways) and analytics techniques (timeshift, aggregations) to highlight metrics in different and useful ways that would enhance monitoring workflows used by our customers. Here are some examples.
- TopN charts to identify outliers in a population:
- Percentile distribution charts to get an overview of a large population:
- Percent variation charts to monitor how load balanced a population is:
- Historical trends to highlight change from past values:
- Flags to show your love for your country — joking of course, but stacked area charts do produce some great looking eye candy sometimes:
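As a sketch of the percent variation idea from the list above, one plausible definition is the population standard deviation expressed as a percentage of the mean (the coefficient of variation). This definition is an assumption made for illustration; the formula behind the actual charts may differ, and the function name is hypothetical.

```python
import statistics
from typing import Optional, Sequence


def percent_variation(loads: Sequence[float]) -> Optional[float]:
    """Spread of per-instance load as a percentage of the mean (assumed definition)."""
    if not loads:
        return None
    mean = statistics.fmean(loads)
    if mean == 0:
        # All-zero load: relative variation is undefined.
        return None
    return 100.0 * statistics.pstdev(loads) / mean


if __name__ == "__main__":
    print(percent_variation([100, 100, 100]))  # evenly balanced
    print(percent_variation([50, 150]))        # unevenly balanced
```

Under this definition, a value near zero means requests are spread evenly across the population, and a rising value suggests that some members are taking a disproportionate share.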
Minimal or No Configuration
All this content is automatically available, without the need for any configuration by users. All they need to do is enter the AWS credentials necessary for us to sync CloudWatch metrics and metadata into their SignalFx accounts as described here.
This represents a big step by us towards making SignalFx more comprehensive for our AWS customers. Please send us feedback and suggestions‒we hope to iterate quickly on this as we increase the coverage and quality of our CloudWatch experience.