As companies grow, visibility into cloud infrastructure, monitoring the uptime of services, and collecting and analyzing data from a wide variety of applications all become critical to the success of the organization and its customers. An unfortunate and often unavoidable side effect of this growth, however, is that it puts stress on existing systems, precipitating the need for newer technologies that can scale over time.
At Acquia, years of existing customer growth and new customer acquisitions required us to increase the size of our fleet on an almost hourly basis. From a business perspective this was phenomenal — an expected side effect of a thriving business — but from an operational perspective, it was an early indication that we would inevitably reach a size when our original open-source monitoring service would no longer scale to meet our needs.
With so many servers, services, and applications to monitor, we first started down the path of trying to build our own monitoring services four years ago. While that project was in flight, we also carved out time and resources to build “temporary” solutions to help us filter and handle the growing number of alerts generated by our existing monitoring systems. What we discovered in the years that followed was, simply put, that it is exceptionally difficult to dedicate the necessary time, resources, and expertise to such an initiative when there are so many other needs and problems to address across an engineering organization. As a result, we eventually had to stop and ask ourselves, “Is this something we should even be doing?”
By the time we asked ourselves that question, we were no different from many other companies our size: we had more than a dozen different systems across various teams for monitoring and analysis; there were no central management or controls for those systems, leading to inconsistent metrics and interpretations of data across products; and new teams repeatedly found themselves spending weeks on the evaluation, implementation, customization, and maintenance of new monitoring services, all while our primary products and teams continued to use our imperfect legacy solution.
The resulting pain was felt at all levels of Acquia, with teams across the organization and around the world experiencing toil and blockers due to monitoring service limitations and the bandwidth constraints they caused for our engineers. They simply could not find or interpret the data they needed with consistency, efficiency, or ease, and at the same time, we could not even provide our customers with all of the essential server health metrics they needed to optimize the uptime and performance of their applications.
In short — we needed a new solution.
Choosing the Right Monitoring Service
We didn’t just need a new monitoring service — we needed one that could handle all of our complex use cases as quickly as possible and on a tight budget. With those things in mind, we identified three possible paths we could take:
- Build a new monitoring service from scratch, in house;
- Take an existing open source solution and customize it to suit our own needs; or,
- Go the SaaS route and find a company/product that excels in this arena, allowing us to focus on what we do best while they focus on doing what they do best.
All three options had positive and negative attributes. Although Option 1 would allow us to address all of our needs precisely the way we wanted to, we estimated that it would take the most time and money to accomplish, and we would need to permanently dedicate engineering resources to maintaining and improving whatever we ended up building. Option 2 would require less effort than Option 1, but it would still require us to maintain the services and be responsible for upgrading them over time. Option 3, however, represented a current industry trend, where more and more companies are moving away from custom-built, in-house services in favor of plug-and-play solutions.
Option 3 seemed to make the most sense for us. A SaaS offering would provide us with a readily available service with 24/7 support, a guarantee of new features and innovations on a regular cadence, and the ability to customize the service to suit our needs.
Making that decision was the easy part — figuring out which SaaS monitoring service to entrust with a fleet as large as Acquia’s was a great deal more difficult. When it came to choosing a SaaS monitoring service, we did not want to limit our focus to the technical features and capabilities of an offering — we also wanted to look at the company behind the services. Everyone claims they can solve your problems, but how do you know who truly is the best fit for your organization?
So when evaluating SaaS companies, we considered the following questions:
- How would we implement this solution, from install and initial customization through to feature configuration?
- How much work would it be to maintain the service long-term?
- What limitations does the service have, and are they deal breakers?
- What is the vendor’s support plan and SLA?
- Are they a startup or an established company?
- What are other people saying about them? Are they often recommended?
- What is the cost?
In our evaluation of more than a dozen possible solutions, we narrowed our options down to three companies with the features, reputations, and price ranges we were looking for. From there, we needed to look at what set each company apart from the others. With more than 15,000 instances in our fleet, our primary concern was that none of these services would be able to ingest the volume of data (millions of data points per minute) we would be sending. Needless to say, when two of the three vendors were willing to let us test their services on our entire fleet for free, that showed us how confident they were in their services.
At this point, we also began seeking out reviews from the current customers of each vendor. This led to one surprising find: the more established and popular service was actually poorly reviewed by the existing customers we spoke with. These customers cited concerns about the product’s performance issues at scale, as well as the company’s lack of responsiveness to feature requests and bug fixes.
One final concern we had was the age of the companies we were working with. On the one hand, we could entrust our fleet and years of investment to a company that was considered an industry leader in the monitoring space. On the other hand, we could invest in a company with a beautiful, innovative service but limited experience and only preliminary customer reviews. In between was a company with some experience, great reviews, and lots of room to grow. Given that the most established contender was not well reviewed by some existing customers, and that we would need to cope with the growing pains of the youngest company, the company in the middle seemed the safest option from a liability perspective.
Keeping all of this information in mind, our final choice was SignalFx. With competitive pricing based on the number of metrics we send each minute, we could fine-tune our usage and control our costs based on our evolving needs over time. Their functionality was also very close to what we needed out of the box, their customer reviews revealed genuine excitement about their services, and they assured us that we could provide routine feedback on new features and their roadmap to ensure our most critical needs were met.
SignalFx Results (So Far)
SignalFx is a SaaS monitoring service which ingests, renders, and analyzes large volumes of server and application data. It also has advanced alerting and notification functions, which can be triggered whenever thresholds you define are breached. With a variety of possible integration mechanisms available to us, Acquia has predominantly been using the SignalFx fork of an open source monitoring agent called collectd. This has allowed us to add, enable, and customize any plugins we need to keep a close eye on the specific services running on our fleet (MySQL, Nginx, Varnish, etc.).
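To make the idea of threshold-based alerting concrete, here is a minimal Python sketch. It is a generic illustration rather than SignalFx's detector API: the `ThresholdRule` class, the `breaches` function, the metric name, and the sample values are all hypothetical. The rule fires only after a metric has stayed above its limit for several consecutive samples, which is one common way to avoid alerting on a single spike.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: fire when a metric stays above a limit."""
    metric: str
    limit: float
    min_consecutive: int = 3  # breaching samples required before alerting


def breaches(rule: ThresholdRule, samples: Iterable[float]) -> List[int]:
    """Return the sample indexes at which the rule fires.

    The rule fires once per breach episode, at the sample where the
    run of above-limit values first reaches min_consecutive.
    """
    fired = []
    streak = 0
    for index, value in enumerate(samples):
        if value > rule.limit:
            streak += 1
            if streak == rule.min_consecutive:
                fired.append(index)
        else:
            streak = 0  # the metric recovered, so start counting again
    return fired


# Illustrative data only: made-up CPU utilization percentages.
rule = ThresholdRule(metric="cpu.utilization", limit=90.0, min_consecutive=3)
cpu_samples = [72.0, 91.5, 95.0, 88.0, 93.0, 94.2, 97.1, 99.0, 60.0]
alerts = breaches(rule, cpu_samples)
print(alerts)
```

In this made-up series, the first two-sample spike is ignored and the rule fires only during the longer run of high values.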
Where we were previously monitoring a small sliver of essential server operations across our fleet, we are now able to send and analyze nearly 300 metrics with four times more granularity than we had before. With the insights we’ve gained, we’ve been able to identify and remediate more than a dozen issues and inefficiencies in our fleets, allowing us to save more than $600,000 per year in hardware expenses. We’ve also been able to improve the overall quality of the services we provide, increasing our engineers’ visibility into the health of the fleet and specific customers, and consolidating the number of monitoring services our teams need to use.
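Because pricing is tied to the number of metrics sent each minute, a back-of-the-envelope estimate of data volume is a useful planning tool. The Python sketch below is only illustrative: the fleet size of 15,000 instances comes from this post, but treating "nearly 300 metrics" as a per-instance upper bound and the 60-second reporting interval are assumptions, and `datapoints_per_minute` is a hypothetical helper, not part of any vendor tooling.

```python
def datapoints_per_minute(instances: int,
                          metrics_per_instance: int,
                          interval_seconds: int) -> float:
    """Estimate how many data points a fleet emits each minute."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    reports_per_minute = 60 / interval_seconds
    return instances * metrics_per_instance * reports_per_minute


INSTANCES = 15_000          # fleet size mentioned in this post
METRICS_PER_INSTANCE = 300  # ASSUMPTION: "nearly 300" treated as a per-instance upper bound
INTERVAL_SECONDS = 60       # ASSUMPTION: one report per metric per minute

baseline = datapoints_per_minute(INSTANCES, METRICS_PER_INSTANCE, INTERVAL_SECONDS)

# What-if: disabling 100 rarely used metrics per instance (hypothetical).
trimmed = datapoints_per_minute(INSTANCES, 200, INTERVAL_SECONDS)

print(f"baseline: {baseline:,.0f} data points per minute")
print(f"trimmed:  {trimmed:,.0f} data points per minute")
```

Under these assumptions the estimate lands in the millions of data points per minute, and the what-if line shows how trimming metrics or lengthening the reporting interval would change the volume being billed.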
What’s Next for Acquia and SignalFx
With the swift implementation of SignalFx across our fleets, our teams have been able to focus on optimization, not building and maintaining a monitoring system of our own. As we near the final stages of getting everything we need sent over to SignalFx and configured properly, we have already started looking ahead and planning out all of the new and exciting features we have been eager to build, including:
- A new and improved StackView UI for our customers that will allow them to see essential server health metrics and any key events which might have affected server or application performance.
- Incident auto-remediation mechanisms that will eliminate wasted manual effort by our internal teams when common issues are detected.
- New automated diagnostic tools that will standardize and streamline our incident response process internally, reducing time-to-resolution when problems arise.
- Predictive monitoring and alerting mechanisms so we can catch, investigate, and resolve anomalous trends in key server health metrics before the customer’s services are affected in any way.
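As a rough illustration of what predictive alerting can mean in practice, the Python sketch below fits a straight line to recent samples of a metric and projects how long it would take to cross a limit. This is one simple technique chosen for illustration, not a description of what Acquia or SignalFx actually built; the function name `minutes_until_breach` and the disk usage figures are hypothetical.

```python
from typing import Optional, Sequence


def minutes_until_breach(samples: Sequence[float],
                         limit: float,
                         interval_minutes: float = 1.0) -> Optional[float]:
    """Project when a metric will cross `limit` using a least-squares line.

    Returns None when there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx  # change in the metric per sample
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    latest_fit = intercept + slope * (n - 1)  # fitted value at the newest sample
    if latest_fit >= limit:
        return 0.0  # the fitted trend is already at or past the limit
    return (limit - latest_fit) / slope * interval_minutes


# Illustrative data only: disk usage (%) sampled once per minute.
disk_usage = [70.0, 71.0, 72.0, 73.0, 74.0]
eta = minutes_until_breach(disk_usage, limit=90.0)
print(eta)
```

A real system would need to handle noisy or seasonal data, but even a projection this simple shows how an alert could be raised before a hard threshold is actually breached.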
At Acquia, we are immensely proud of our ability to provide customers with best-in-class monitoring and diagnostic services, giving them peace of mind while they focus on building and optimizing mission-critical applications. With SignalFx available across our fleet now, our products and services will only keep getting better.
Mike Klaczynski and Aaron “Checo” Pacheco
This is a guest post by Aaron “Checo” Pacheco of Acquia, originally published on the Acquia blog.