
In this post we’ve excerpted some of our conversation with Florian Berckemeyer, Manager of DevOps at Sunrun, about how they use SignalFx to reduce alert noise and fatigue. Read the full case study here!
Sunrun is the largest residential solar company in the United States with over 100,000 customers (and growing), and has deployed more than $2 billion in solar systems. We allow consumers to get a fast, personalized solar energy system designed for their home using our seamless “solar as a service” process.
Can you tell us more about your applications?
Sunrun relies on four main applications: Design and Pricing apps to help customers start planning their solar systems, Salesforce.com for backend customer and order management, and solar/electricity metering and monitoring systems.
The Design and Pricing apps are software created and maintained by Sunrun engineers and are a differentiator for the company. Both were written in the last three years to run completely on AWS. Since then, all new software we develop has been built on AWS using Java, MySQL (moving to Amazon Aurora), and JavaScript (moving from Bootstrap to React). We have just started down the path of containerization, breaking the apps into microservices, and moving our entire engineering organization into a DevOps model.
What are the monitoring challenges you face?
As we have shifted to a modern stack and pushed our engineering team toward a DevOps model with a microservices approach, it’s become clear that the existing combination of APM, logs, and low-fidelity metrics tools is insufficient for operating modern apps.
You set up an alert with our APM vendor and you get 50 emails for that one alert. That is how you get alert fatigue. There were three main reasons we decided to search for a new monitoring solution:
- We were suffering from alert fatigue because our existing solution was unable to create sophisticated alerts
- We weren’t able to catch failures because our existing tools couldn’t handle data at high resolution or monitor custom application metrics
- The high cost of APM made it prohibitive to deploy to all apps, especially the most critical ones that scale out on demand on AWS
Why SignalFx?
With SignalFx, we can now create meaningful alerts with dynamic thresholds on derived metrics such as growth rates, instead of static thresholds against individual system-level metrics like current CPU load.
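To make the contrast concrete, here is a minimal sketch in plain Java of a static threshold next to an alert on a derived growth rate. It is an illustration only, not SignalFx's detector API: the class name `GrowthRateAlert`, its method names, the window size, and the limits are all hypothetical.

```java
import java.util.List;

// Hypothetical illustration; this is not SignalFx's detector API.
final class GrowthRateAlert {

    // Static threshold: fires whenever the latest raw sample exceeds a fixed limit.
    static boolean staticThresholdFires(List<Double> samples, double limit) {
        return !samples.isEmpty() && samples.get(samples.size() - 1) > limit;
    }

    // Derived metric: relative change between the mean of the previous window
    // and the mean of the most recent window of samples.
    static double growthRate(List<Double> samples, int window) {
        if (window <= 0 || samples.size() < 2 * window) {
            throw new IllegalArgumentException("need at least 2 * window samples");
        }
        int n = samples.size();
        double previous = mean(samples.subList(n - 2 * window, n - window));
        double recent = mean(samples.subList(n - window, n));
        if (previous == 0.0) {
            // Avoid dividing by zero: any rise from zero counts as unbounded growth.
            return recent == 0.0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return (recent - previous) / previous;
    }

    // Alert on the derived metric: fires when growth exceeds maxGrowth (0.5 means +50%).
    static boolean growthRateFires(List<Double> samples, int window, double maxGrowth) {
        return growthRate(samples, window) > maxGrowth;
    }

    private static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

With these assumed numbers, a host sitting steadily at about 85% CPU trips a static 80% limit on every sample, while the growth-rate check stays quiet. A metric that doubles from 10 to 20 never reaches the static limit, but the growth-rate check flags it.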
SignalFx gives us a sophisticated way to create alerts that are meaningful rather than just generating false positives. SignalFx’s ability to consume and analyze every kind of metric has provided us with a single place to monitor, correlate, and alert on all metrics.
And instead of having to pay per instance, we pay SignalFx by the number of metrics monitored. This made it cost-effective to deploy to the entire fleet.
How do you use SignalFx?
Sunrun uses SignalFx for monitoring the Design and Pricing apps. We correlate systems, platform, and custom application metrics for both applications in real time across multiple dimensions. For example, we compare JVM performance metrics across different code versions as they move from development to production.
We use SignalFx’s integration with AWS to pull in AWS-specific metadata, and SignalFx’s open source Java client library in combination with the metrics-cdi library to create custom application metrics.
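As a rough illustration of comparing a JVM metric across code versions by dimension, the sketch below uses only the Java standard library. It does not use the SignalFx client library or metrics-cdi, and the class `DimensionedMetrics`, the metric name `jvm.heap.used`, and the dimension keys `version` and `env` are assumptions made for the example.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; does not use the SignalFx client or metrics-cdi.
final class DimensionedMetrics {

    // One datapoint: a metric name, a value, and free-form dimensions.
    static final class Datapoint {
        final String metric;
        final double value;
        final Map<String, String> dimensions;

        Datapoint(String metric, double value, Map<String, String> dimensions) {
            this.metric = metric;
            this.value = value;
            this.dimensions = Map.copyOf(dimensions);
        }
    }

    // Reads the current JVM heap usage in bytes and tags it with the given dimensions.
    static Datapoint heapUsed(Map<String, String> dimensions) {
        long used = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        return new Datapoint("jvm.heap.used", used, dimensions);
    }

    // Averages one metric per value of one dimension, for example per code version.
    static Map<String, Double> meanBy(List<Datapoint> points, String metric, String dimension) {
        Map<String, double[]> acc = new TreeMap<>(); // key -> [sum, count]
        for (Datapoint p : points) {
            String key = p.dimensions.get(dimension);
            if (!p.metric.equals(metric) || key == null) {
                continue; // skip other metrics and points missing the dimension
            }
            double[] a = acc.computeIfAbsent(key, k -> new double[2]);
            a[0] += p.value;
            a[1] += 1;
        }
        Map<String, Double> result = new TreeMap<>();
        acc.forEach((k, a) -> result.put(k, a[0] / a[1]));
        return result;
    }
}
```

Calling `meanBy(points, "jvm.heap.used", "version")` returns one average per code version. Swapping the last argument for `"env"` slices the same datapoints by environment instead.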
How has this made your life better?
SignalFx has made it possible for us to catch and troubleshoot multiple causes of poor performance, and to prevent their recurrence, before they impact reps, customers, and revenue. We would like to get to a point where all of engineering is using SignalFx.
Thanks,
Kim Harrison