How Sunrun Uses SignalFx to Create Meaningful Alerts

By Splunk

In this post we’ve excerpted some of our conversation with Florian Berckemeyer, Manager of DevOps at Sunrun, about how they use SignalFx to reduce alert noise and fatigue. Read the full case study here!

Sunrun is the largest residential solar company in the United States, with over 100,000 customers (and growing), and has deployed more than $2 billion in solar systems. Sunrun lets consumers get a fast, personalized solar energy system designed for their home through its seamless “solar as a service” process.

Can you tell us more about your applications?
Sunrun relies on four main applications: Design and Pricing apps that help customers start planning their solar systems, Salesforce.com for backend customer and order management, and solar and electricity metering and monitoring systems.

The Design and Pricing apps are created and maintained by Sunrun engineers and are key differentiators for the company. Both were written in the last three years to run completely on AWS. Since then, all new software we develop has been built on AWS using Java, MySQL (moving to Amazon Aurora), and JavaScript (moving from Bootstrap to React). We have just started down the path of containerization, breaking the apps into microservices, and moving our entire engineering organization to a DevOps model.

What are the monitoring challenges you face?
As we have shifted to a modern stack and pushed our engineering team toward a DevOps model with a microservices approach, it has become clear that the existing combination of APM, logs, and low-fidelity metrics tools is insufficient for operating modern apps.

You set up an alert with our APM vendor and you get 50 emails for a single alert. That leads to alert fatigue. There were three main reasons we decided to search for a new monitoring solution:

We were suffering from alert fatigue because our existing solution was unable to create sophisticated alerts
We weren’t able to catch failures because our existing tools couldn’t handle data at high resolution or monitor custom application metrics
The high cost of APM made it prohibitive to deploy to all apps, especially the most critical ones that scale out on demand on AWS

Why SignalFx?
With SignalFx, we can now create meaningful alerts with dynamic thresholds on derived metrics such as growth rates, instead of static thresholds against individual system-level metrics like current CPU load.
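To make the contrast concrete, here is a minimal sketch in plain Python. It is not SignalFx's detector syntax or Sunrun's actual alert logic; the function names, the window size, and the 80% limit are all assumptions chosen for illustration. It compares a static threshold on the current value with a dynamic threshold on a derived metric (the rate of change between samples).

```python
from statistics import mean, stdev


def rate_of_change(values):
    """Derived metric: difference between consecutive samples."""
    return [b - a for a, b in zip(values, values[1:])]


def static_alert(values, limit=80.0):
    """Static threshold: fire whenever the latest raw value exceeds a fixed limit."""
    return values[-1] > limit


def dynamic_alert(values, window=5, n_sigma=3.0):
    """Dynamic threshold (hypothetical rule): fire when the latest rate of change
    exceeds the mean of the preceding `window` rates by more than n_sigma
    standard deviations."""
    rates = rate_of_change(values)
    if window < 2 or len(rates) < window + 1:
        return False  # not enough history to estimate a baseline
    history = rates[-(window + 1):-1]
    latest = rates[-1]
    threshold = mean(history) + n_sigma * stdev(history)
    return latest > threshold


# A busy but stable host: high CPU, steady growth.
busy_but_stable = [90, 91, 92, 93, 94, 95, 96]
# A quiet host whose load suddenly jumps.
sudden_jump = [20, 21, 22, 23, 24, 25, 60]

print(static_alert(busy_but_stable), dynamic_alert(busy_but_stable))
print(static_alert(sudden_jump), dynamic_alert(sudden_jump))
```

In this sketch the static rule fires on the stable host and stays silent on the sudden jump, while the dynamic rule does the opposite. A production detector would likely also want a minimum deviation floor so that a perfectly flat history does not make the threshold overly sensitive.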

SignalFx gives us a sophisticated way to create alerts that are meaningful rather than just generating false positives. SignalFx’s ability to consume and analyze every kind of metric has provided us with a single place to monitor, correlate, and alert on all metrics.

And instead of having to pay per instance, we pay SignalFx by the number of metrics monitored. This made it cost-effective to deploy to the entire fleet.

How do you use SignalFx?
Sunrun uses SignalFx for monitoring the Design and Pricing apps. We correlate systems, platform, and custom application metrics for both applications in real time across multiple dimensions. For example, we compare JVM performance metrics across different code versions as they move from development to production.
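As a rough illustration of comparing a metric across a dimension, the following Python sketch groups datapoints by a `version` dimension and averages them. The metric name `jvm.heap.used_mb`, the dimension names, and the sample values are hypothetical; this shows the idea of dimensional comparison only, not how SignalFx stores or queries data.

```python
from collections import defaultdict
from statistics import mean


def mean_by_dimension(datapoints, metric, dimension):
    """Average one metric's values, grouped by the value of a single dimension."""
    groups = defaultdict(list)
    for dp in datapoints:
        if dp["metric"] == metric and dimension in dp["dimensions"]:
            groups[dp["dimensions"][dimension]].append(dp["value"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical datapoints: the same JVM metric reported by two code versions.
datapoints = [
    {"metric": "jvm.heap.used_mb", "value": 410,
     "dimensions": {"version": "1.4.0", "env": "production"}},
    {"metric": "jvm.heap.used_mb", "value": 390,
     "dimensions": {"version": "1.4.0", "env": "production"}},
    {"metric": "jvm.heap.used_mb", "value": 580,
     "dimensions": {"version": "1.5.0", "env": "development"}},
    {"metric": "jvm.heap.used_mb", "value": 620,
     "dimensions": {"version": "1.5.0", "env": "development"}},
    {"metric": "jvm.gc.pause_ms", "value": 35,
     "dimensions": {"version": "1.5.0", "env": "development"}},
]

heap_by_version = mean_by_dimension(datapoints, "jvm.heap.used_mb", "version")
print(heap_by_version)
```

Grouping by `env` instead of `version` would give the same kind of comparison across development and production.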

We use SignalFx’s integration with AWS to pull in AWS-specific metadata, and SignalFx’s open source Java client library in combination with metrics-cdi to create custom application metrics.

How has this made your life better?
SignalFx has made it possible for us to catch, troubleshoot, and prevent the recurrence of multiple causes of poor performance before they impact reps, customers, and revenue. We would like to get to a point where all of engineering is using SignalFx.

Thanks,
Kim Harrison
