Clever is the platform that powers the classrooms of tomorrow. It was founded in 2012 by educators and technologists who knew that widely available educational apps could improve both teaching and learning, but that the tools to deploy and secure those applications were simply unavailable. Today, one in three innovative K-12 schools in the U.S. trusts Clever to secure its student data as it adopts learning apps in the classroom. In this post, we talk with Mohit Gupta, Product and Engineering Lead for Infrastructure at Clever.
Can you tell us more about Clever the app and your team?
Clever has become integral to educational infrastructure in the US, serving millions of students and educators at more than 50,000 schools. The platform connects those schools with more than 200 learning applications and runs on anywhere from 500 to 1,000 hosts, entirely on AWS.
We use services such as EC2, RDS, DynamoDB, ELB, Kinesis, Redshift, and Route53 and open source components like Docker, Mesos, Marathon, etcd, Flannel, Consul, Elasticsearch, Mongo, and Redis. Clever was originally written in Node.js and Python, and we’re moving it to Go-based services.
What are the challenges you face?
Nikhil Pandit and I lead the centralized infrastructure team here at Clever. As we’ve built out the operations stack, we’ve made sure the infrastructure lets developers easily create their own metrics so they can track the performance of their own services and do their own operations.
But creating metrics is only part of what’s needed. Developers also need to be able to visualize and alert on those metrics in a way that’s easy for them, rather than for a traditional operations team. That means:
- Self-service: so they can create and manage their own service’s metrics and alerting – without having to rely on other teams or learn specialized query languages
- Flexibility to apply their own analytics for alert creation: since they know what matters for their services, being able to compute things like latencies across a service or day-over-day changes in the number of logins in real time is important
- No reconfiguration due to infrastructure changes: since we frequently scale automatically by orders of magnitude, even within a single day, visualizations and alerts should never need to be reconfigured
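To make the second and third requirements concrete, here is a minimal Python sketch of a day-over-day check computed at the service level. It is an illustration only, not Clever's code or SignalFx's API: the function names, the `(service, host, value)` datapoint shape, the host names, the login counts, and the 50% drop threshold are all assumptions made for the example.

```python
from collections import defaultdict


def service_totals(datapoints):
    """Sum per-host values into one total per service.

    Each datapoint is a hypothetical (service, host, value) tuple. Hosts can
    come and go without changing the shape of the result, so a rule written
    against the service total needs no reconfiguration when the fleet scales.
    """
    totals = defaultdict(float)
    for service, _host, value in datapoints:
        totals[service] += value
    return dict(totals)


def day_over_day_change(today, yesterday):
    """Return the fractional change versus yesterday, or None with no baseline."""
    if yesterday == 0:
        return None
    return (today - yesterday) / yesterday


def should_alert(today, yesterday, max_drop=0.5):
    """Alert when the value fell by more than max_drop relative to yesterday."""
    change = day_over_day_change(today, yesterday)
    return change is not None and change < -max_drop


# Made-up login counts for the same time window on two consecutive days.
# Note that the set of hosts differs between the two days.
yesterday_points = [("login", "host-a", 600), ("login", "host-b", 400)]
today_points = [
    ("login", "host-a", 100),
    ("login", "host-c", 150),
    ("login", "host-d", 150),
]

yesterday_total = service_totals(yesterday_points)["login"]
today_total = service_totals(today_points)["login"]
alert = should_alert(today_total, yesterday_total)
```

Because the rule is expressed against the aggregated `login` total rather than individual hosts, the same check applies whether the service runs on two hosts or two hundred.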
Before SignalFx, we did not have a clear way to go from a metric in app code to creating an alert.
We looked at other solutions, and we even considered building on top of the logging tools that other projects were already using, but found that all of those were intensely manual and produced noisy alerts. We needed high-quality alerting with some kind of analytics to reduce the noise but still find problems quickly.
SignalFx was the only service that could create alerts based on real-time analytics against our metrics and give us the ability to track performance and behavior at the cluster or service level, instead of individual systems, while being fast enough to catch issues before they impact users. Developers and service teams could get real-time visibility into user-centric metrics, which are core to the business, and run whatever analytics they wanted without having to learn a new query language.
How do you use SignalFx?
Almost all of our alerts are based on derived metrics from custom analytics in SignalFx.
We believe that first-order monitoring and alerting should be based on user impact, so the primary metrics sent into SignalFx are custom metrics like request load, latencies on application actions, and number of logins. These have the greatest impact on our business and get the most focus from all teams. Next we’re interested in anything that would impact capacity and performance, so we send platform metrics like Mesos and Docker stats or message queue lengths. And finally, system metrics make up the rest.
To get data into SignalFx we’ve built our own metrics pipeline using the Diamond collection agent with a SignalFx plugin (which we wrote), Heka, and our own client libraries for app metrics. All alert creation is done in SignalFx and notifications are pushed into the right service team’s Slack channel, sent to PagerDuty, and sometimes also emailed or sent to other systems via Webhooks.
Everyone at Clever uses SignalFx, but each person is responsible for managing the metrics they want to instrument, monitor and alert against.
How has this made your life better?
We’ve had an amazing year! A big part of keeping Clever performing has been the alerts and metrics that let us see and respond to the impact of growth on our systems, which would not have been possible before SignalFx.
As an engineering and operations organization, we’ve learned how to build more powerful and meaningful alerts using analytics, reducing the noise that leads to alert fatigue and the operational failures that plague many SaaS companies.
Like many other companies, we experience significant seasonal demand that makes us subject to huge variances in load. Our load changes based on the school calendar, like when districts open for the year, or when schools start for the day, rolling across regions. We found that as we grew, almost every third party we depended on couldn’t grow with us without failing. The exception was SignalFx.
SignalFx has also helped us get more proactive about preemptive scaling, load testing, and managing the process of reacting to the change in demand. Because of SignalFx, everyone, developers and operations, has learned how to do alerting well and can do it for themselves.