In this post, Stan Chan, Head of Core Infrastructure at Symphony Commerce, talks to us about how they’re using SignalFx to create the operational metric system for the whole organization.
Symphony Commerce delivers enterprise-level commerce as a service to today’s fastest growing brands. Symphony handles critical wholesale and retail business workflows, from commerce applications to fulfillment. This allows brands to focus on core functions like building products and nurturing customer relationships. Symphony’s commerce services come together seamlessly to deliver intelligent and personalized experiences across the customer lifecycle, so brands can deliver a unified, branded commerce experience. With a pay-per-use model, Symphony’s success is tied to its brands’ success; there are no setup costs or hidden fees.
Can you tell us about Symphony and your team?
Symphony brings established, enterprise-level commerce capabilities to small and medium-sized businesses (SMBs). Our vision is to democratize commerce by delivering commerce infrastructure previously available only to large companies to any brand in the world. We have about 70 employees, and the majority of us are engineers, product managers, or designers. There are six people on the core infrastructure team. My team is focused on the platforms used by everyone else, including all infrastructure configuration, deployment, monitoring, and operations. We work together with other engineering teams to optimize platforms for them and help them deploy and manage their own services.
Can you tell us a little bit about the nuts and bolts of your application?
We use mostly Java and Scala for the back end and AngularJS for the front end. Symphony runs 100% on AWS, scaling anywhere from 50 to many hundreds of VMs depending on load and how customers are using the system. We run EC2, S3, ELB, ELC, RDS, CloudFront, CloudFormation, and have just started using the new EC2 Container Service for our Docker deployment. Other technologies we use include Elasticsearch, Cassandra, Zookeeper, and Kafka.
What kind of challenges do you face with monitoring?
We had some gaps with our previous monitoring setup, which was a mixture of check-based tools and other commercial metrics tools. First, we couldn’t get access to all the metrics we wanted because of restrictions on the types or numbers of metrics that those tools could handle. Second, for the metrics we did have, we could not get aggregations like percentiles. Finally, we couldn’t get metrics at a fine enough resolution to make timely decisions and catch problems before they had too much of an impact on customers.
What does your monitoring stack look like now?
We use CollectD to gather infrastructure metrics and SignalFx’s Java client library to instrument metrics directly into Symphony code. Everything gets sent into SignalFx for production monitoring of systems and services. For code-level stack traces and performance monitoring, we use AppDynamics.
How do you use SignalFx?
We use SignalFx in two ways. First, we use it for infrastructure metrics and analytics for everyday monitoring. Second, and more importantly, we’re using it to create an operational metric system for the whole organization that combines business metrics, application metrics, and infrastructure metrics, giving us a real-time sense of how all of Symphony is doing.
We’re sending in interesting application and business metrics like the number of page views at a time, number of add-to-carts, number of orders placed, number of shipments placed, GMV, and so on.
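As a rough illustration of what counting business events with dimensions attached can look like inside a Java service, here is a minimal sketch. It is not Symphony's code and it does not use the SignalFx client library: the `BusinessCounters` class, the metric names (`orders.placed`, `cart.add`), and the `brand` and `az` dimensions are all hypothetical, and a real setup would hand these values to a metrics client rather than keep them in memory.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process registry for illustration; NOT the SignalFx client API.
class BusinessCounters {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Builds a stable key from the metric name plus its dimensions, sorted by
    // dimension name so the same dimensions always map to the same counter.
    static String key(String metric, Map<String, String> dimensions) {
        return metric + new TreeMap<>(dimensions);
    }

    // Records one occurrence of a business event (for example, an order placed).
    void increment(String metric, Map<String, String> dimensions) {
        counts.computeIfAbsent(key(metric, dimensions), k -> new LongAdder()).increment();
    }

    // Returns the current count, or 0 if the event has never been recorded.
    long value(String metric, Map<String, String> dimensions) {
        LongAdder adder = counts.get(key(metric, dimensions));
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        BusinessCounters counters = new BusinessCounters();
        Map<String, String> dims = Map.of("brand", "acme", "az", "us-east-1a");
        counters.increment("orders.placed", dims);
        counters.increment("cart.add", dims);
        counters.increment("cart.add", dims);
        System.out.println("cart.add = " + counters.value("cart.add", dims));
    }
}
```

Keeping the dimensions separate from the metric name, rather than baking them into the name, is what makes it possible to slice the same metric by customer or by availability zone later.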
On all our metrics, we frequently use percentiles and moving averages, particularly in combination with the “timeshift” capability to compare week-over-week and day-over-day changes in real time as the data comes in. Along with “timeshift”, the ability of SignalFx to consume and operate on multidimensional metrics has been tremendously useful. We’re able to look at data like request latency on a per-customer or per-AZ basis, or a combination thereof. This was impossible to do in any other product we tried.
We find correlations as soon as problems arise, whether they stem from a client action or from an internal event or change.
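To make the analytics mentioned above concrete, the sketch below shows the underlying arithmetic in plain Java: a nearest-rank percentile, a percentile grouped by customer and AZ, a trailing moving average, and a timeshift-style comparison against an earlier point in the same series. This is only an illustration of the ideas, not how SignalFx implements them; the `LatencyAnalytics` class, its method names, and the sample values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of any SignalFx library.
class LatencyAnalytics {
    // One latency measurement tagged with two dimensions.
    static final class Sample {
        final String customer;
        final String az;
        final double latencyMs;
        Sample(String customer, String az, double latencyMs) {
            this.customer = customer;
            this.az = az;
            this.latencyMs = latencyMs;
        }
    }

    // Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
    static double percentile(double[] values, double p) {
        if (values.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Groups samples by "customer/az" and computes the percentile within each group.
    static Map<String, Double> percentileByCustomerAndAz(List<Sample> samples, double p) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Sample s : samples) {
            groups.computeIfAbsent(s.customer + "/" + s.az, k -> new ArrayList<>()).add(s.latencyMs);
        }
        Map<String, Double> result = new TreeMap<>();
        groups.forEach((k, v) ->
            result.put(k, percentile(v.stream().mapToDouble(Double::doubleValue).toArray(), p)));
        return result;
    }

    // Trailing moving average; the first (window - 1) points average over fewer values.
    static double[] movingAverage(double[] series, int window) {
        if (window <= 0) throw new IllegalArgumentException("window must be positive");
        double[] out = new double[series.length];
        double sum = 0;
        for (int i = 0; i < series.length; i++) {
            sum += series[i];
            if (i >= window) sum -= series[i - window];
            out[i] = sum / Math.min(i + 1, window);
        }
        return out;
    }

    // Timeshift-style comparison: percent change of each point versus the point
    // `shift` steps earlier (for example, the same hour one week ago).
    // A zero earlier value yields Infinity or NaN, as with any double division.
    static double[] changeVsShifted(double[] series, int shift) {
        if (shift <= 0 || shift >= series.length) throw new IllegalArgumentException("bad shift");
        double[] out = new double[series.length - shift];
        for (int i = shift; i < series.length; i++) {
            out[i - shift] = 100.0 * (series[i] - series[i - shift]) / series[i - shift];
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
            new Sample("acme", "us-east-1a", 120), new Sample("acme", "us-east-1a", 340),
            new Sample("acme", "us-west-2a", 95));
        System.out.println(percentileByCustomerAndAz(samples, 99));
        double[] smoothed = movingAverage(new double[] {100, 200, 110, 150}, 2);
        System.out.println(Arrays.toString(changeVsShifted(smoothed, 2)));
    }
}
```

Smoothing with a moving average before comparing against the shifted series, as `main` does, is one way to keep a single noisy data point from dominating a week-over-week comparison.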