Care.com Refactors Monoliths Into Microservices With Splunk Observability

With a vision to centralize monitoring and telemetry data, Care.com needed an observability solution that could provide granular visibility to refactor its monolithic architecture and disparate systems into microservices.

With Splunk as the single platform and source of truth for all its observability needs, Care.com holistically understands its entire environment to find and fix errors faster, improve application architecture and accelerate feature releases.

How do you connect trusted caregivers with the people who need them most?

Enter Care.com, the world’s leading platform for finding and managing high-quality family care. Fueled by the mission to “provide care for all you love,” this two-sided marketplace connects vetted caregivers with those looking for care for children, seniors, pets and more. Care.com even offers household tax and payroll services to make it easy to file taxes and provide legal pay associated with being a household employer.

Millions of families and caregivers in more than 20 countries depend on Care.com, so the platform must be available, scalable and reliable around the clock. After the IAC acquisition of Care.com in early 2020, the organization’s immediate priority was to modernize and centralize the Care.com stack and infrastructure — which meant breaking down 13-year-old monoliths into agile microservices.

To ensure success, Care.com needed a powerful observability solution that could provide visibility into its architecture, efficiently pinpoint issues and provide enough flexibility to experiment. That’s why the team turned to the Splunk Observability Cloud.

Turning Data Into Outcomes
  • Reduced mean time to investigate and resolve incidents from an hour or more to less than 10 minutes
  • Gained better visibility into complex containerized infrastructure and simplified troubleshooting with full-fidelity tracing and no sampling
  • Gave developers the ability to release features on time, with more frequency and greater confidence

OpenTelemetry as the North Star

Centralizing Care.com’s infrastructure provides benefits across the business, from improving scalability and dismantling silos to delivering a flexible, test-driven experience to end users.

Yet the organization’s architecture, which included 13 years’ worth of disparate systems inherited from numerous acquisitions, was a complex puzzle all its own — and successfully understanding and replatforming it into microservices required savvy technologists and best-in-class tools. “The observability that Splunk provided was key,” says Sean Schade, principal architect on Care.com’s core architecture team. “One of the first things we did was turn on tracing and telemetry to understand how the pieces of this giant puzzle fit together.”

Being an early adopter of OpenTelemetry gave Care.com a distinct advantage when tackling a project of this size and scope. “OpenTelemetry is one of the North Stars of our architecture because of the insight it provides,” says Schade. “We could bake OpenTelemetry into our architecture from day one because we have Splunk, who is the number-one contributor to OpenTelemetry and way ahead of the curve on this.”

OpenTelemetry has also helped Care.com establish common standards to enhance collaboration, data democratization and productivity among engineers and developers. Schade says, “We now have a standard data format, which helps simplify exporting and integrations. We don’t need to go to five different vendors with five different formats to ingest that data; we have it all in one place.”

One Platform for Full-Fidelity Data and Faster Answers

Centralization is vital to Care.com’s teams — and having the Splunk Observability Cloud as the single platform to fulfill their observability needs has been key to success.

“If we’re flying blind, the cost is unlimited. Splunk has made my job as an architect easier because I can understand how our system is operating,” says Schade. “Microservices architecture is difficult to begin with, but if you don’t have APM or observability, you can be stuck burning countless hours and resources. The Splunk Observability Cloud gives us the visibility we need into our microservices, and allows us to see everything in one place and correlate, which is invaluable.”

“As systems become more complicated, you start to ask questions that you can’t answer because you forgot to collect, say, a metric. That’s where modern observability comes in,” says Senior Director of DevOps Engineering Matt Coddington, who manages the DevOps team at Care.com. “One of the big values for my team is that the Splunk platform handles ephemeral infrastructure really well, making it easy to see metrics as containers and servers come up or down. Splunk Observability Cloud captures all the logs, metrics and traces in a way that allows us to understand any event across our platform, so we can ask questions and get answers.”

Care.com has been working toward capturing all of this data for a while, but Splunk’s full-fidelity tracing is what made it possible. “You need 100% of your data for observability to be effective,” says Schade. “I don’t know how anyone can compete with Splunk’s no sampling. That’s been the biggest issue with any APM product I’ve used in the last seven or eight years.”
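
A toy comparison can illustrate why sampling matters for rare failures. The sketch below is only an illustration under stated assumptions: the `Trace` record, the `head_sample` function, the 1-in-10 rate and the error IDs are all hypothetical, and none of it describes how Splunk APM or any other product works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """A made-up trace record: an ID and whether the request failed."""
    trace_id: int
    is_error: bool


def head_sample(traces, keep_one_in):
    """Hypothetical sampler: keep a fixed fraction of traces, chosen by trace ID alone."""
    return [t for t in traces if t.trace_id % keep_one_in == 0]


def error_traces(traces):
    """Return only the traces that recorded an error."""
    return [t for t in traces if t.is_error]


# 1,000 invented traces with three rare failures at arbitrary IDs.
ERROR_IDS = {137, 512, 733}
all_traces = [Trace(i, i in ERROR_IDS) for i in range(1000)]

sampled = head_sample(all_traces, keep_one_in=10)

full_errors = error_traces(all_traces)
sampled_errors = error_traces(sampled)

print(len(all_traces), len(full_errors))  # 1000 3
print(len(sampled), len(sampled_errors))  # 100 0
```

In this contrived data set the sampler keeps 100 of the 1,000 traces and happens to drop all three failures, while the unsampled set retains every one of them. Real samplers and real traffic behave less neatly, but the risk is the same: the rarer the problem, the more likely a sample is to miss it.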

Splunk APM Accelerates Troubleshooting and Feature Releases

Thanks to Splunk Observability Cloud, Care.com has reduced its mean time to identify and resolve problems. Where solving a problem may have taken an hour or more before Splunk, the team now finds and fixes issues within minutes using Splunk APM features like Tag Spotlight. “Tag Spotlight is huge for troubleshooting because we can filter different dimensions in real time to answer questions around errors and latencies,” says Coddington.
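
The general idea of slicing errors and latencies by dimension can be sketched in a few lines. This is a simplified illustration, not Tag Spotlight itself: the span records, tag names and the `breakdown_by_tag` helper are hypothetical and invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records: each carries tags, a duration and an error flag.
spans = [
    {"tags": {"endpoint": "/search", "version": "v1"}, "duration_ms": 120, "error": False},
    {"tags": {"endpoint": "/search", "version": "v2"}, "duration_ms": 900, "error": True},
    {"tags": {"endpoint": "/enroll", "version": "v2"}, "duration_ms": 850, "error": True},
    {"tags": {"endpoint": "/enroll", "version": "v1"}, "duration_ms": 110, "error": False},
    {"tags": {"endpoint": "/search", "version": "v1"}, "duration_ms": 130, "error": False},
]


def breakdown_by_tag(spans, tag):
    """Group spans by one tag and report request count, error count and mean latency."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["tags"].get(tag, "unknown")].append(span)
    return {
        value: {
            "requests": len(members),
            "errors": sum(s["error"] for s in members),
            "mean_ms": mean(s["duration_ms"] for s in members),
        }
        for value, members in groups.items()
    }


by_version = breakdown_by_tag(spans, "version")
by_endpoint = breakdown_by_tag(spans, "endpoint")

for value, stats in sorted(by_version.items()):
    print(value, stats)
```

In this invented data, grouping by `endpoint` splits the errors evenly between the two pages, while grouping by `version` puts every error and all of the slow requests in `v2`. Comparing dimensions in this way is what narrows a vague symptom down to a likely cause.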

One such question arose when the Care.com team saw huge spikes in response times at specific times of day on specific pages of the site. Coddington and his team relied on Splunk APM to understand which parts of the system were being overwhelmed and what they had in common. After discovering that the culprit was silent mobile push notifications, the team worked with their vendor to rate-limit the notifications and return to business as usual.

Full-fidelity data proved especially valuable when Care.com sought to release new features during the organization’s back-to-school push last fall. “We had a hard deadline, brand new architecture and a lot of firsts, including the first time releasing anything on Kubernetes and the first time using gRPC services,” says Schade. “I don’t think we would have been able to release our features without Splunk APM because we wouldn’t have had the ability to see if the product was working and troubleshoot any unforeseen issues.”

From Better User Flows to Faster Load Times

From helping families find care to setting up a household payroll schedule, Care.com seeks to provide an exceptional user experience at every step. With Splunk Real User Monitoring (RUM) on Care.com’s enrollment application, the team can measure performance for complete transactions that span web browsers on the frontend to backend service dependencies — which has allowed them to identify and optimize poor page performance from slow third-party dependencies.

“We can now correlate backend traces from APM with frontend traces from RUM. That’s a huge value because that’s been our missing link,” says Schade. “It’s been very illuminating and has revealed hidden inefficiencies that we’re now able to address.”

The team’s next goal is to apply Splunk RUM to more flows across the business to further optimize page load times and improve users’ experiences. “We have a lot of questions from the business, like where users are coming from or how long they’re staying on something,” Schade says. “We’re looking forward to using RUM to help answer these questions.” Coddington echoes, “Splunk RUM is the product that ties into all the rest of the telemetry we have, and that’s a powerful thing.”

While Care.com has handled increased demand during the COVID-19 pandemic, the platform is looking at another year of high growth and demand. In addition to completing centralization, Coddington is aiming for a ten-fold increase in release frequency. To achieve that, his DevOps teams will use Splunk to see issues as they occur, set up automatic alerts and track release events over time. When releasing, teams will also leverage Splunk Log Observer’s live tail feature to see, in real time, if a service isn’t working properly.

Schade says, “It’s been a really good partnership between Splunk and Care.com. Splunk has helped us achieve and fast-track a lot of goals with our architecture that we couldn’t have otherwise accomplished, and we look forward to what’s next.”

Industry: Online Services