As many of us are rediscovering an interest in board games, it seems fitting to invoke Hasbro’s classic Clue. Understanding what’s going right or wrong in your sprawling digital business can feel a lot like a murder mystery: it was the authentication service in the east region with the memory exhaustion error.
This analogy has a weakness when applied to modern operations. The Clue board game has 6 weapons, 6 suspects, and 9 rooms. That’s 6 × 6 × 9 = 324 combinations, which is a lot for a board game but tiny to a computer. The complication is that the business you’re running today is a complex system, built from infrastructure, services, third-party dependencies, and customers that add up to a complicated, non-linear, emergent sort of entity that you and your operators have to tame. The possibilities are effectively infinite! Furthermore, it’s not just one “case”; it’s a long-term relationship in which incidents are inevitable. Each one has something to teach us.
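To put that number in perspective, here is a rough sketch of the arithmetic. The Clue figures come from the game; the service, region, error-type, and customer counts are invented purely for illustration and do not describe any real system.

```python
from math import prod

# Clue: one suspect, one weapon, one room per solution.
clue_dimensions = {"suspects": 6, "weapons": 6, "rooms": 9}
clue_cases = prod(clue_dimensions.values())

# Hypothetical production system; every count here is an assumption.
system_dimensions = {
    "services": 200,
    "regions": 8,
    "error_types": 50,
    "customers": 10_000,
}
system_cases = prod(system_dimensions.values())

print(f"Clue: {clue_cases} possible cases")
print(f"Hypothetical system: {system_cases:,} possible cases")
print(f"Ratio: about {system_cases // clue_cases:,}x larger")
```

Even this toy model ignores time, deploy versions, and interactions between services, so the real space of possibilities is larger still.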
So Many Clues
As the complexity of our systems grows, that simplistic Clue joke falls apart. There are simply too many possible states for us to consider. We need more help from the computers and tools we wield while investigating a case: some CSI-type stuff.
Engineers investigating a problem need help examining the huge amount of information our systems generate. The cases they’ll need to investigate cover a lot more than Clue’s 324 possibilities, so they need tools that pinpoint likely locations, flag anomalous behavior, and dive into specific customers or combinations. This technology needs to support humans and build expertise as we learn from our investigations.
We’re gonna need a sympathetic, AI-wielding magnifying glass that can move at real-time speeds. Is there one of those back at the lab?
More Than The Usual Suspects
With the release of SignalFx Microservices APM, Splunk is moving beyond the paltry 324 combinations and empowering operators to see every possibility with NoSample™ full-fidelity tracing. This means that you can capture every database query, customer ID, and curious contraption that might cause trouble in your systems. Got a pesky customer problem that only manifests on rare occasions when nobody’s looking? Those traces will be there for examination.
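A quick back-of-the-envelope calculation shows why sampling is risky for rare problems. The sketch assumes a simple sampler that keeps each trace independently with a fixed probability; the 1% rate and the occurrence counts are illustrative assumptions, not figures from SignalFx.

```python
def chance_of_capture(sample_rate: float, occurrences: int) -> float:
    """Probability that at least one of `occurrences` traces is kept
    when each trace is sampled independently at `sample_rate`."""
    return 1.0 - (1.0 - sample_rate) ** occurrences

# Compare an assumed 1% sampler with keeping everything (rate 1.0).
for occurrences in (1, 3, 10):
    sampled = chance_of_capture(0.01, occurrences)
    full = chance_of_capture(1.0, occurrences)
    print(f"{occurrences:>2} bad traces: 1% sampling keeps at least one "
          f"{sampled:.1%} of the time; full fidelity {full:.0%}")
```

Under these assumptions, a problem that shows up only a handful of times will most likely leave no sampled trace behind, which is the gap full-fidelity capture is meant to close.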
Leaning On The Lab
There’s a lot of great stuff to learn after an incident, but we can’t wait for critical information to make its way to the lab so that Bones can wow us all a week later with an expert analysis. Timely information about anomalous behavior can direct an operator to the right place at the sharp end of the problem. With AI-Driven Directed Troubleshooting, you’ve got the lab with you in real time, examining clues and providing insight.
No More Stake Outs
Nobody likes staring at charts all day, even if you’ve got a partner to tell you jokes and keep you company. Instead, have the robots keep an eye on the suspects so that we can focus on more interesting work. With real-time dynamic alerting on sudden spikes and deviations from historical norms, you can feel confident that if any funny business occurs you’ll be notified, so you can hop in your patrol car and drive to a suggested location to nab the offender.
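As a rough illustration of what a robot on stakeout might do, the sketch below flags a point as a spike when it sits more than a few standard deviations above the mean of a trailing window. This is a generic rolling z-score check, not SignalFx’s actual detector; the function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev

def find_spikes(values, window=5, threshold=3.0):
    """Return indexes whose value exceeds mean + threshold * stdev
    of the preceding `window` points."""
    spikes = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # Skip perfectly flat history so zero spread never triggers an alert.
        if spread > 0 and values[i] > baseline + threshold * spread:
            spikes.append(i)
    return spikes

# Hypothetical per-minute latency readings in milliseconds.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 250, 101, 99]
print(find_spikes(latency_ms))
```

Because the baseline is recomputed from recent history at every step, the threshold moves with the data instead of being a fixed number someone has to tune by hand.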
This detection capability is powered by SignalFx streaming analytics, which are generated automatically (no extra work from you!) from incoming traces. Since every trace is analyzed, SignalFx APM gives you an accurate, unsampled view of metrics such as Rate-Errors-Duration (RED).
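To make the Rate-Errors-Duration idea concrete, here is a minimal sketch that derives those three numbers from a batch of spans. The `Span` shape, its field names, the `red_metrics` helper, and the sample data are hypothetical stand-ins, not the SignalFx data model.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Span:
    service: str
    duration_ms: float
    error: bool

def red_metrics(spans, window_seconds):
    """Compute request rate, error ratio, and mean duration for the
    spans observed during a window of `window_seconds`."""
    if not spans:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    return {
        "rate_per_s": len(spans) / window_seconds,
        "error_ratio": sum(s.error for s in spans) / len(spans),
        "mean_duration_ms": mean(s.duration_ms for s in spans),
    }

# Hypothetical spans collected over a two-second window.
spans = [
    Span("auth", 120.0, False),
    Span("auth", 80.0, False),
    Span("auth", 400.0, True),
    Span("auth", 100.0, False),
]
print(red_metrics(spans, window_seconds=2))
```

When every span is counted, these figures are measured directly; with sampling they would have to be scaled up from a subset, which introduces estimation error.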
Because the Smart Agent also adds selected span tags to every span captured, operators can see interactions not only with the application but also with its accomplices: the underlying infrastructure. Those additional tags allow SignalFx to identify which piece of infrastructure each trace span was executed on, render the key infrastructure metrics, and link to more complete dashboards for the underlying host or Kubernetes pod. You can also add your own tags to gain unique insights into your application and environment.
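The idea of tracing a problem back to a piece of infrastructure, or to a single customer, through tags can be sketched as follows. The tag keys (`host`, `customer_id`), the span dictionaries, and the `errors_by_tag` helper are illustrative assumptions; the tags the Smart Agent actually adds may differ.

```python
from collections import defaultdict

# Hypothetical spans; every tag key and value is made up for illustration.
spans = [
    {"operation": "login", "error": True,  "tags": {"host": "east-1a", "customer_id": "c-42"}},
    {"operation": "login", "error": False, "tags": {"host": "east-1b", "customer_id": "c-7"}},
    {"operation": "login", "error": True,  "tags": {"host": "east-1a", "customer_id": "c-42"}},
    {"operation": "token", "error": False, "tags": {"host": "east-1a", "customer_id": "c-9"}},
]

def errors_by_tag(spans, tag_key):
    """Count failed spans grouped by the value of one tag."""
    counts = defaultdict(int)
    for span in spans:
        if span["error"]:
            counts[span["tags"].get(tag_key, "unknown")] += 1
    return dict(counts)

print(errors_by_tag(spans, "host"))         # infrastructure view
print(errors_by_tag(spans, "customer_id"))  # custom-tag view
```

The same grouping works for any tag, which is why attaching infrastructure and custom tags to every span lets one question be asked along many dimensions.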
Case Closed… Or Is It?
We know that you and your organization are going to have plenty of mysteries in the future. It’s our hope that these tools empower all of your operators to be a compact, elite investigative unit for everything from The Case of the Rebooting Machine all the way up to The Case Where One Customer Has This Weird Experience, all in real time, fully detailed, and at scale.