What does digital resilience mean in the context of AI? Is it all about reliable model outputs? Transparency? Elasticity to meet demand? Back-ups and failure modes? Secure and trustworthy systems?
Most likely it is a mixture of all these things (and more) in your organisation, but defining what good looks like and how to get there is often a challenge. So, what is it about AI that is any different to other technologies you may ask? As always, the devil is in the detail, and this is what we will look at in this blog post.
Identifying how and why an AI system has produced a given output is hard, making it difficult to quantify predictable business impacts and, therefore, risk from these systems.
Don’t just take our word for it though, let’s first look at how a typical AI system is composed. The AI toolchain often involves different components and infrastructure to other types of software – such as AI models or GPU compute. Running these types of technologies requires specialist skills and knowledge, as well as some subtly different approaches to monitoring (which we will touch on later).
Added to this, AI systems are often nondeterministic, especially those that use Large Language Models (LLMs). This means that the outputs of AI systems are harder to predict, which can make it unclear if the system is behaving as expected or not.
Another factor with AI is that the models themselves are difficult to understand. There are very few organisations that are taking the efforts to train their own LLMs, more usually the approach is to take a pre-trained open source or commercial LLM. Even for traditional machine learning, organisations that train their own models often rely on a small number of data scientists to look after the model operation workflows.
While all of this means that identifying why and how an AI system has produced a given output is hard, what you can do is come up with a good monitoring strategy for seeing how your AI system is performing.
Thankfully most of what you do for monitoring other systems is relevant to AI systems. Figuring out how deep down the AI rabbit hole you want to go is the first step in the process…
Don’t leap headfirst into that rabbit hole though, you still need to practice all the good cybersecurity, performance and available hygiene as you would for any other system. We’ll look at some of the questions you should think over for security and observability next so you can take a measured pace toward making your use of AI resilient.
One of the biggest changes from adopting AI in your organisation is the growth of non-human user identities using agentic AI. Couple this with machine speed workflows and you have potential for a big security headache… Therefore, you need to start by considering if your operations are prepared for securing these new types of entities? This could be as simple as having some tagging in your asset and identity framework so these types of entities are visible, or it could be much more complex – such as creating bespoke detections or incident workflows for agentic entities.
Another challenge is the new types of threats that you may face from AI based systems, such as prompt based attacks or backdoors in models. Here you should consider what types of risks do you expect from adoption of AI in your organisation? Thankfully to help you out there has already been some excellent work from security research organisations, such as the GenAI Security project from OWASP and the MITRE ATLAS project. These projects are fantastic resources for understanding the unique security threats that you might face when adopting AI technologies. Don’t treat these projects as a list of must haves, however, instead think through the risk surface each AI system presents and be selective over which requirements are necessary.
The big question here is whether you want to monitor around the black box or inside the black box? If you start by monitoring around the black box then all the KPIs that you would normally think about for application monitoring are relevant – such as request rates, error rates and response times. Add into this mix some baselines for the types and volumes of responses coming out of the AI system and you have a good space to build from. Even just response volume-based monitoring is a great start (remember these systems are nondeterministic!).
When it comes to observing AI inside the ‘black box’ life is a little simpler thanks to the wonderful world of OpenTelemetry, in particular OpenLLMetry. Here you can instrument the collection of all sorts of interesting metrics, down to the actual prompt and response traces.
With most of the monitoring possible when it comes to AI, the question is how critical is the AI system to your business? Depending on the answer to this will determine the depth of the monitoring you need to deploy. One current trend in this space is ‘tokenomics’, using observability to show back to the business the cost or revenue associated with an AI service.
If you want more details on how you can secure and observe your use of AI then we have a lot of additional reading for you! If security is your thing then start with our article from Surge looking at the OWASP top 10 for LLMs, analysing which requirements can be implemented in Splunk with some examples too. If you’re keener on observability then take a look at this great piece on monitoring generative AI applications using OpenLLMetry with some step-by-step guidance.
Finally, if you want to hear from us in more detail on all these topics then join us for EMEA Digital Resilience Week!
Happy Splunking.
The world’s leading organizations rely on Splunk, a Cisco company, to continuously strengthen digital resilience with our unified security and observability platform, powered by industry-leading AI.
Our customers trust Splunk’s award-winning security and observability solutions to secure and improve the reliability of their complex digital environments, at any scale.