As Large Language Models (LLMs) are increasingly deployed in various applications, the need to monitor their performance and behavior becomes critical. LLM monitoring is the practice of evaluating the performance of these models, their underlying technology stack, and the applications they power. This evaluation relies on a range of metrics designed to:
Of the many types of IT monitoring, LLM monitoring involves continuously analyzing and understanding the performance and behavior of the LLM application and its technology stack. It's about ensuring that these powerful tools operate as intended — without compromising security, cost-effectiveness, or ethical considerations.
(Related reading: LLMs vs SLMs and top LLMs to use today.)
When deploying an LLM application in a production environment, your models remain vulnerable to sophisticated attacks such as black-box adversarial attacks and indirect prompt injection. These attacks can lead to undesirable LLM behaviors, including hallucinations and excessive resource consumption.
Furthermore, they pose security risks to end-users who may act on falsified information generated by a compromised LLM tool.
Consider that any foundation model is trained on vast amounts of publicly available data. This training data may contain biases, either favoring or disfavoring specific entities. Since deep learning models are inherently "black boxes," their decision-making processes are often opaque. This lack of explainability means you can't easily determine:
As a result, LLMs may hallucinate and generate false information during inference.
A seemingly simple prompt like "keep expanding this list in 2000 variations" can trigger continuous and excessive compute utilization. LLMs will attempt to fulfill such requests until they reach their maximum token limit. A distributed botnet could exploit this by overwhelming your LLM application with similar recursive inference requests, leading to a Denial of Service (DoS) attack.
These compute-intensive prompts not only drive up API billing costs, but also render the application unusable for legitimate users.
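As a hedged illustration, the sketch below caps completion length and applies a simple per-client rate limit before any request reaches the model. The constants and the in-memory sliding window are assumptions; in production you would replace them with your own limits and shared infrastructure.

```python
import time
from collections import defaultdict

# Assumed limits; tune these to your model, cost constraints, and traffic.
MAX_COMPLETION_TOKENS = 512
REQUESTS_PER_MINUTE = 20

_request_log = defaultdict(list)  # client_id -> timestamps of recent requests


def allow_request(client_id: str) -> bool:
    """Simple sliding-window rate limit per client."""
    now = time.time()
    recent = [t for t in _request_log[client_id] if now - t < 60]
    _request_log[client_id] = recent
    if len(recent) >= REQUESTS_PER_MINUTE:
        return False
    _request_log[client_id].append(now)
    return True


def build_inference_params(prompt: str) -> dict:
    """Always cap completion length so a single prompt cannot run
    all the way to the model's maximum token limit."""
    return {"prompt": prompt, "max_tokens": MAX_COMPLETION_TOKENS}
```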
To effectively monitor the behavior and performance of your LLM application, consider the following metrics and evaluation criteria, which cover model quality as well as model behavior.
Perplexity measures how well a language model predicts a sample of text. It's based on the concept of probability: a lower perplexity score indicates that the model is more confident in its predictions and, therefore, performs better.
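For a concrete (if simplified) view, perplexity can be computed directly from per-token log probabilities. The minimal sketch below assumes natural-log probabilities, as returned by your model or API.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower is better.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)


# Example: a fairly confident 4-token prediction
print(perplexity([-0.2, -0.5, -0.1, -0.3]))  # ~1.32
```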
Accuracy measures the correctness of LLM outputs in specific tasks. The way it is measured can vary depending on the task:
Factuality assesses whether the information generated by an LLM is accurate and consistent with real-world knowledge. This is particularly important for information-seeking tasks. Factuality can be assessed using knowledge bases, human verification, or automated tools.
Internal metrics are specific to your organization and the tasks your LLM is designed for. Examples include:
External benchmarks are standardized datasets and evaluation procedures used to compare the performance of different LLMs. Holistic Evaluation of Language Models (HELM) is a comprehensive benchmark for evaluating language models across a wide range of scenarios and metrics. LegalBench is more specific: evaluating legal reasoning in English-language LLMs.
Bilingual Evaluation Understudy Score (BLEU) measures the quality of machine-translated text by comparing it to one or more reference translations. It assesses the overlap of n-grams (sequences of n words) between the generated text and the reference text. Higher BLEU scores indicate better translation quality.
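As a minimal sketch using NLTK's implementation (assuming the nltk package is installed), BLEU can be computed for a single sentence pair like this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of reference token lists
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```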
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of a generated summary by comparing it to a reference summary. It focuses on recall, i.e., how much of the important information from the reference summary is captured in the generated summary.
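A minimal sketch using the rouge-score package (an assumption about your tooling; other ROUGE implementations exist):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",        # reference summary
    "a cat was sitting on the mat",  # generated summary
)
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)
```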
BERTScore evaluates the semantic similarity between generated text and reference text using BERT embeddings. It captures more nuanced semantic relationships than simple n-gram overlap metrics like BLEU and ROUGE.
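A minimal sketch using the bert-score package (assumed to be installed; it downloads model weights on first use):

```python
from bert_score import score

candidates = ["The weather is cold today."]
references = ["It is freezing outside today."]

# lang="en" selects a default English model under the hood.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```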
G-Eval is a framework that uses LLMs to evaluate other LLMs based on a set of guidelines and criteria. G-Eval leverages the reasoning capabilities of LLMs to provide more accurate and comprehensive evaluations.
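A hedged sketch of the LLM-as-judge pattern is shown below; call_llm, the criteria, and the 1-to-5 scale are illustrative assumptions, not the official G-Eval prompt or rubric.

```python
# call_llm is a hypothetical helper standing in for whatever client your
# stack uses (an API SDK, a local model, etc.).

JUDGE_PROMPT = """You are evaluating a model answer for coherence and
factual consistency with the source text.

Source: {source}
Answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."""


def evaluate_answer(source: str, answer: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(source=source, answer=answer))
    # A real implementation would parse the reply more defensively.
    return int(reply.strip())
```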
Bias and fairness metrics quantify the presence of bias in LLM outputs:
SelfCheckGPT is a method for evaluating the factual consistency of generated text by prompting the LLM to self-check its claims and provide evidence for its statements.
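The core intuition (factual claims should reappear consistently across independently sampled answers) can be sketched as follows. This is a rough approximation: the published SelfCheckGPT variants use NLI, QA, or embedding-based checks, while plain word overlap is used here only to keep the example self-contained.

```python
def consistency_score(claim: str, samples: list[str]) -> float:
    """Fraction of independently sampled answers (temperature > 0) that
    share most of the claim's content words."""
    claim_words = set(claim.lower().split())
    agreements = 0
    for sample in samples:
        overlap = len(claim_words & set(sample.lower().split())) / max(len(claim_words), 1)
        agreements += overlap > 0.5           # assumed agreement threshold
    return agreements / max(len(samples), 1)  # low scores flag possible hallucinations
```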
Question-Answer Generation (QAG) is an evaluation approach that assesses the quality of generated text by measuring how well it can answer questions related to the content.
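A hedged sketch of the idea: derive question-answer pairs from the source content, answer each question using only the generated text, and score the agreement. The answer_from_text helper is a hypothetical stand-in for a QA model or LLM call.

```python
def qag_score(generated_text: str,
              reference_qa: list[tuple[str, str]],
              answer_from_text) -> float:
    """reference_qa holds (question, expected_answer) pairs from the source."""
    correct = 0
    for question, expected in reference_qa:
        predicted = answer_from_text(generated_text, question)
        correct += predicted.strip().lower() == expected.strip().lower()
    return correct / max(len(reference_qa), 1)
```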
Defining metrics for LLM behavior, such as hallucinations, is not straightforward. You may need to monitor user-experience metrics to understand:
Automated metrics are unlikely to capture language nuances and user sentiment just by reading a few user interactions. Humans are better at understanding sarcasm, humor, and subtle shifts in tone that can indicate user frustration or dissatisfaction.
You may need a human evaluator (Human in the Loop) to determine how well your LLM meets user expectations. Of course, this approach has drawbacks: it's not scalable, it's prone to inconsistency, and it's not reproducible because subjective human judgments vary.
One solution is to engineer monitoring and observability capabilities into the LLM application itself. While an LLM used as an evaluator can also be inconsistent and prone to erroneous judgments, a systematic approach lets the application monitor model behavior in real time within its own pipeline:
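One possible shape for such in-pipeline monitoring is sketched below; the evaluators, thresholds, and logging setup are assumptions to adapt to your own stack.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_monitor")


def monitored_inference(prompt: str, generate, consistency_check) -> str:
    """Wrap each inference call, record basic signals, and flag responses
    for review. `generate` and `consistency_check` are placeholders for
    your model call and whatever evaluator (self-check, judge model) you use."""
    start = time.time()
    response = generate(prompt)
    latency = time.time() - start
    score = consistency_check(prompt, response)

    log.info("latency=%.2fs consistency=%.2f prompt_len=%d",
             latency, score, len(prompt))
    if score < 0.5:  # assumed threshold
        log.warning("Low-consistency response flagged for review")
    return response
```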
LLM monitoring for hallucinations and behavioral performance remains an important research problem. While industry and academic researchers continue to find ways to improve LLM behavior, real-time monitoring of that behavior is still largely unsolved. It is virtually impossible to fully evaluate, interpret, and explain the behavior of black-box systems against specific user prompts.
You can, however, mitigate the risk of harmful, toxic, and discriminatory output by engineering controls within your LLM application. For example:
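As one hedged illustration (not an exhaustive control), a simple output guardrail might screen responses before they reach the user. Production systems typically rely on dedicated moderation models or services rather than keyword matching alone.

```python
# The blocked-term list and refusal message below are placeholder assumptions.
BLOCKED_TERMS = {"example_slur", "example_sensitive_topic"}


def apply_output_controls(response: str) -> str:
    """Return the model response, or a safe refusal if it trips the filter."""
    lowered = response.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I can't share that. Please rephrase your request."
    return response
```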
Effective LLM monitoring is crucial for ensuring the responsible and reliable deployment of these powerful models. While challenges remain in fully understanding and controlling LLM behavior, ongoing research and the development of robust monitoring techniques are paving the way for safer and more trustworthy AI systems. The future of LLM monitoring lies in a combination of objective metrics, subjective evaluations, and proactive risk mitigation strategies.