As Large Language Models (LLMs) are increasingly deployed in various applications, the need to monitor their performance and behavior becomes critical. LLM monitoring is the practice of evaluating the performance of these models, their underlying technology stack, and the applications they power. This evaluation relies on a range of metrics designed to:
Of the many types of IT monitoring, LLM monitoring involves continuously analyzing and understanding the performance and behavior of the LLM application and its technology stack. It's about ensuring that these powerful tools operate as intended — without compromising security, cost-effectiveness, or ethical considerations.
(Related reading: LLMs vs SLMs and top LLMs to use today.)
When deploying an LLM application in a production environment, your models remain vulnerable to sophisticated attacks such as black-box adversarial attacks and indirect prompt injection. These attacks can lead to undesirable LLM behaviors, including hallucinations and excessive resource consumption.
Furthermore, they pose security risks to end-users who may act on falsified information generated by a compromised LLM tool.
Consider that any foundation model is trained on vast amounts of publicly available data. This training data may contain biases, either favoring or disfavoring specific entities. Since deep learning models are inherently "black boxes," their decision-making processes are often opaque. This lack of explainability means you can't easily determine:
As a result, LLMs may hallucinate and generate false information during inference.
A seemingly simple prompt like "keep expanding this list in 2000 variations" can trigger continuous and excessive compute utilization. LLMs will attempt to fulfill such requests until they reach their maximum token limit. A distributed botnet could exploit this by overwhelming your LLM application with similar recursive inference requests, leading to a Denial of Service (DoS) attack.
These compute-intensive prompts not only drive up API billing costs, but also render the application unusable for legitimate users.
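As a hedged illustration, the sketch below caps completion length and applies a simple per-client rate limit before any request reaches the model. The constants and the in-memory sliding window are assumptions; in production you would replace them with your own limits and shared infrastructure.

```python
import time
from collections import defaultdict

# Assumed limits; tune these to your model, cost constraints, and traffic.
MAX_COMPLETION_TOKENS = 512
REQUESTS_PER_MINUTE = 20

_request_log = defaultdict(list)  # client_id -> timestamps of recent requests


def allow_request(client_id: str) -> bool:
    """Simple sliding-window rate limit per client."""
    now = time.time()
    recent = [t for t in _request_log[client_id] if now - t < 60]
    _request_log[client_id] = recent
    if len(recent) >= REQUESTS_PER_MINUTE:
        return False
    _request_log[client_id].append(now)
    return True


def build_inference_params(prompt: str) -> dict:
    """Always cap completion length so a single prompt cannot run
    all the way to the model's maximum token limit."""
    return {"prompt": prompt, "max_tokens": MAX_COMPLETION_TOKENS}
```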
To effectively monitor the behavior and performance of your LLM application, consider the following metrics and evaluation criteria, which cover model quality as well as model behavior.
Perplexity measures how well a language model predicts a sample of text. It's based on the concept of probability: a lower perplexity score indicates that the model is more confident in its predictions and, therefore, performs better.
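For a concrete (if simplified) view, perplexity can be computed directly from per-token log probabilities. The minimal sketch below assumes natural-log probabilities, as returned by your model or API.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower is better.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)


# Example: a fairly confident 4-token prediction
print(perplexity([-0.2, -0.5, -0.1, -0.3]))  # ~1.32
```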
Accuracy measures the correctness of LLM outputs in specific tasks. The way it is measured can vary depending on the task:
Factuality assesses whether the information generated by an LLM is accurate and consistent with real-world knowledge. This is particularly important for information-seeking tasks. Factuality can be assessed using knowledge bases, human verification, or automated tools.
Internal metrics are specific to your organization and the tasks your LLM is designed for. Examples include:
External benchmarks are standardized datasets and evaluation procedures used to compare the performance of different LLMs. Holistic Evaluation of Language Models (HELM) is a comprehensive benchmark for evaluating language models across a wide range of scenarios and metrics. LegalBench is more specific: evaluating legal reasoning in English-language LLMs.
Bilingual Evaluation Understudy Score (BLEU) measures the quality of machine-translated text by comparing it to one or more reference translations. It assesses the overlap of n-grams (sequences of n words) between the generated text and the reference text. Higher BLEU scores indicate better translation quality.
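As a minimal sketch using NLTK's implementation (assuming the nltk package is installed), BLEU can be computed for a single sentence pair like this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of reference token lists
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```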
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of a generated summary by comparing it to a reference summary. It focuses on recall, i.e., how much of the important information from the reference summary is captured in the generated summary.
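A minimal sketch using the rouge-score package (an assumption about your tooling; other ROUGE implementations exist):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",        # reference summary
    "a cat was sitting on the mat",  # generated summary
)
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)
```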
BERTScore evaluates the semantic similarity between generated text and reference text using BERT embeddings. It captures more nuanced semantic relationships than simple n-gram overlap metrics like BLEU and ROUGE.
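A minimal sketch using the bert-score package (assumed to be installed; it downloads model weights on first use):

```python
from bert_score import score

candidates = ["The weather is cold today."]
references = ["It is freezing outside today."]

# lang="en" selects a default English model under the hood.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```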
G-Eval is a framework that uses LLMs to evaluate other LLMs based on a set of guidelines and criteria. G-Eval leverages the reasoning capabilities of LLMs to provide more accurate and comprehensive evaluations.
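A hedged sketch of the LLM-as-judge pattern is shown below; call_llm, the criteria, and the 1-to-5 scale are illustrative assumptions, not the official G-Eval prompt or rubric.

```python
# call_llm is a hypothetical helper standing in for whatever client your
# stack uses (an API SDK, a local model, etc.).

JUDGE_PROMPT = """You are evaluating a model answer for coherence and
factual consistency with the source text.

Source: {source}
Answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."""


def evaluate_answer(source: str, answer: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(source=source, answer=answer))
    # A real implementation would parse the reply more defensively.
    return int(reply.strip())
```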
Bias and fairness metrics quantify the presence of bias in LLM outputs:
SelfCheckGPT is a method for evaluating the factual consistency of generated text by prompting the LLM to self-check its claims and provide evidence for its statements.
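The core intuition (factual claims should reappear consistently across independently sampled answers) can be sketched as follows. This is a rough approximation: the published SelfCheckGPT variants use NLI, QA, or embedding-based checks, while plain word overlap is used here only to keep the example self-contained.

```python
def consistency_score(claim: str, samples: list[str]) -> float:
    """Fraction of independently sampled answers (temperature > 0) that
    share most of the claim's content words."""
    claim_words = set(claim.lower().split())
    agreements = 0
    for sample in samples:
        overlap = len(claim_words & set(sample.lower().split())) / max(len(claim_words), 1)
        agreements += overlap > 0.5           # assumed agreement threshold
    return agreements / max(len(samples), 1)  # low scores flag possible hallucinations
```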
Question-Answer Generation (QAG) is an evaluation approach that assesses the quality of generated text by measuring how well it can answer questions related to the content.
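A hedged sketch of the idea: derive question-answer pairs from the source content, answer each question using only the generated text, and score the agreement. The answer_from_text helper is a hypothetical stand-in for a QA model or LLM call.

```python
def qag_score(generated_text: str,
              reference_qa: list[tuple[str, str]],
              answer_from_text) -> float:
    """reference_qa holds (question, expected_answer) pairs from the source."""
    correct = 0
    for question, expected in reference_qa:
        predicted = answer_from_text(generated_text, question)
        correct += predicted.strip().lower() == expected.strip().lower()
    return correct / max(len(reference_qa), 1)
```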
Defining metrics for LLM behavior, such as hallucinations, is not straightforward. You may need to monitor user-experience metrics to understand:
Automated metrics are unlikely to capture language nuances and user sentiment just by reading a few user interactions. Humans are better at understanding sarcasm, humor, and subtle shifts in tone that can indicate user frustration or dissatisfaction.
You may need a human evaluator (Human in the Loop) to determine how well your LLM meets user expectations. Of course, this approach has drawbacks: it's not scalable, it's prone to inconsistency, and it's not reproducible because subjective human judgments vary.
One solution is to engineer monitoring and observability capabilities into the LLM application itself. While an LLM used as an evaluator can also be inconsistent and prone to erroneous judgments, a systematic approach lets the application monitor model behavior in real time within its own pipeline:
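One possible shape for such in-pipeline monitoring is sketched below; the evaluators, thresholds, and logging setup are assumptions to adapt to your own stack.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_monitor")


def monitored_inference(prompt: str, generate, consistency_check) -> str:
    """Wrap each inference call, record basic signals, and flag responses
    for review. `generate` and `consistency_check` are placeholders for
    your model call and whatever evaluator (self-check, judge model) you use."""
    start = time.time()
    response = generate(prompt)
    latency = time.time() - start
    score = consistency_check(prompt, response)

    log.info("latency=%.2fs consistency=%.2f prompt_len=%d",
             latency, score, len(prompt))
    if score < 0.5:  # assumed threshold
        log.warning("Low-consistency response flagged for review")
    return response
```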
LLM monitoring for hallucinations and behavioral performance remains an important research problem. While industry and academic researchers continue to find ways to improve LLM behavior, real-time monitoring of that behavior is still largely unsolved. It is virtually impossible to fully evaluate, interpret, and explain the behavior of black-box systems against specific user prompts.
You can, however, mitigate the risk of harmful, toxic, and discriminatory output by engineering controls within your LLM application. For example:
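As one hedged illustration (not an exhaustive control), a simple output guardrail might screen responses before they reach the user. Production systems typically rely on dedicated moderation models or services rather than keyword matching alone.

```python
# The blocked-term list and refusal message below are placeholder assumptions.
BLOCKED_TERMS = {"example_slur", "example_sensitive_topic"}


def apply_output_controls(response: str) -> str:
    """Return the model response, or a safe refusal if it trips the filter."""
    lowered = response.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I can't share that. Please rephrase your request."
    return response
```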
Effective LLM monitoring is crucial for ensuring the responsible and reliable deployment of these powerful models. While challenges remain in fully understanding and controlling LLM behavior, ongoing research and the development of robust monitoring techniques are paving the way for safer and more trustworthy AI systems. The future of LLM monitoring lies in a combination of objective metrics, subjective evaluations, and proactive risk mitigation strategies.