If you’ve been developing LLM-based applications, you’ve most likely come across Retrieval-Augmented Generation (RAG) as a go-to architecture that promises factual grounding, domain adaptation, and reduced hallucinations, without the need to retrain or fine-tune the LLM. And if you’ve tried to evaluate a RAG application, chances are you’ve encountered tools like RAGAS or DeepEval that offer metrics like faithfulness, context precision, or answer relevancy. But what do these metrics really tell us? What do they miss? And most importantly, how should we use them to make reliable, informed decisions about our system’s performance?
In this post I will walk you through six popular RAG evaluation metrics, explaining what they mean, what they don’t, and when to use each of them.
How RAG works
At its core, Retrieval-Augmented Generation (RAG) attempts to improve the performance of an LLM by giving it access to external knowledge sources at inference time. Instead of relying solely on what the model "knows" from its training data (which may be outdated, incomplete, or domain-agnostic), RAG lets the model retrieve relevant information from a knowledge base or document corpus and generate responses based on that retrieved content.
Now, there are several variations of RAG architectures but, in its simplest form, a RAG pipeline works in three phases:
Phase 1 - Retrieval: The system takes the user's query and uses it to retrieve relevant documents or chunks from a pre-indexed knowledge base. This is typically done using vector retrieval, where both the query and the documents are embedded in the same vector space using an embedding model, and similarity is computed via metrics like cosine similarity.
Phase 2 - Augmentation: The top-k retrieved documents are appended or inserted into the original query, creating an enriched input that gives the language model access to external, query-specific knowledge at inference time.
Phase 3 - Generation: The enriched input is passed to the LLM which produces a response that (ideally) draws from both the user’s query and the retrieved context.
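To make the three phases concrete, here is a minimal sketch in Python. The embed and generate functions, as well as the in-memory list of chunks, are hypothetical placeholders for whatever embedding model, LLM, and vector index you actually use; this is a toy illustration of the flow, not a production implementation.

```python
import numpy as np

# Hypothetical placeholders for the embedding model and LLM you actually use.
def embed(text: str) -> np.ndarray: ...
def generate(prompt: str) -> str: ...

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query: str, chunks: list[str], k: int = 3) -> str:
    # Phase 1 - Retrieval: rank the indexed chunks by similarity to the query.
    # (In a real system the chunk embeddings would be pre-computed and stored in a vector index.)
    q_vec = embed(query)
    top_k = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:k]

    # Phase 2 - Augmentation: insert the top-k chunks into the prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(top_k) +
        f"\n\nQuestion: {query}\nAnswer:"
    )

    # Phase 3 - Generation: let the LLM produce a (hopefully grounded) response.
    return generate(prompt)
```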
RAG Evaluation Metrics
As with most systems, a RAG application can be evaluated in an end-to-end fashion, looking only at its inputs (user queries) and outputs (generated responses). This kind of evaluation gives a high-level view of system performance that is useful for benchmarking and comparing variants.
But when the system doesn’t perform as expected, an end-to-end evaluation isn’t enough. We need to diagnose what’s going wrong, identify which components are underperforming, and understand why. For example, is the output wrong because the retriever is failing to surface the right context, or because the generator is hallucinating and ignoring the provided evidence?
To support both macro-level and diagnostic evaluation, RAG evaluation toolkits offer a number of metrics, each addressing a different evaluation aspect or concern. These metrics are relatively easy to compute, but also easy to misunderstand or misapply. Let’s take a closer look at some of them.
Answer Correctness
Answer correctness measures whether the generated response is factually accurate and complete, based on a ground truth answer. It is a classic end-to-end quality metric that is calculated by comparing the response with the ground truth, using human judgments (typically via rating scales or binary “correct”/”incorrect” assessments) or automated methods like LLM-based judgments. The latter approach, often referred to as LLM-as-a-Judge, is increasingly used in automated evaluation pipelines for its speed and scalability. Whether it is reliably effective is a separate discussion that I’ll address in future posts.
A high correctness score means the answer matches the expected output, while a low correctness score indicates that the system failed to produce the correct answer. In both cases, the metric does not tell us the reason behind the success or failure (e.g., hallucination, bad retrieval, etc.).
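For illustration, here is a minimal LLM-as-a-Judge sketch for answer correctness. The call_llm function is a hypothetical placeholder for your judge model, and the prompt wording is just one possible rubric, not the one any particular toolkit uses.

```python
# Hypothetical placeholder for a call to whatever judge LLM you use.
def call_llm(prompt: str) -> str: ...

def answer_correctness(question: str, answer: str, ground_truth: str) -> float:
    """Ask a judge LLM to rate accuracy and completeness against a reference answer."""
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"System answer: {answer}\n"
        "On a scale from 0 to 1, how factually accurate and complete is the system answer "
        "with respect to the reference? Reply with a single number."
    )
    return float(call_llm(prompt).strip())
```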
Answer Relevancy
Answer relevancy measures how well the generated response addresses the user query. It’s an end-to-end metric, trying to capture the output’s usefulness and alignment with intent, regardless of how it was generated. Similarly to answer correctness, it can be computed using human judgments and/or LLM-based judgments.
A low answer relevancy score indicates that the answer is off-topic, incomplete, or fails to address the query in a meaningful way. On the other hand, a high relevancy score indicates that the answer is on-topic and aligned with the user’s intent, without, however, guaranteeing that the answer is factually correct or that it is actually based on the provided context. As such, this metric should mainly be used as a signal that something is wrong with the system when it’s low, but not as evidence that everything is working well when it’s high.
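One way to automate answer relevancy without a grading rubric is to ask an LLM to generate questions that the given answer could be answering and then compare them to the actual query in embedding space (roughly the approach RAGAS documents, if I recall correctly). A minimal sketch, with call_llm and embed as hypothetical placeholders:

```python
import numpy as np

# Hypothetical placeholders for an LLM call and an embedding model.
def call_llm(prompt: str) -> str: ...
def embed(text: str) -> np.ndarray: ...

def answer_relevancy(question: str, answer: str, n: int = 3) -> float:
    """Generate questions the answer could be answering and compare them to the actual query."""
    generated = [
        q for q in call_llm(
            f"Write {n} questions that the following answer could be answering, "
            f"one per line.\n\nAnswer: {answer}"
        ).splitlines() if q.strip()
    ]
    q_vec = embed(question)
    sims = []
    for g in generated:
        g_vec = embed(g)
        sims.append(float(np.dot(q_vec, g_vec) /
                          (np.linalg.norm(q_vec) * np.linalg.norm(g_vec))))
    return sum(sims) / len(sims) if sims else 0.0
```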
Faithfulness
Faithfulness is a metric that measures whether the generated answer is factually consistent with the retrieved context, i.e., whether the LLM is actually using the information it was given, and not making things up. It’s one of the most important metrics in RAG systems, because RAG is specifically designed to ground the LLM’s output in external knowledge, so if the LLM produces an answer that cannot be supported by the retrieved documents, it may be hallucinating.
Faithfulness is typically assessed by comparing the generated answer to the retrieved context. As with answer relevancy, the comparison can be done either manually, by human annotators, or automatically, using LLM-as-a-Judge methods. In the human setup, annotators are asked to determine whether the answer is fully supported, partially supported, or unsupported by the retrieved documents. In the LLM-based approach, a prompt instructs the model to make the same judgment.
In terms of interpretation, a low faithfulness score indicates that the answer contains information not found in the retrieved context, which is a strong signal of hallucination or over-generation. On the other hand, a high faithfulness score means that the answer is well-grounded in the retrieved context, but it does not guarantee that the context itself is relevant or sufficient. In other words, an answer can be faithful but still wrong.
This means that the primary utility of faithfulness is to assess the effectiveness of the RAG system’s generation component and determine whether it is to blame for wrong answers the system gives. If the system gives bad answers and faithfulness is high, the problem most likely lies in the retrieved context rather than the generator.
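Automated implementations of faithfulness often decompose the answer into individual claims and verify each one against the retrieved context, reporting the fraction of supported claims. A minimal sketch of that idea, with call_llm again as a hypothetical placeholder:

```python
# Hypothetical placeholder for a call to a judge LLM.
def call_llm(prompt: str) -> str: ...

def faithfulness(answer: str, retrieved_context: list[str]) -> float:
    """Fraction of claims in the answer that are supported by the retrieved context."""
    claims = [
        c for c in call_llm(
            "Break the following answer into short, standalone factual claims, "
            f"one per line.\n\nAnswer: {answer}"
        ).splitlines() if c.strip()
    ]
    context = "\n".join(retrieved_context)
    supported = 0
    for claim in claims:
        verdict = call_llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Reply YES or NO."
        )
        supported += verdict.strip().upper().startswith("YES")
    return supported / len(claims) if claims else 1.0
```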
Contextual Relevancy
Contextual relevancy assesses how relevant the retrieved documents are to the original input query, independently of the generated answer. It focuses on the topical relevance of the retrieved context to the question being asked. This metric is important because even if an answer is faithful or well-formulated, poor contextual relevancy can signal that the retriever is misaligned with the user’s intent, retrieving vaguely related or off-topic information.
To compute contextual relevancy, we compare each retrieved document against the original query, rather than the answer. As with the previous metrics, this comparison can be done by humans, LLMs, or other semantic similarity methods.
A low contextual relevancy score suggests that the retriever is bringing in off-topic or noisy content, even if it’s factually accurate or well-structured. A high score, on the other hand, indicates that the context aligns well with the user's question, increasing the likelihood of a high-quality, relevant answer.
Nevertheless, a relevant context is not necessarily a useful one. Therefore, contextual relevancy should be used to assess the semantic quality of retrieval in relation to the user’s intent, and to diagnose misalignments between the query and the knowledge base.
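A minimal sketch of the LLM-based variant, with call_llm as a hypothetical placeholder; here the score is simply the fraction of retrieved chunks that the judge considers relevant to the query:

```python
# Hypothetical placeholder for a call to a judge LLM.
def call_llm(prompt: str) -> str: ...

def contextual_relevancy(question: str, retrieved_context: list[str]) -> float:
    """Fraction of retrieved chunks that are topically relevant to the query."""
    relevant = 0
    for chunk in retrieved_context:
        verdict = call_llm(
            f"Question: {question}\n\nPassage: {chunk}\n"
            "Is the passage relevant to answering the question? Reply YES or NO."
        )
        relevant += verdict.strip().upper().startswith("YES")
    return relevant / len(retrieved_context) if retrieved_context else 0.0
```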
Contextual Recall
Contextual recall measures the quality of the RAG pipeline's retriever by evaluating the extent to which the retrieved context includes all the information needed to produce the expected output. While faithfulness asks “Did the model stay true to the context it was given?”, contextual recall asks “Did the context contain the information needed in the first place?”
To compute contextual recall, we need a ground truth dataset that maps each input query to one or more expected outputs (typically, human-written answers or reference facts). Given such ground truth, the metric is calculated by checking whether the expected outputs (or their key components) appear in the retrieved documents.
A low contextual recall score suggests that the retriever failed to surface key information, while a high contextual recall score means that the context included the necessary information, making it more likely that the answer could have been correct if generation worked properly. As such, this metric is especially useful when the system gives wrong or incomplete answers and we want to determine if the retriever and/or the knowledge base is at fault.
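A sketch of one common way to automate contextual recall: break the ground truth answer into statements and check how many of them can be attributed to the retrieved context. It is essentially the mirror image of the faithfulness sketch above, with the ground truth playing the role of the answer; call_llm is again a hypothetical placeholder.

```python
# Hypothetical placeholder for a call to a judge LLM.
def call_llm(prompt: str) -> str: ...

def contextual_recall(expected_output: str, retrieved_context: list[str]) -> float:
    """Fraction of ground-truth statements that can be found in the retrieved context."""
    statements = [
        s for s in call_llm(
            "Break the following reference answer into short factual statements, "
            f"one per line.\n\nReference answer: {expected_output}"
        ).splitlines() if s.strip()
    ]
    context = "\n".join(retrieved_context)
    found = 0
    for statement in statements:
        verdict = call_llm(
            f"Context:\n{context}\n\nStatement: {statement}\n"
            "Can the statement be attributed to the context? Reply YES or NO."
        )
        found += verdict.strip().upper().startswith("YES")
    return found / len(statements) if statements else 0.0
```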
Contextual Precision
Contextual precision measures the proportion of retrieved content that is actually relevant to answering the query. While contextual recall tells us whether the right information was included, contextual precision tells us whether the retrieved content is focused or cluttered with irrelevant or distracting information.
To compute contextual precision, we again need ground truth answers for the input queries. These answers are used to examine the retrieved documents and determine which ones contain information that meaningfully contributes to answering the question, typically using human annotators or LLMs acting as judges.
A low contextual precision score means the retriever is returning a lot of irrelevant or distracting content, which may confuse or mislead the generator. A high score, in turn, suggests the retriever is focused and selective, returning mostly useful information. As such, this metric is particularly helpful when diagnosing retrievers that return too many irrelevant documents, which can overwhelm the generator and reduce the overall system quality.
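Here is a sketch of one way to compute contextual precision, again with call_llm as a hypothetical placeholder. Some implementations also take ranking into account, so that relevant chunks retrieved earlier count more; the sketch below follows that idea by averaging precision@k over the positions of the relevant chunks.

```python
# Hypothetical placeholder for a call to a judge LLM.
def call_llm(prompt: str) -> str: ...

def contextual_precision(question: str, ground_truth: str,
                         retrieved_context: list[str]) -> float:
    """Rank-aware precision: relevant chunks retrieved earlier weigh more."""
    # Judge each chunk: does it help produce the reference answer for this question?
    verdicts = []
    for chunk in retrieved_context:
        verdict = call_llm(
            f"Question: {question}\nReference answer: {ground_truth}\n\n"
            f"Passage: {chunk}\n"
            "Does the passage contain information useful for producing the reference "
            "answer? Reply YES or NO."
        )
        verdicts.append(verdict.strip().upper().startswith("YES"))

    # Average precision@k over the positions of the relevant chunks.
    precisions, relevant_so_far = [], 0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```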
Putting it all together
It should be clear by now that each metric offers a narrow view into a specific component or failure mode of a RAG system. But when used in combination, these metrics can provide a much more complete diagnostic picture, helping you understand not just that something is wrong, but where and why. Here are some examples:
High Answer Relevancy + Low Faithfulness: The generated response appears aligned with the user’s question and sounds convincing, but it is not actually supported by the retrieved context. This typically indicates hallucination in the generation step, where the model produces plausible-sounding content that isn’t grounded in the available evidence.
Low Answer Relevancy + High Contextual Recall: The system successfully retrieved useful information, but the generated answer fails to address the user’s query. In this case, the problem likely lies in the generation step: the model had access to the necessary context but didn’t use it effectively to produce a relevant or meaningful response.
Low Faithfulness + Low Contextual Recall: The system retrieved the wrong or insufficient context and then generated an answer that is not grounded in that context, effectively hallucinating the response. In this case, the likely cause is a retrieval failure, followed by an ungrounded or fabricated generation. The system never had access to the necessary information, and the LLM attempted to fill in the gaps on its own.
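As a toy illustration of how such combinations can be operationalized, here is a sketch of a rule-based triage function. The thresholds (0.7 and 0.5) are arbitrary placeholders that you would need to calibrate on your own data, and the rules simply mirror the three patterns above.

```python
def diagnose(answer_relevancy: float, faithfulness: float, contextual_recall: float,
             high: float = 0.7, low: float = 0.5) -> str:
    """Map a few metric scores to a likely failure mode (thresholds are placeholders)."""
    if answer_relevancy >= high and faithfulness < low:
        return "Likely hallucination: convincing answer not grounded in the retrieved context."
    if answer_relevancy < low and contextual_recall >= high:
        return "Likely generation failure: the necessary context was retrieved but not used."
    if faithfulness < low and contextual_recall < low:
        return "Likely retrieval failure followed by ungrounded generation."
    return "No obvious single failure mode; inspect individual examples manually."
```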
Are there any other evaluation metrics or patterns you’ve found particularly useful (or not) in your own work with RAG applications? If so, please share them in the comments.
Thank you for reading, till next time!
Panos
News and Updates
On September 16th and 17th, 2025, I will be teaching the 7th edition of my live online course “Knowledge Graphs and Large Language Models Bootcamp” on the O’Reilly Learning Platform. You can see more details and register here. The course is free if you are already subscribed to the platform, and you can also take advantage of a 10-day free trial period.
On September 24th, 25th and 26th, 2025, I will be attending the PyData Amsterdam conference, where I will also be giving a masterclass titled “Grounding LLMs on Solid Knowledge: Assessing and Improving Knowledge Graph Quality in GraphRAG Applications” on the 24th. You can see more details and register here. If you are attending the conference and would like to grab a coffee and chat about data and AI, please contact me.
On October 21st and 22nd, 2025, I will be teaching the 1st edition of my live online course “AI Evaluations Bootcamp” on the O’Reilly Learning Platform. You can see more details and register here. The course is free if you are already subscribed to the platform, and you can also take advantage of a 10-day free trial period.
As I am writing my new book on Evaluating AI Systems, I am on the hunt for “war stories” about AI evaluation. If you have such stories, use cases, techniques, tools, or lessons you would like to share, I’d love to hear from you.
If you are interested in the field of semantic data modeling and knowledge graphs, my book Semantic Modeling for Data - Avoiding Pitfalls and Breaking Dilemmas remains available at O’Reilly, Amazon, and most major bookstores. Also, if you've already read it and have thoughts, I’d really appreciate it if you left a rating or review; it helps others discover the book and join the conversation.