
Beyond Traditional Metrics - Evaluating the Generation Stage of RAG Systems


David Kirchhoff

Software Engineer

In the previous post, we explored the evaluation of the retrieval stage in RAG systems. In this post, we focus on evaluating the generation stage. We discuss three key quality aspects that, taken together, provide a holistic evaluation of the entire RAG pipeline: faithfulness, answer relevance, and context relevance. We then address the data bottleneck in evaluation, in particular the reliance on labeled data, and see how modern frameworks compute these metrics with limited or no human labels.

Quality Aspects for RAG Evaluation

When evaluating the generation stage of Retrieval-Augmented Generation (RAG) systems, three key quality aspects must be considered: faithfulness, answer relevance, and context relevance [1].

Faithfulness

Faithfulness in RAG systems refers to the degree to which the generated content accurately reflects the information present in the retrieved documents. It is well-documented that language models can generate inaccurate or fabricated information [2]. Ensuring faithfulness is crucial for maintaining the trustworthiness and reliability of the system.

Answer Relevance

Answer relevance measures how directly the generated answer addresses the question, penalizing answers that contain redundant information or are incomplete [1]. A highly relevant answer should directly respond to the query without introducing extraneous details or digressing from the topic. This is crucial because users expect concise and accurate responses that precisely address their inquiries.

Context Relevance

Context relevance refers to how relevant the retrieved context is to the query and generated answer, penalizing redundant or irrelevant information in the context [1]. This aspect is vital to ensure that the documents retrieved by the system genuinely support the generated responses and do not introduce noise. Proper contextual information helps the model generate more accurate and reliable answers, as it can draw from relevant details directly related to the user's query.
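To make these three aspects concrete, the sketch below scores a single RAG output with an LLM acting as judge. This is a minimal illustration, not the scoring procedure of any particular framework: the `complete()` helper is a placeholder for whatever LLM client you use, and the prompt wording and 0-to-1 scoring scheme are assumptions.

```python
# Minimal sketch of LLM-as-judge scoring for faithfulness, answer relevance,
# and context relevance. `complete(prompt)` is a hypothetical helper that
# sends a prompt to a chat/completion LLM and returns its text response.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

def judge_score(prompt: str) -> float:
    """Ask the LLM for a 0-1 score and parse its reply defensively."""
    reply = complete(prompt)
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0

def evaluate_generation(question: str, contexts: list[str], answer: str) -> dict:
    ctx = "\n".join(contexts)
    return {
        "faithfulness": judge_score(
            f"Context:\n{ctx}\n\nAnswer:\n{answer}\n\n"
            "What fraction of the claims in the answer is supported by the "
            "context? Reply with a single number between 0 and 1."
        ),
        "answer_relevance": judge_score(
            f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
            "How directly does the answer address the question, without "
            "redundancy or omissions? Reply with a single number between 0 and 1."
        ),
        "context_relevance": judge_score(
            f"Question:\n{question}\n\nContext:\n{ctx}\n\n"
            "What fraction of the context is actually needed to answer the "
            "question? Reply with a single number between 0 and 1."
        ),
    }
```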

Metrics for Evaluating Generation

Metrics from machine translation and summarization, as well as other language tasks, have long been used to evaluate the generation stage of RAG systems. While valuable, they have limitations, particularly when applied to the complex task of evaluating retrieval-augmented generation [1].

Traditional Metrics

Traditional metrics like BLEU, ROUGE, precision, and recall have been discussed in detail in my previous posts (see here and here). Briefly, BLEU focuses on precision by measuring the overlap of n-grams between the generated text and reference texts, while ROUGE emphasizes recall, assessing how much of the content in the reference texts is captured by the generated text. Precision and recall themselves offer insights into the correctness and completeness of the generated answers. Despite their utility, these metrics fall short in capturing the nuanced requirements of retrieval-augmented generation, such as context relevance and factual accuracy. Additionally, they rely heavily on ground-truth references which are hard to generate [3], [4].
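As a quick illustration, the snippet below scores a generated answer against a reference answer with BLEU and ROUGE using the sacrebleu and rouge-score packages. The example sentences are made up, and exact scores may differ slightly between package versions.

```python
# Hedged example: BLEU via sacrebleu and ROUGE via rouge-score.
import sacrebleu
from rouge_score import rouge_scorer

reference = "Paris is the capital of France."
generated = "The capital of France is Paris."

# Corpus-level BLEU over a single sentence pair (reported on a 0-100 scale).
bleu = sacrebleu.corpus_bleu([generated], [[reference]])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 and ROUGE-L report n-gram/longest-common-subsequence overlap
# as precision, recall, and F-measure.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)
for name, s in rouge.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```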

Perplexity

Perplexity measures how well a probability model predicts a sample, with lower perplexity indicating that the model assigns higher probability to the observed text [5]. While useful for assessing the fluency of generated text, perplexity does not adequately capture the semantic correctness or relevance required in RAG systems [6].
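For completeness, here is a minimal sketch of computing perplexity with a pretrained causal language model via Hugging Face transformers. GPT-2 is used purely as a convenient example; any causal LM can be substituted.

```python
# Rough sketch: perplexity of a text under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Retrieval-augmented generation grounds answers in retrieved documents."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the average
    # token-level cross-entropy; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```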

METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is another metric that aims to address some of BLEU's limitations by considering synonyms, stemming, and paraphrasing, making it more flexible in evaluating text generation [7]. However, like BLEU and ROUGE, METEOR still falls short in fully capturing the faithfulness and contextual relevance of generated text.
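A small example using NLTK's METEOR implementation is shown below. It assumes a recent NLTK release, which expects pre-tokenized input and requires the WordNet data to be downloaded once.

```python
# Hedged METEOR example using NLTK (recent versions expect token lists).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = "Paris is the capital of France.".split()
generated = "The capital city of France is Paris.".split()

# METEOR aligns unigrams using exact matches, stems, and WordNet synonyms.
score = meteor_score([reference], generated)
print(f"METEOR: {score:.2f}")
```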

Modern Metrics

As we move beyond traditional metrics, more nuanced criteria that capture the complexities of both retrieval and generation are essential. Metrics like BERTScore use contextual embeddings to compare the generated answer with the expected answer, offering a more sophisticated evaluation of semantic similarity [8]. This metric is particularly useful for assessing answer relevance and ensuring the generated text aligns closely with the query's intent.
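The snippet below shows a minimal BERTScore computation with the bert-score package. The first call downloads a pretrained model (roberta-large by default for English), and the example sentences are again made up.

```python
# Hedged BERTScore example using the bert-score package.
from bert_score import score

candidates = ["The capital of France is Paris."]
references = ["Paris is the capital of France."]

# Precision, recall, and F1 come from greedy matching of contextual token
# embeddings rather than exact n-gram overlap.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```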

Integrating human evaluation remains crucial for a comprehensive assessment. Automated metrics often miss the subtleties and context-dependent nuances that human evaluators can identify [4], [9]. Hybrid approaches that combine automated metrics with human judgment ensure a more robust and holistic evaluation, capturing both quantitative and qualitative aspects of the generated content.

In summary, while traditional metrics provide a foundation for evaluating text generation, they are insufficient for the specific needs of RAG systems. Emphasizing faithfulness, answer relevance, and context relevance through advanced metrics and hybrid evaluation approaches leads to a more accurate assessment of RAG-generated content. However, a significant challenge in the evaluation process is the data bottleneck, as many discussed metrics and evaluation techniques rely on extensive labeled datasets, which are often difficult and costly to obtain.

Addressing the Data Bottleneck

The data bottleneck is a major challenge in the evaluation of RAG systems. Traditional evaluation methods often depend on large, annotated datasets to provide ground truth references for assessing the generated content. Creating these datasets is not only time-consuming and expensive but also impractical for many real-world applications where data is constantly evolving or where domain-specific expertise is required for annotation.

Modern approaches reduce the need for extensive labeled data. Frameworks like RAGAs [1] and ARES [10] utilize large language models (LLMs) to generate synthetic evaluation data, creating high-quality data that mimics human-annotated examples and significantly reducing the need for manual labeling. These frameworks incorporate ways to assess the quality of both retrieval and generation stages, ensuring a comprehensive evaluation without extensive human intervention. This automated approach streamlines the evaluation process and allows for continuous improvement and adaptation as new data and tasks emerge.
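The following sketch illustrates the general idea of LLM-generated evaluation data, independent of any specific framework: an LLM turns each document chunk into a question-answer pair, with the chunk kept as ground-truth context. The `complete()` helper and the prompt wording are assumptions, and real pipelines typically add filtering and deduplication on top.

```python
# Illustrative sketch of synthesizing evaluation data from a document corpus.
import json

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

def synthesize_qa(chunk: str) -> dict:
    """Turn one document chunk into a (question, answer, context) record."""
    prompt = (
        "Read the passage below and write one question that it answers, "
        "plus the answer, as JSON with keys 'question' and 'answer'.\n\n"
        f"Passage:\n{chunk}"
    )
    record = json.loads(complete(prompt))
    record["context"] = chunk  # keep the source chunk as ground-truth context
    return record

corpus_chunks = ["..."]  # your document chunks
eval_set = [synthesize_qa(c) for c in corpus_chunks]
```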

Frameworks

Several frameworks have emerged to address the unique challenges posed by evaluating RAG systems. We will look at two of them here.

RAGAs

RAGAs (Retrieval Augmented Generation Assessment) is a framework designed to evaluate RAG systems by focusing on the three critical quality aspects discussed above: faithfulness, answer relevance, and context relevance. These metrics ensure that the generated content is accurate, relevant, and contextually appropriate [1].

One of the major innovations of RAGAs is its use of large language models (LLMs) to generate evaluation data instead of relying on human annotators. Traditional evaluation methods require extensive human-labeled datasets, which are costly and time-consuming to produce. RAGAs leverages LLMs to create synthetic data for evaluation, significantly reducing the dependency on human annotations. This approach not only makes the evaluation process more scalable but also more practical for real-world applications where obtaining labeled data is impractical.

By employing automated tools and pre-trained models to evaluate faithfulness, context relevance, and answer relevance, RAGAs offers a comprehensive assessment of RAG systems. This ensures that all critical aspects of quality are measured accurately, enhancing the overall reliability of the evaluation.
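In practice, running the ragas library looks roughly like the sketch below. This assumes an early 0.1.x release with an OpenAI API key in the environment; metric names and the expected dataset schema have changed across versions, so treat this as illustrative and check the documentation for the version you install.

```python
# Sketch of evaluating a few RAG outputs with the ragas library
# (assumes an early ragas 0.1.x release; APIs differ in newer versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, faithfulness

samples = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris has been the capital of France since 508 AD."]],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)
print(result)  # per-metric scores averaged over the samples
```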

ARES

ARES (Automated RAG Evaluation System) is another framework designed to evaluate the performance of RAG systems. Similar to RAGAs, ARES aims to overcome the limitations of traditional evaluation methods by minimizing the reliance on annotated data [10].

ARES automates the evaluation process by using pre-trained language models to assess the quality of both the retrieval and generation stages. This automation allows for a more scalable and efficient evaluation process. ARES employs a variety of automated metrics to evaluate the relevance and quality of retrieved documents and generated answers, ensuring a comprehensive assessment of the entire RAG pipeline.
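Conceptually, this kind of judge-based scoring can be pictured as follows: a lightweight sequence-pair classifier scores each (question, passage) pair, and the per-pair decisions are aggregated into a dataset-level metric. The checkpoint name below is a placeholder for a judge you would fine-tune yourself (for example, on LLM-generated positive and negative pairs), not a published model.

```python
# Conceptual sketch of judge-based context-relevance scoring.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "my-org/context-relevance-judge"  # hypothetical fine-tuned judge
tokenizer = AutoTokenizer.from_pretrained(MODEL)
judge = AutoModelForSequenceClassification.from_pretrained(MODEL)
judge.eval()

def is_relevant(question: str, passage: str) -> bool:
    """Binary judge decision for one (question, passage) pair."""
    inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = judge(**inputs).logits
    return logits.argmax(dim=-1).item() == 1  # assumes class 1 == "relevant"

def context_relevance_rate(samples: list[dict]) -> float:
    """Fraction of retrieved passages judged relevant to their question."""
    hits = sum(is_relevant(s["question"], s["passage"]) for s in samples)
    return hits / len(samples)
```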

By focusing on dimensions such as relevance and coherence, ARES ensures that the system's outputs are accurate, relevant, and contextually appropriate. This holistic approach helps in identifying and addressing potential issues in both the retrieval and generation stages, ultimately improving the overall performance of the RAG system.

Conclusion

Evaluating the generation stage of Retrieval-Augmented Generation (RAG) systems requires more sophisticated approaches than traditional metrics like BLEU and ROUGE. By focusing on faithfulness, answer relevance, and context relevance, we ensure a comprehensive assessment of the generated output from a RAG pipeline.

Moreover, the data bottleneck can be addressed using modern techniques such as few-shot and zero-shot prompting of large language models (LLMs) to generate synthetic evaluation data and judge outputs. These techniques are implemented in frameworks like RAGAs and ARES, enabling scalable and practical evaluation.

These advancements allow us to measure the performance of RAG systems accurately and efficiently, ultimately leading to more reliable and trustworthy AI-driven solutions.

References

[1] Es, Shahul, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." (2023). http://arxiv.org/abs/2309.15217

[2] Zhou, Wenxuan, Sheng Zhang, Hoifung Poon, and Muhao Chen. "Context-Faithful Prompting for Large Language Models." In Findings of the Association for Computational Linguistics: EMNLP 2023, 14544–56. Singapore: Association for Computational Linguistics, 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.968

[3] Dhingra, Bhuwan, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. "Handling Divergent Reference Texts when Evaluating Table-to-Text Generation." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

[4] Falke, Tobias, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. "Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2214–20. Florence, Italy: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/P19-1213

[5] Jelinek, F., R. L. Mercer, L. R. Bahl, and J. K. Baker. "Perplexity—a Measure of the Difficulty of Speech Recognition Tasks." The Journal of the Acoustical Society of America 62, no. S1 (1977): S63. https://doi.org/10.1121/1.2016299

[6] Wang, Shufan, Yixiao Song, Andrew Drozdov, Aparna Garimella, Varun Manjunatha, and Mohit Iyyer. "KNN-LM Does Not Improve Open-Ended Text Generation." (2023). http://arxiv.org/abs/2305.14625

[7] Banerjee, Satanjeev, and Alon Lavie. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments." In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.

[8] Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "BERTScore: Evaluating Text Generation with BERT." (2020). http://arxiv.org/abs/1904.09675

[9] Santhanam, Sashank, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, and Dilek Hakkani-Tur. "Rome Was Built in 1776: A Case Study on Factual Correctness in Knowledge-Grounded Response Generation." (2022). http://arxiv.org/abs/2110.0545

[10] Saad-Falcon, Jon, Omar Khattab, Christopher Potts, and Matei Zaharia. "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." (2024). http://arxiv.org/abs/2311.09476
