Evaluating the Efficacy and User Reliance on RAG Model Outputs
A comparative study with human experts
Abstract
The emergence of conversational AI systems such as ChatGPT and Microsoft Copilot has changed how users retrieve information.
Retrieval Augmented Generation (RAG) combines the generative capabilities of Large Language Models (LLMs) with retrieval over unstructured data, creating new opportunities in science and business.
RAG-based models have gained popularity, but their effectiveness, and the extent to which users come to rely on them in organizational settings, remain underexplored. This thesis presents a user study in which policy experts in the financial domain
were tasked with text aggregation using a basic RAG model. The study examines the model's performance and how the experts' reliance on it developed over four weeks.
Our key findings reveal that RAG-assisted outputs do not match the quality of texts produced by human experts.
The RAG model, however, excels in specific aspects such as structure, spelling, and grammar.
Additionally, the experts express satisfaction with the efficiency of RAG. Our findings suggest that user reliance on RAG increases with experience.
This underscores the need for interventions and policies to support responsible human-AI collaboration.
This work measures the temporal development of user reliance within a RAG system while assessing the system's efficacy in a field study with policy experts in the financial domain.
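
For readers unfamiliar with the retrieve-then-generate pattern underlying RAG, the sketch below illustrates the basic loop in Python. It is a minimal, hypothetical illustration: the thesis does not disclose its implementation, and the keyword-overlap retriever here stands in for the embedding- or BM25-based retrieval a real system would use. All names (Document, score, retrieve, build_prompt) are assumptions, not the study's code.

# Minimal, hypothetical retrieve-then-generate sketch; not the thesis implementation.
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def score(query: str, doc: Document) -> int:
    # Naive keyword-overlap relevance; real systems typically use
    # dense embeddings or BM25 instead.
    query_terms = set(query.lower().split())
    doc_terms = set(doc.text.lower().split())
    return len(query_terms & doc_terms)


def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    # Rank the unstructured corpus by relevance and keep the top-k passages.
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query: str, passages: list[Document]) -> str:
    # Ground the LLM by prepending the retrieved passages to the question.
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    corpus = [
        Document("p1", "Capital requirements for banks were revised in 2023."),
        Document("p2", "Liquidity coverage ratios apply to large institutions."),
        Document("p3", "The cafeteria menu changes weekly."),
    ]
    query = "What changed in bank capital requirements?"
    prompt = build_prompt(query, retrieve(query, corpus, k=2))
    print(prompt)  # In a real pipeline, this prompt is sent to an LLM for generation.

In a deployed RAG system, the quality of the final answer depends on the retrieval step as much as on the generation step, which is one reason output quality can fall short of expert-written text.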