Challenges in the evaluation of conversational search systems


Abstract

The area of conversational search has gained significant traction in the IR research community, motivated by the widespread use of personal assistants. A frequently researched task in this setting is conversation response ranking: retrieving the best response for a given ongoing conversation from a corpus of historic conversations. While this is intuitively an important step towards (retrieval-based) conversational search, the empirical evaluation currently employed for trained rankers is far removed from this setup: typically, an extremely small number (e.g., 10) of non-relevant responses and a single relevant response are presented to the ranker. In a real-world scenario, a retrieval-based system has to retrieve responses from a large pool (e.g., several million responses) or determine that no appropriate response can be found. In this paper we highlight these critical issues in the offline evaluation schemes used for tasks related to conversational search and argue that the evaluation schemes currently in use have severe limitations, simplifying the tasks to a degree that makes it questionable whether we can trust the findings they deliver.
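To make the critiqued setup concrete, the following is a minimal sketch (not the authors' code) of the small-candidate-pool evaluation described above: each test instance pairs one relevant response with a handful of sampled non-relevant responses, and the ranker only has to order this tiny list. The function name recall_n_at_k and the score callable are hypothetical stand-ins for any trained (context, response) relevance model.

```python
from typing import Callable, List, Tuple


def recall_n_at_k(
    instances: List[Tuple[str, str, List[str]]],  # (context, relevant response, sampled distractors)
    score: Callable[[str, str], float],           # hypothetical ranker: relevance of response to context
    k: int = 1,
) -> float:
    """Fraction of instances whose relevant response is ranked in the top k
    of the (1 + number of distractors)-candidate list, i.e., R_n@k."""
    hits = 0
    for context, relevant, distractors in instances:
        candidates = [relevant] + distractors
        ranked = sorted(candidates, key=lambda r: score(context, r), reverse=True)
        hits += relevant in ranked[:k]
    return hits / len(instances)
```

Note that this scheme guarantees that a relevant response is always among the candidates and that the candidate list stays tiny; a deployed retrieval-based system would instead face a corpus of millions of responses and the possibility that no appropriate response exists at all.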