Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison

Sun, Zhu; Yu, Di; Fang, H; Yang, Jie; Qu, Xinghua; Zhang, Jie; Geng, Cong

Conference paper (2020)

Authors

Zhu Sun, Macquarie University

Di Yu, Shanghai University of Finance and Economics

H Fang, Shanghai University of Finance and Economics

Jie Yang, Web Information Systems

Xinghua Qu, Nanyang Technological University

Jie Zhang, Nanyang Technological University

Cong Geng, Shanghai University of Finance and Economics

Research Group

Web Information Systems

Recommender Systems, Benchmarks, Reproducible Evaluation

To reference this document use:

http://resolver.tudelft.nl/uuid:fa4d29ab-e7d4-4b26-ae32-1ea4feba24f2

Published Date

2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

With a tremendous number of recommendation algorithms proposed every year, one critical issue has attracted considerable attention: there are no effective benchmarks for evaluation, which leads to two major concerns, namely unreproducible evaluation and unfair comparison. This paper aims to conduct rigorous (i.e., reproducible and fair) evaluation of implicit-feedback-based top-N recommendation algorithms. We first systematically review 85 recommendation papers published at eight top-tier conferences (e.g., RecSys, SIGIR) to summarize important evaluation factors, such as data splitting and parameter tuning strategies. Through a holistic empirical study, we then analyze in depth the impact of these factors on recommendation performance. Following that, we create benchmarks with standardized procedures and report the performance of seven well-tuned state-of-the-art algorithms across six metrics on six widely used datasets as a reference for later studies. Additionally, we release a user-friendly Python toolkit, which differs from existing ones in addressing the broad scope of rigorous evaluation for recommendation. Overall, our work sheds light on the issues in recommendation evaluation and lays the foundation for further investigation. Our code and datasets are available on GitHub (https://github.com/AmazingDD/daisyRec).
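
To make two of the evaluation factors named in the abstract more concrete (data splitting and top-N metrics), the following is a minimal illustrative sketch. It is not taken from the paper or from the daisyRec toolkit: the function names, the choice of a seeded leave-one-out split, and the choice of hit rate and NDCG are assumptions made here for illustration, since the abstract does not name the specific strategies or metrics it covers.

```python
import math
import random


def split_leave_one_out(interactions, seed=42):
    """Hold out one item per user for testing (hypothetical helper).

    `interactions` maps each user to a list of distinct item ids.
    A fixed seed and sorted iteration make the split repeatable.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user in sorted(interactions):
        items = sorted(interactions[user])
        held_out = rng.choice(items)
        test[user] = held_out
        train[user] = [item for item in items if item != held_out]
    return train, test


def hit_rate_at_n(ranked_items, target, n):
    """Return 1.0 if the held-out item appears in the top-n list, else 0.0."""
    return 1.0 if target in ranked_items[:n] else 0.0


def ndcg_at_n(ranked_items, target, n):
    """NDCG@n with a single relevant item, so the ideal DCG equals 1."""
    top_n = ranked_items[:n]
    if target not in top_n:
        return 0.0
    rank = top_n.index(target)  # 0-based position in the ranking
    return 1.0 / math.log2(rank + 2)
```

Under these assumptions, calling `split_leave_one_out` twice with the same seed yields the same train/test partition, which is the kind of property a reproducible evaluation protocol depends on; reporting the seed, the split strategy, and the cutoff `n` alongside the scores is what lets others repeat the comparison.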

Files

3383313.3412489.pdf (PDF, 1.36 MB)

License info not available
