The aim of this paper is to challenge and compare several Multi-Armed Bandit algorithms in an environment with a fixed kernelized reward function and noisy observations. Multi-Armed Bandits are a class of sequential decision-making problems whose goal is to optimize the trade-off between exploration and exploitation of the available choices. Each decision yields some reward, and the objective is to minimize regret: the cumulative difference between the reward of the decisions actually made and the reward of the optimal decisions. In particular, these algorithms must balance choosing the best-known option against exploring new, possibly better options. They are widely used in reinforcement learning, optimization and economics, where decisions must be made under incomplete information and uncertainty. Environments differ, however, and some algorithms perform better in some environments than in others.
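To make the regret notion concrete, the following is a minimal sketch (not taken from the paper) of an epsilon-greedy strategy on a hypothetical 3-armed bandit with fixed mean rewards and Gaussian observation noise; the arm means, noise level, and epsilon value are illustrative assumptions. Regret is measured exactly as defined above: the gap between the reward of always playing the optimal arm and the reward actually collected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: fixed (unknown to the agent) mean rewards.
true_means = np.array([0.2, 0.5, 0.8])
n_rounds = 1000
epsilon = 0.1  # probability of exploring a random arm

counts = np.zeros(3)     # pulls per arm
estimates = np.zeros(3)  # running mean of observed rewards per arm
rewards = []

for t in range(n_rounds):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(estimates))
    reward = true_means[arm] + rng.normal(0.0, 0.1)  # noisy observation
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    rewards.append(reward)

# Regret: optimal cumulative reward minus the reward actually collected.
regret = n_rounds * true_means.max() - sum(rewards)
print(f"cumulative regret after {n_rounds} rounds: {regret:.1f}")
```

A strategy that explored less as its estimates sharpened (as the algorithms compared in this paper do) would keep the regret growing sub-linearly instead of paying the fixed epsilon exploration cost on every round.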