The aim of this paper is to present experimental data on the regret-based performance of various solver algorithms for a class of decision problems called Multi-Armed Bandits. This can help practitioners choose the algorithm best suited to an application more efficiently and reduce the amount of modification needed to adapt the algorithm to its purpose. We do this because current research is largely based on theoretical analysis, conducted strictly within a specific environment using random data. Such studies are difficult to use for predicting the performance of a practical implementation in a less strictly defined environment, yet data on more directly comparable practical implementations is limited. Our experiments cover several of the environments that some of the algorithms were designed for, including a basic stochastic environment, a static linear contextual environment, and two different changing contextual environments. Based on the experiments run in these environments, we conclude that algorithms designed for stochastic environments tend to perform considerably worse in contextual environments than algorithms designed for linear contextual environments, and that all of the algorithms we tested that were designed for contextual environments tend to perform similarly in most cases. We also find that, in most cases, the contextual algorithms' performance relative to each other in changing environments is similar to their relative performance in static ones, even though the absolute regret behaviour tends to differ between environments. Our findings also include more specific strengths and weaknesses of the individual algorithms.
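Since the comparisons above are phrased in terms of regret, it may help to recall the standard notion of cumulative regret; the following is the usual textbook definition, which we assume here since the abstract does not state the paper's exact metric:
\[
R_T \;=\; \sum_{t=1}^{T} \bigl( \mu^{*} - \mu_{a_t} \bigr),
\]
where $\mu^{*}$ is the expected reward of the optimal arm, $\mu_{a_t}$ is the expected reward of the arm chosen at round $t$, and $T$ is the horizon. In contextual settings the optimum is typically taken per context, so $\mu^{*}$ becomes the expected reward of the best arm for the context observed at round $t$.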