Reinforcement learning (RL) is a type of machine learning in which an agent learns by
observing the current state it is in, selecting an action to execute, and observing the
reward for that action, after which it receives the next state and repeats the cycle until it
reaches its goal. The traditional online training approach lets the agent interact directly
with the live environment, but this is not always possible, as the live environment may be
too dangerous or costly to train in. In such cases, offline training, which instead trains
the agent on pre-collected datasets of these interactions and tries to learn a better policy
than the one used for collection, offers a viable alternative, for example through
Q-Learning methods such as CQL. However, prior studies, such as Mediratta et al., have
suggested that Behavior Cloning (BC), a form of imitation learning, may outperform modern
offline RL methods in the multi-task setting, where model generalization is tested on new
or similar tasks rather than on the tasks trained on. Given these results, the question
arises whether it is worthwhile to employ modern Q-Learning methods designed to derive a
better policy than the one used to collect the data, especially when they are unable to
outperform standard imitation learning.
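To make the setting concrete, the sketch below contrasts online data collection with the fixed dataset that offline methods such as BC and CQL train on. The toy chain environment, the suboptimal behavior policy, and the count-based BC fit are illustrative assumptions only, not the actual environment or implementations used in this study.

import random
from collections import Counter, defaultdict

# Toy chain environment (illustrative assumption): states 0..4, the agent
# starts at state 0 and receives a reward of 1.0 on reaching the goal state 4.
# Actions: 0 = move left, 1 = move right.
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

# Online RL: the agent interacts with the live environment directly and
# observes a (state, action, reward, next_state, done) transition each step.
def collect_episode(policy, max_steps=20):
    transitions, state, done, t = [], 0, False, 0
    while not done and t < max_steps:
        action = policy(state)
        next_state, reward, done = step(state, action)
        transitions.append((state, action, reward, next_state, done))
        state, t = next_state, t + 1
    return transitions

# Offline RL instead starts from a fixed, pre-collected dataset of such
# transitions, gathered here by a suboptimal behavior policy that moves
# right only 70% of the time.
random.seed(0)
behavior_policy = lambda s: 1 if random.random() < 0.7 else 0
dataset = [tr for _ in range(200) for tr in collect_episode(behavior_policy)]

# Behavior Cloning ignores the rewards and simply imitates the behavior
# policy, e.g. by picking the most frequent action per state; offline RL
# methods such as CQL additionally use the stored rewards to try to learn
# a policy that improves on the one that collected the data.
action_counts = defaultdict(Counter)
for s, a, r, s_next, d in dataset:
    action_counts[s][a] += 1
bc_policy = {s: counts.most_common(1)[0][0] for s, counts in action_counts.items()}
print(bc_policy)  # e.g. {0: 1, 1: 1, 2: 1, 3: 1}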
This study seeks to reproduce and extend these findings within a custom environment.
The results reveal that, contrary to the aforementioned report, BC does not consistently
outperform CQL. Both machine learning methods exhibit comparable performance across
datasets varying in diversity and size. Additionally, incorporating more diverse data
significantly enhances generalization performance.