Policy Distillation in Offline Multi-task Reinforcement Learning
Abstract
In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards. Multi-Task Reinforcement Learning (MTRL) extends this concept by training a single agent to perform multiple tasks simultaneously, allowing for more efficient use of resources and behavior sharing between tasks. Policy Distillation (PD) is a technique commonly used in MTRL, where policies from multiple single-task agents (teachers) are distilled into a single multi-task agent (student). This is done by merging common structure across tasks, while separating task-specific properties.
However, existing PD approaches require interactions with the environment during training. In this work, we investigate the effectiveness of PD in the offline setting, where the agent has no interaction with the environment before deployment and can only learn from previously collected data. Through a series of experiments, we demonstrate that a straightforward approach yields the highest performance. This approach involves first learning teacher policies using an existing offline RL algorithm, then distilling these policies into a student by sampling states from the offline data and applying a Mean Squared Error (MSE) loss between the teachers’ and student’s best actions. Moreover, we investigate the effect of a state distribution shift, a major challenge in offline RL, on our approach. We find that such shifts affect performance only slightly, and that this slight effect appears when the neural networks are relatively small or the distribution shift is substantial.
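The distillation step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the teachers are hypothetical fixed linear maps standing in for policies trained with an offline RL algorithm, the offline states are random placeholders, and the student is a linear policy with one shared and one per-task weight matrix (all names, sizes, and the learning rate are assumptions chosen for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
N_TASKS, STATE_DIM, ACTION_DIM = 2, 4, 2

# ASSUMPTION: teachers are fixed linear maps; in practice they would be
# policies learned beforehand with an existing offline RL algorithm.
teacher_w = [rng.normal(size=(STATE_DIM, ACTION_DIM)) for _ in range(N_TASKS)]

# ASSUMPTION: placeholder offline datasets (states only), one per task.
offline_states = [rng.normal(size=(512, STATE_DIM)) for _ in range(N_TASKS)]


def teacher_action(task, states):
    """Best action of the (hypothetical) teacher for each state."""
    return states @ teacher_w[task]


# Student: shared weights plus a task-specific correction (illustrative only).
shared_w = np.zeros((STATE_DIM, ACTION_DIM))
task_w = [np.zeros((STATE_DIM, ACTION_DIM)) for _ in range(N_TASKS)]


def student_action(task, states):
    """Best action of the multi-task student for each state."""
    return states @ (shared_w + task_w[task])


def distillation_loss():
    """MSE between teacher and student actions, averaged over tasks."""
    per_task = []
    for k in range(N_TASKS):
        err = student_action(k, offline_states[k]) - teacher_action(k, offline_states[k])
        per_task.append(np.mean(err ** 2))
    return float(np.mean(per_task))


initial_loss = distillation_loss()

learning_rate = 0.05
for step in range(500):
    task = step % N_TASKS
    # Sample a batch of states from the offline data; no environment is used.
    idx = rng.integers(0, len(offline_states[task]), size=64)
    batch = offline_states[task][idx]
    err = student_action(task, batch) - teacher_action(task, batch)
    # Gradient of mean squared error with respect to the effective weights.
    grad = 2.0 * batch.T @ err / err.size
    shared_w -= learning_rate * grad
    task_w[task] -= learning_rate * grad

final_loss = distillation_loss()
print(f"MSE before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The loop only ever reads states from the stored datasets and queries the teachers on them, which is what makes the procedure offline. A neural-network student would replace the two weight matrices, but the sampling and loss would keep the same shape.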
We also explore how PD can be enhanced to better capture common structure across related tasks, a key to improving efficiency in MTRL. To this end, we formally define common structure at two levels: the trajectory level and the computational level. To the best of our knowledge, we present the first attempt to quantify the amount of common structure shared across tasks. This measurement reveals that task commonalities are not fully exploited automatically. At the computational level, we attempt to improve sharing of common structure by reducing the network size and adding a regularization term to the loss function. To capture more common structure at the trajectory level, we argue that multi-task exploration is required, meaning that behaviors from one task must be evaluated in the context of another task. We propose two extensions to our approach that introduce multi-task exploration: Data Sharing (DS) and Offline Q-Switch (OQS). While these extensions are capable of improving performance, they also have clear limitations.
Overall, we propose a new, high-performing offline MTRL method and provide valuable insights into the fundamental capabilities and limitations of PD in capturing common structure across tasks, specifically within the offline MTRL setting.