Over the last decade, there have been significant advances in model-based deep reinforcement learning. One of the most successful such algorithms is AlphaZero, which combines Monte Carlo Tree Search with deep learning. AlphaZero and its successors commonly describe a unified framework for tree construction and acting: for instance, the tree is built with PUCT and actions are taken according to visitation counts. Policies based on visitation counts inherently make assumptions about how the tree was constructed, which is problematic because it constrains the construction algorithm; breadth-first tree construction, for example, yields a uniform visitation policy. To address this, we investigate the goals of extracting policies from decision trees and propose novel construction-decoupled policies. Furthermore, we use these policies to modify how decision nodes are evaluated and exploit this during tree construction. We provide theoretical analysis and empirical evidence that our novel policies can benefit AlphaZero. Our results on classical Gym environments show that the benefits are especially prominent for limited simulation budgets. The code is available on GitHub.
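For reference, the coupling described above can be seen in the standard AlphaZero-style formulation (shown here as a common convention, not a definition from this work): during tree construction, PUCT selects the action maximizing an upper-confidence score, and the acting policy is extracted from root visitation counts, where $N$ denotes visit counts, $P$ the network prior, $Q$ the value estimate, $c_{\mathrm{puct}}$ the exploration constant, and $\tau$ a temperature:
\[
a^{*} = \arg\max_{a} \Big[ Q(s,a) + c_{\mathrm{puct}} \, P(s,a) \, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \Big],
\qquad
\pi(a \mid s) = \frac{N(s,a)^{1/\tau}}{\sum_{b} N(s,b)^{1/\tau}}.
\]
Under breadth-first construction all root counts $N(s,a)$ are equal, so this $\pi$ is uniform regardless of the value estimates, which illustrates why visit-count policies constrain the choice of construction algorithm.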