Learning Complex Policy Distribution with CEM Guided Adversarial Hypernetwork
Abstract
The Cross-Entropy Method (CEM) is a gradient-free direct policy search method that offers greater stability and is less sensitive to hyperparameter tuning. CEM is similar to population-based evolutionary methods, but rather than maintaining a population it maintains a distribution over candidate solutions (policies, in our case). Typically, a natural exponential family distribution such as the multivariate Gaussian is used to parameterize the policy distribution. Using a multivariate Gaussian limits the quality of CEM policies, as the search becomes confined to a less representative subspace. We address this drawback with an adversarially trained hypernetwork, which enables a richer, more expressive representation of the policy distribution. To achieve better training stability and faster convergence, we use a multivariate Gaussian CEM policy to guide the adversarial training process. Experiments demonstrate that our approach outperforms state-of-the-art CEM-based methods by 15.8% in terms of reward while converging faster. Results also show that our approach is less sensitive to hyperparameters than other deep RL methods such as REINFORCE, DDPG, and DQN.
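For context, the sketch below shows the standard Gaussian-parameterized CEM policy search loop that the abstract refers to, not the paper's hypernetwork variant. The objective function, parameter dimension, population size, and elite count are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of a gradient-free CEM policy search loop, assuming a
# diagonal multivariate Gaussian over flattened policy parameters.
import numpy as np

def evaluate(theta):
    """Hypothetical stand-in: roll out the policy parameterized by theta
    in the environment and return its episodic reward."""
    return -np.sum((theta - 1.0) ** 2)  # toy objective for demonstration

dim = 10                       # number of policy parameters (illustrative)
pop_size, n_elite = 64, 8      # candidates per iteration and elite count
mu = np.zeros(dim)             # mean of the Gaussian over policies
sigma = np.ones(dim)           # per-dimension standard deviation

for iteration in range(50):
    # Sample candidate policies from the current Gaussian distribution.
    candidates = mu + sigma * np.random.randn(pop_size, dim)
    rewards = np.array([evaluate(c) for c in candidates])
    # Keep the top-performing (elite) candidates.
    elites = candidates[np.argsort(rewards)[-n_elite:]]
    # Refit the Gaussian to the elites (gradient-free distribution update).
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
```

Because the sampling distribution here is a single Gaussian, the search is restricted to the region it can represent; this is the limitation the paper's adversarial hypernetwork is intended to overcome.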