T.D. de Bruin | TU Delft Repository

Fine-tuning deep RL with gradient-free optimization

Journal article (2020) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization techniques to enable relatively efficient learning, but are notoriously sensitive to hyperparameter c ...

Sample efficient deep reinforcement learning for control

Doctoral thesis (2020) - Tim de Bruin (author)

The arrival of intelligent, general-purpose robots that can learn to perform new tasks autonomously has long been promised. Deep reinforcement learning, which combines reinforcement learning with deep neural network function approximation, has the potential to enable robots to learn to perform a wide range of new tasks while requiring very little prior knowledge or human help. This framework might therefore help to finally make general-purpose robots a reality. However, the biggest successes of deep reinforcement learning have so far been in simulated game settings. To translate these successes to the real world, significant improvements are needed in the ability of these methods to learn quickly and safely. This thesis investigates what is needed to make this possible and makes contributions towards this goal.
Before deep reinforcement learning methods can be successfully applied in the robotics domain, an understanding is needed of how, when, and why deep learning and reinforcement learning work well together. This thesis therefore starts with a literature review, which is presented in Chapter 2. While the field is still in some regards in its infancy, it can already be noted that successful algorithms share important components. These components help to reconcile the differences between classical reinforcement learning methods and the training procedures used to successfully train deep neural networks. The main challenges in combining deep learning with reinforcement learning center around the interdependencies of the policy, the training data, and the training targets. Commonly used tools for managing the detrimental effects caused by these interdependencies include target networks, trust region updates, and experience replay buffers. Besides reviewing these components, a number of the more popular and historically relevant deep reinforcement learning methods are discussed.
Reinforcement learning involves learning through trial and error. However, robots (and their surroundings) are fragile, which makes these trials, and especially the errors, very costly. Therefore, the amount of exploration that is performed will often need to be drastically reduced over time, especially once a reasonable behavior has already been found. We demonstrate how, using common experience replay techniques, this can quickly lead to forgetting previously learned successful behaviors. This problem is investigated in Chapter 3. Experiments are conducted to investigate what distribution of the experiences over the state-action space leads to desirable learning behavior and what distributions can cause problems. It is shown how actor-critic algorithms are especially sensitive to the lack of diversity in the action space that can result from reducing the amount of exploration over time. Further relations between the properties of the control problem at hand and the required data distributions are also shown. These include a larger need for diversity in the action space when control frequencies are high and a reduced importance of data diversity for problems where generalizing the control strategy across the state space is more difficult.
While Chapter 3 investigates what data distributions are most beneficial, Chapter 4 instead proposes practical algorithms to select useful experiences from a stream of experiences. We do not assume any control over the stream of experiences, which makes it possible to learn from additional sources of experience like other robots, experiences obtained while learning different tasks, and experiences obtained using predefined controllers. We make two separate judgments on the utility of individual experiences. The first judgment is on the long-term utility of experiences, which is used to determine which experiences to keep in memory once the experience buffer is full. The second judgment is on the instantaneous utility of the experience to the learning agent. This judgment is used to determine which experiences should be sampled from the buffer to be learned from. To estimate the short- and long-term utility of the experiences we propose proxies based on the age, surprise, and the exploration intensity associated with the experiences. It is shown how prior knowledge of the control problem at hand can be used to decide which proxies to use. We additionally show how the knowledge of the control problem can be used to estimate the optimal size of the experience buffer and whether or not to use importance sampling to compensate for the bias introduced by the selection procedure. Together, these choices can lead to a more stable learning procedure and better performing controllers.
In Chapter 5 we look at what to learn from the collected data. The high price of data in the robotics domain makes it crucial to extract as much knowledge as possible from each and every datum. Reinforcement learning, by default, does not do so. We therefore supplement reinforcement learning with explicit state representation learning objectives. These objectives are based on the assumption that the neural network controller that is to be learned can be seen as consisting of two consecutive parts. The first part (referred to as the state encoder) maps the observed sensor data to a compact and concise representation of the state of the robot and its environment. The second part determines which actions to take based on this state representation. As the representation of the state of the world is useful for more than just completing the task at hand, it can also be trained with more general (state representation learning) objectives than just the reinforcement learning objective associated with the current task. We show how including these additional training objectives allows for learning a much more general state representation, which in turn makes it possible to learn broadly applicable control strategies more quickly. We also introduce a training method that ensures that the added learning objectives further the goal of reinforcement learning, without destabilizing the learning process through their changes to the state encoder.
The final contribution of this thesis, presented in Chapter 6, focuses on the optimization procedure used to train the second part of the policy: the mapping from the state representation to the actions. While we show that the state encoder can be efficiently trained with standard gradient-based optimization techniques, perfecting this second mapping is more difficult. Obtaining high-quality estimates of the gradients of the policy performance with respect to the parameters of this part of the neural network is usually not feasible. This means that while a reasonable policy can be obtained relatively quickly using gradient-based optimization approaches, this speed comes at the cost of the stability of the learning process as well as the final performance of the controller. Additionally, the unstable nature of this learning process brings with it an extreme sensitivity to the values of the hyper-parameters of the training method. This places an unfortunate emphasis on hyper-parameter tuning for getting deep reinforcement learning algorithms to work well. Gradient-free optimization algorithms can be simpler and more stable, but tend to be much less sample efficient. We show how the desirable aspects of both methods can be combined by first training the entire network through gradient-based optimization and subsequently fine-tuning the final part of the network in a gradient-free manner. We demonstrate how this enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using far fewer trials than methods using only gradient-free optimization.
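
The second stage of the two-stage idea in the abstract above (gradient-based pre-training followed by gradient-free fine-tuning of the final part of the network) can be illustrated with a minimal sketch. It is not the thesis implementation: the optimizer shown is plain random-search hill climbing, chosen only for brevity, and the names `fine_tune_head`, `evaluate`, and `toy_return` as well as the toy data are hypothetical. The encoder is assumed to be already trained and frozen, so only the final linear mapping is perturbed.

```python
import numpy as np

def fine_tune_head(evaluate, head, n_iters=200, sigma=0.1, seed=0):
    """Hill-climb the final-layer parameters `head` without using gradients.

    `evaluate(head) -> float` is a hypothetical stand-in for running the
    policy and returning its performance (higher is better).
    """
    rng = np.random.default_rng(seed)
    best, best_score = head.copy(), evaluate(head)
    for _ in range(n_iters):
        # Propose a random perturbation of the current best parameters.
        candidate = best + sigma * rng.standard_normal(best.shape)
        score = evaluate(candidate)
        if score > best_score:  # keep the candidate only if it improves
            best, best_score = candidate, score
    return best, best_score

# Toy stand-in for rollouts: fixed outputs of a frozen encoder, and a return
# that peaks when the head reproduces a hidden target mapping (assumed data).
rng = np.random.default_rng(1)
features = rng.standard_normal((64, 4))
target_head = np.array([0.5, -1.0, 2.0, 0.0])

def toy_return(head):
    error = features @ head - features @ target_head
    return -float(np.mean(error ** 2))

start = target_head + 0.3  # imperfect head, as if left by gradient training
tuned, tuned_score = fine_tune_head(toy_return, start)
```

Because a candidate is accepted only when it scores higher, the tuned head never scores below the starting head on this deterministic toy objective. With noisy returns from real rollouts, each candidate would typically need to be evaluated over several episodes before such a comparison is meaningful.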

Vision-based navigation using deep reinforcement learning

Conference paper (2019) - J. Kulhánek (author), Erik Derner (author), Tim de Bruin (author), Robert Babuška (author)

Deep reinforcement learning (RL) has been successfully applied to a variety of game-like environments. However, the application of deep RL to visual navigation with realistic environments is a challenging task. We propose a novel learning architecture capable of navigating an age ...

Integrating state representation learning into deep reinforcement learning

Journal article (2018) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

Most deep reinforcement learning techniques are unsuitable for robotics, as they require too much interaction time to learn useful, general control policies. This problem can be largely attributed to the fact that a state representation needs to be learned as a part of learning c ...

Reinforcement learning for control: Performance, stability, and deep approximators

Review (2018) - Lucian Buşoniu (author), Tim de Bruin (author), Domagoj Tolić (author), Jens Kober (author), Ivana Palunko (author)

Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the con ...

Experience selection in deep reinforcement learning for control

Journal article (2018) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

Experience replay is a technique that allows off-policy reinforcement-learning methods to reuse past experiences. The stability and speed of convergence of reinforcement learning, as well as the eventual performance of the learned policy, are strongly dependent on the experiences ...

Railway track circuit fault diagnosis using recurrent neural networks

Journal article (2017) - Tim de Bruin (author), Kim Verbert (author), Robert Babuška (author)

Timely detection and identification of faults in railway track circuits are crucial for the safety and availability of railway networks. In this paper, the use of the long-short-term memory (LSTM) recurrent neural network is proposed to accomplish these tasks based on the commonl ...

Off-policy experience retention for deep actor-critic learning

Conference paper (2016) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

When a limited number of experiences is kept in memory to train a reinforcement learning agent, the criterion that determines which experiences are retained can have a strong impact on the learning performance. In this paper, we argue that for actor critic learning in domains wit ...

Improved deep reinforcement learning for robotics through distribution-based experience retention

Conference paper (2016) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

Recent years have seen a growing interest in the use of deep neural networks as function approximators in reinforcement learning. In this paper, an experience replay method is proposed that ensures that the distribution of the experiences used for training is between that of the ...

The importance of experience replay database composition in deep reinforcement learning

Conference paper (2015) - Tim de Bruin (author), Jens Kober (author), Karl Tuyls (author), Robert Babuška (author)

Recent years have seen a growing interest in the use of deep neural networks as function approximators in reinforcement learning. This paper investigates the potential of the Deep Deterministic Policy Gradient method for a robot control problem both in simulation and in a real se ...