C.T. Ponnambalam | TU Delft Repository

Abstraction-Guided Modular Reinforcement Learning

Doctoral thesis (2023) - C.T. Ponnambalam (author)

Reinforcement learning (RL) models the learning process of humans, but as exciting advances are made that use increasingly deep neural networks, some of the fundamental strengths of human learning are still underutilized by RL agents. One of the most exciting properties of RL is that it appears to be incredibly flexible, requiring no model or knowledge of the task to be solved. However, this thesis argues that RL is inherently inflexible for two main reasons: (1) if there is existing knowledge, incorporating it without compromising the optimality of the solution is highly non-trivial, and (2) RL solutions cannot be easily transferred between tasks, and generally require complete retraining to guarantee that a solution will work in a new task. Humans, on the other hand, are very flexible learners. We easily transfer knowledge from one task to another, and can learn from knowledge that we acquired in other tasks or that other people share with us. Humans are exceptionally good at abstraction, that is, at developing conceptual understandings that allow us to extend knowledge to never-before-seen experiences. No artificial agent or neural network has displayed the abstraction and generalization capabilities of humans in such varied tasks and environments. Despite this, the human is commonly used as a tool for abstraction only at the stage of defining the model. In general, this means making choices about what to include in the state space so that the problem is solvable without unnecessary complexity. While necessary, this step is not explicitly referred to as abstraction, and it is generally not considered relevant to how RL is applied. Much of the research in RL is less focused on how the problem is modelled, and instead centers on the development and application of computational advances that allow for solving bigger and bigger problems.
Applying abstraction explicitly is highly non-trivial: confirming that an abstract problem preserves the necessary information of the true problem can generally only be done once a full solution has been found, which defeats the purpose of the abstraction when such a solution is out of reach. Even when such a confirmation can be made, the abstraction can be the result of a very complex function that would be difficult for a human to define. In this work, human-defined abstractions are used in a way that goes beyond the initial definition of the problem. The first approach, presented in Chapter 3, breaks a problem into several abstract problems and uses the same experience to solve all of them at the same time. A meta-agent learns how to compose the learned policies to find the optimal policy. In Chapter 4, a method is introduced that uses supervised learning to train a model on partially observable experience which is labelled with hindsight. The agent then learns a policy on predicted states, trading off information gathering against reward maximization. The last method, presented in Chapter 5, is a modular approach to offline RL, a setting where even expert data can prove insufficient if it does not cover the entire problem space. This method introduces a second problem: recovering the agent to a state where it can safely follow the expert’s action. The method applies abstraction to multiply the given data and safely plan recovery policies. Combining the recovery policies with the imitation policy maintains high performance even when the expert data provided is limited. In the methods developed in this research, a learning-to-learn component enables the agent to relax the usually strict requirements of abstraction, the parallel processing allows the agent to learn more from fewer samples, and the modularity means that the agent can transfer its knowledge to other related tasks.
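The Chapter 5 idea summarized above, following the expert wherever the demonstrations cover the current situation and otherwise falling back to a recovery policy, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `make_modular_policy`, the toy corridor, the lookup-table expert, and the hand-written recovery rule are all assumptions made for illustration (the thesis plans its recovery policies rather than hard-coding them).

```python
from typing import Callable, Dict, Hashable

State = Hashable
Action = str


def make_modular_policy(
    expert_actions: Dict[Hashable, Action],
    abstraction: Callable[[State], Hashable],
    recovery_policy: Callable[[State], Action],
) -> Callable[[State], Action]:
    """Combine an imitation module with a recovery module (hypothetical helper).

    expert_actions maps *abstract* states to the expert's action, so one
    demonstration covers every concrete state with the same abstraction.
    """

    def policy(state: State) -> Action:
        key = abstraction(state)
        if key in expert_actions:
            # Covered by the demonstrations: imitate the expert.
            return expert_actions[key]
        # Not covered: act to get back to a state the expert has shown.
        return recovery_policy(state)

    return policy


# Toy 1-D corridor (assumed example): a state is (position, irrelevant_feature).
demos = {0: "right", 1: "right", 2: "stay"}  # expert only visited positions 0..2
drop_noise = lambda s: s[0]                  # abstraction ignores the second feature
go_back = lambda s: "left" if s[0] > 2 else "right"  # hand-written recovery rule

policy = make_modular_policy(demos, drop_noise, go_back)
```

In this toy setting, `policy((1, 7))` imitates the expert even though the concrete state `(1, 7)` never appeared in the data, while `policy((5, 3))` falls back to the recovery rule because position 5 lies outside the demonstrated region.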

Back to the Future: Solving Hidden Parameter MDPs with Hindsight

Conference paper (2022) - C.T. Ponnambalam (author), Danial Kamran (author), T. D. Simão (author), F.A. Oliehoek (author), M.T.J. Spaan (author)

A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics using Constrained Reinforcement Learning

Conference paper (2022) - Danial Kamran (author), T. D. Simão (author), Q. Yang (author), C.T. Ponnambalam (author), Johannes Fischer (author), M.T.J. Spaan (author), Martin Lauer (author)

The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositio ...

PEBL: Pessimistic Ensembles for Offline Deep Reinforcement Learning

Conference paper (2021) - Jordi Smit (author), C.T. Ponnambalam (author), M.T.J. Spaan (author), F.A. Oliehoek (author)

Offline reinforcement learning (RL), or learning from a fixed data set, is an attractive alternative to online RL. Offline RL promises to address the cost and safety implications of taking numerous random or bad actions online, a crucial aspect of traditional RL that makes it d ...

Abstraction-Guided Policy Recovery from Expert Demonstrations

Conference paper (2021) - C.T. Ponnambalam (author), F.A. Oliehoek (author), M.T.J. Spaan (author)

Behavior cloning is a method of automated decision-making that aims to extract meaningful information from expert demonstrations and reproduce the same behavior autonomously. It is unlikely that demonstrations will exhaustively cover the potential problem space, compromising the ...

Abstraction-Guided Policy Recovery from Expert Demonstrations

Conference paper (2020) - C.T. Ponnambalam (author), F.A. Oliehoek (author), M.T.J. Spaan (author)

The goal in behavior cloning is to extract meaningful information from expert demonstrations and reproduce the same behavior autonomously. However, the available data is unlikely to exhaustively cover the potential problem space. As a result, the quality of automated decision makin ...

Interval Q-Learning: Balancing Deep and Wide Exploration

Conference paper (2020) - G. Neustroev (author), C.T. Ponnambalam (author), M.M. de Weerdt (author), M.T.J. Spaan (author)

Reinforcement learning requires exploration, leading to repeated execution of sub-optimal actions. Naive exploration techniques address this problem by changing gradually from exploration to exploitation. This approach employs a wide search resulting in exhaustive exploration and ...