F.A. Oliehoek | TU Delft Repository

Risk-sensitive Reinforcement Learning for Portfolio Allocation

Master thesis (2024) - A.A. Sinha (author) , Frans A. Oliehoek (mentor) , Luciano Cavalcante Siebert (graduation committee member) , A. Papapantoleon (graduation committee member) , Mustafa Mert Çelikok (graduation committee member) , Rob Huisman (graduation committee member)

This study explores the application of risk-sensitive Reinforcement Learning (RL) in portfolio optimization, aiming to integrate asset pricing and portfolio construction into a unified, end-to-end RL framework. While RL has shown promise in various domains, its traditional risk-n ...

Influence Based Multi Agent Reinforcement Learning for Active Wake Control

Using influence to increase energy production using multi agent reinforcement learning

Master thesis (2024) - M.K. Plesner (author) , F.A. Oliehoek (mentor) , Mathijs M. de Weerdt (graduation committee member) , G. Neustroev (mentor)

The increasing demand for electricity has led to demand for more efficient energy production. One promising option is wind power, which currently provides an estimated 7.8% of the world’s energy production. One of the problems with wind energy is that a small percentage of the energy is lost due to the wake effect. The wake of a wind turbine is an area of low wind speed and high turbulence caused by the spinning of the turbine. This wake effect can be mitigated by active wake control, a process in which the wake from a turbine is redirected away from downwind turbines by changing the yaw of the turbine head. Calculating a policy for this using numerical optimisation is computationally expensive. Therefore, multi agent reinforcement learning is proposed to learn a policy which performs active wake control.
The proposed approach makes use of the popular reinforcement learning algorithm REINFORCE, and extends it using a variety of methods. First, a simplified version of the problem is treated, wherein the wind direction is fixed. Then the problem is made more realistic by introducing changing wind directions. The first extension of REINFORCE that is treated is difference rewards, a reward shaping strategy which seeks to solve the credit assignment problem, thereby improving cooperation between turbines. The second method uses training regimes, which train different agents at different times to stabilise the environment as much as possible. Next, role-based reinforcement learning is used to counteract the complexity of the problem by allowing each agent to specialise for a certain role. Finally, since roles cannot be manually determined for larger farms, influence-based abstraction is used to enable agents to learn the roles themselves, by abstracting spatial information and presenting it to the agent as an observation.
The results demonstrate that multi agent reinforcement learning can be used to perform active wake control in wind farms. Furthermore, the extensions proposed are shown to improve learning, and lead to greater energy output. While multi agent reinforcement learning is shown to be a promising way to tackle active wake control in wind farms, research is needed to improve the stability of the learned policies.
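The difference-rewards idea mentioned in the abstract can be illustrated with a minimal sketch. Each agent is credited with the global reward minus the global reward obtained when its own action is replaced by a default action. Everything below is an assumption made for illustration: `toy_farm_power` is a hypothetical stand-in with made-up numbers, not a wake simulator and not the thesis's environment, and `difference_rewards` and `default_action` are names chosen here.

```python
from typing import Callable, List, Sequence


def toy_farm_power(yaws: Sequence[float]) -> float:
    """Hypothetical power model for a single row of turbines.

    Assumed numbers: a yawed turbine loses 10% of its own output, but the
    next turbine downwind then sees an inflow of 0.9 instead of 0.6.
    """
    total = 0.0
    for i, yaw in enumerate(yaws):
        if i == 0:
            inflow = 1.0  # the first turbine sees undisturbed wind
        else:
            inflow = 0.9 if yaws[i - 1] != 0 else 0.6
        efficiency = 0.9 if yaw != 0 else 1.0
        total += inflow * efficiency
    return total


def difference_rewards(
    actions: Sequence[float],
    global_reward: Callable[[Sequence[float]], float],
    default_action: float = 0.0,
) -> List[float]:
    """Return D_i = G(a) - G(a with a_i replaced by default_action) per agent."""
    base = global_reward(actions)
    rewards = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = default_action
        rewards.append(base - global_reward(counterfactual))
    return rewards


# Turbine 0 yaws by 20 degrees, turbine 1 stays aligned with the wind.
rewards = difference_rewards([20.0, 0.0], toy_farm_power)
# Turbine 0 is credited about 0.2 (farm output 1.8 instead of 1.6);
# turbine 1 gets 0.0 because it already takes the default action.
```

In this toy setting the upwind turbine receives the whole credit for the farm-level gain, which is the kind of per-agent signal that reward shaping for credit assignment aims to provide.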

Use of sample-splitting and cross-fitting techniques to mitigate the risks of double-dipping in behaviour-agnostic reinforcement learning

Comparative Analysis

Bachelor thesis (2024) - Y. Aslan (author) , S.R. Bongers (mentor) , Frans A Oliehoek (mentor) , Catholijn M. Jonker (graduation committee member)

This paper addresses the issue of double-dipping in off-policy evaluation (OPE) in behaviour-agnostic reinforcement learning, where the same dataset is used for both training and estimation, leading to overfitting and inflated performance metrics especially for variance. We intro ...

The Effect of State-visitation Mismatch on Off-policy Performance in Behaviour-agnostic Reinforcement Learning

Bachelor thesis (2024) - Kevin C. Chen (author) , S.R. Bongers (mentor) , F.A. Oliehoek (mentor) , C.M. Jonker (graduation committee member)

Off-policy evaluation has several key problems, one of them being the “curse of horizon”. With recent breakthroughs [1] [2], new estimators have emerged that utilise importance sampling of the individual state-action pairs and reward rather than over the whole trajectory. With t ...

SimuDICE: Offline Policy Optimization Through Iterative World Model Updates and DICE Estimation

Bachelor thesis (2024) - C. Brita (author) , Frans A Oliehoek (mentor) , S.R. Bongers (mentor) , Catholijn M. Jonker (graduation committee member)

In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatched state-action distribution between the target policy and the behavioral policy that generated the data. Learning a dynamic ...

Impact of State Visitation Mismatch Methods on the Performance of On-Policy Reinforcement Learning

Bachelor thesis (2024) - H. Cho (author) , Frans A Oliehoek (mentor) , S.R. Bongers (mentor) , Catholijn M. Jonker (graduation committee member)

In the field of reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitat ...

The Impact of Initial Start Distribution Mismatch on Policy Evaluation in Behavior-agnostic Reinforcement Learning

Bachelor thesis (2024) - T. Sabău (author) , Frans A. Oliehoek (mentor) , S.R. Bongers (mentor) , Catholijn Jonker (graduation committee member)

Behavior-agnostic reinforcement learning is a rapidly expanding research area focusing on developing algorithms capable of learning effective policies without explicit knowledge of the environment's dynamics or specific behavior policies. It proposes robust techniques to perform ...

See Clearly, Act Intelligently: Transformers in Transparent Environments

Bachelor thesis (2024) - O. Elamin (author) , Jinke He (mentor) , F.A. Oliehoek (mentor) , Mathijs M. De Weerdt (graduation committee member)

Traditionally, Recurrent Neural Networks (RNNs) are used to predict the sequential dynamics of the environment. With the advancement and breakthroughs of Transformer models, there has been demonstrated improvement in the performance & sample efficiency of Transformers as worl ...

Understanding the Effects of Discrete Representations in Model-Based Reinforcement Learning

An analysis on the effects of categorical latent space world models on the MinAtar Environment

Bachelor thesis (2024) - M. Mitrea (author) , F.A. Oliehoek (mentor) , Jinke He (mentor) , Mathijs M. De Weerdt (graduation committee member)

While model-free reinforcement learning (MFRL) approaches have been shown effective at solving a diverse range of environments, recent developments in model-based reinforcement learning (MBRL) have shown that it is possible to leverage its increased sample efficiency and generali ...

Task-Unaware Lifelong Robot Learning with Retrieval-based Weighted Local Adaptation

Master thesis (2024) - P. Yang (author) , Cong Wang (mentor) , J. Kober (mentor) , F.A. Oliehoek (mentor) , C.A. Raman (graduation committee member)

Real-world environments require robots to continuously acquire new skills while retaining previously learned abilities, all without the need for clearly defined task boundaries. Storing all past data to prevent forgetting is impractical due to storage and privacy concerns. To a ...

The Effects of Heuristic Optimisations on Planning Algorithms Within Cooperative AI

Cooperative Planning in Overcooked

Bachelor thesis (2023) - J.H.J. Herben (author) , Robert Loftin (mentor) , Frans A. Oliehoek (mentor) , K.A. Hildebrandt (graduation committee member)

Cooperative AI is AI designed to cooperate with humans. One example of such an AI, made using planning algorithms, was studied in a paper from 2019 which used a simplified version of the video game Overcooked for evaluation. However, only limited evaluations were possible due to ...

Scripted AI for Overcooked

Designing and Evaluating a Scripted AI Controller for Simplified Overcooked

Bachelor thesis (2023) - M.C. Anton (author) , Robert Loftin (mentor) , Frans A. Oliehoek (mentor) , K.A. Hildebrandt (graduation committee member)

Overcooked, an immersive multiplayer video game centered around cooperative cooking challenges, provides the roots for this research project. The study focuses on designing and evaluating a hand-authored controller in comparison to controllers implemented using various machine le ...

Cooperative AI for Overcooked

Multi-Agent RL with Population-Based Training

Bachelor thesis (2023) - I.N. Nestorov (author) , Robert Loftin (mentor) , Frans A Oliehoek (mentor) , K.A. Hildebrandt (graduation committee member)

In ad-hoc cooperative environments, the usage of artificial intelligence to take supportive roles and work in collaboration with humans has proven to be of great benefit. The objective of this research is to evaluate the use of population-based training for reinforcement learning ...

Getting AI to Cooperate: Sharing a Critic in a Video Game

Bachelor thesis (2023) - J.J.H. Groenendijk (author) , Robert Loftin (mentor) , Frans A. Oliehoek (mentor) , K.A. Hildebrandt (graduation committee member)

The popular video game "Overcooked" is a great example of a task requiring complex planning and cooperation with other players. This game is used as the inspiration for an environment for evaluating AI, called "Overcooked-AI". This paper implements a centralized critic into the O ...

Improvements in Imitation Learning for Overcooked

Bachelor thesis (2023) - D.P. Niemantsverdriet (author) , Robert Loftin (mentor) , Frans A Oliehoek (mentor) , K.A. Hildebrandt (graduation committee member)

Arguably the main goal of artificial intelligence is to create agents that can collaborate with humans to achieve a shared goal. It has been shown that agents that assume their partner to be optimal can converge to protocols that humans do not understand. Taking human suboptimali ...

Multi-objective Deep Reinforcement Learning for predictive maintenance of road networks

Master thesis (2023) - K. Krachtopoulos (author) , Frans Oliehoek (mentor) , C. P. Andriotis (mentor) , Robert Loftin (mentor) , J.W. Böhmer (graduation committee member)

Operation and maintenance of the built environment have a major effect on socioeconomic stability and sustainability. A significant part of our built world approaches or has well exceeded its designated structural life. As engineers, we need to find efficient ways to extend this ...

General Reinforcement Learning Agents for Crop Management

Master thesis (2023) - A. Theocharis (author) , Frans Oliehoek (mentor) , Matteo Turchetta (graduation committee member) , Andreas Krause (graduation committee member)

Agriculture plays a vital role in the global economy, providing the necessary food and resources for human survival. With the world’s population projected to surge, the demand for food is set to escalate in the coming decades. This increasing demand, coupled with the challenges p ...

Modelling Agents with Variational Autoencoders in Multi-Agent Sequential Decision Making

Master thesis (2023) - H.L. Lenferink (author) , Frans A Oliehoek (mentor) , E. Congeduti (mentor)

The ability to model other agents can be of great value in multi-agent sequential decision making problems and has become more accessible due to the introduction of deep learning into reinforcement learning. In this study, the aim is to investigate the usefulness of modelling oth ...

Pacing regulation for runners

Master thesis (2022) - J.E. Molano Valencia (author) , Frans Oliehoek (mentor) , M. T.J. Spaan (coach) , Sebastian Feld (coach) , R.A.N. Starre (graduation committee member) , A. Nijs (graduation committee member)

Increasing a runner's step frequency can reduce the risk of overload injuries. Techniques such as auditory pacing help athletes control their step frequency better. Nevertheless, synchronizing to a continuous external rhythm costs energy, so intermittent pacing may be more energy-efficient and more user-friendly for the athlete. We propose using experimental data from previous studies, which analyzed the response of runners to intermittent pacing, to find the most efficient approach for providing the pacing.
For this purpose we use reinforcement learning techniques to learn the target behavior. This behavior is represented as the target policy, and the experimental data is assumed to be sampled using a stochastic sampling policy. However, using only a single batch of initial training data presents a problem, because the difference between the initial sampling policy and the target policy being learned increases continuously. We therefore propose using the batch off-policy algorithm with state distribution correction (OPPOSD) presented in (Liu et al., 2019). This algorithm benefits from the sample efficiency characteristic of off-policy approaches and introduces a correction term to tackle the mismatch between the policies.
To train and evaluate the learned policies based on the algorithm, a pace behavior simulator was developed from the experimental data. A Markov Decision Problem (MDP) was defined on top of the simulator that determines the rules of the pacing environment that the algorithm is set to learn. After translating the experimental data into MDP-like transitions, the OPPOSD algorithm is able to learn a relatively good target policy for the pacing problem. In a future application, the trained model could be deployed for real runners while the policy continues to be improved with an on-policy or off-policy approach.
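The mismatch between a sampling policy and a target policy can be illustrated with a generic, self-normalised importance-weighted estimate of the target policy's average reward from a fixed batch. This is only a sketch of the general reweighting idea, not the OPPOSD algorithm itself. The names `Transition`, `state_ratio`, and `off_policy_average_reward` are hypothetical, and the state-distribution ratio is simply supplied as an input here, which is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence


@dataclass
class Transition:
    state: Hashable
    action: Hashable
    reward: float


def off_policy_average_reward(
    batch: Sequence[Transition],
    target_prob: Callable[[Hashable, Hashable], float],
    sampling_prob: Callable[[Hashable, Hashable], float],
    state_ratio: Dict[Hashable, float],
) -> float:
    """Self-normalised importance-weighted average reward.

    Each transition is weighted by state_ratio[s] * pi(a|s) / mu(a|s), where
    pi is the target policy and mu the sampling policy. state_ratio is an
    assumed, externally supplied ratio of state-visitation distributions.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for t in batch:
        policy_ratio = target_prob(t.state, t.action) / sampling_prob(t.state, t.action)
        weight = state_ratio[t.state] * policy_ratio
        weighted_sum += weight * t.reward
        weight_total += weight
    return weighted_sum / weight_total


# Toy batch: one state, two actions, sampled uniformly by the sampling policy.
batch = [
    Transition("s", "pace", 1.0),
    Transition("s", "pace", 1.0),
    Transition("s", "silent", 0.0),
    Transition("s", "silent", 0.0),
]
target = {"pace": 0.8, "silent": 0.2}
estimate = off_policy_average_reward(
    batch,
    target_prob=lambda s, a: target[a],
    sampling_prob=lambda s, a: 0.5,
    state_ratio={"s": 1.0},
)
# The plain batch mean is 0.5; the reweighted estimate is about 0.8,
# the expected reward of the target policy in this toy problem.
```

With a single state the state ratio is trivially 1.0; in multi-state problems this factor is what separates a state-distribution-corrected estimate from one that reweights actions only.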

Coupled and Model-based cooperative planning in Overcooked AI

Bachelor thesis (2022) - N. van Veen (author) , Robert Loftin (mentor) , Frans A Oliehoek (mentor) , SE Verwer (graduation committee member)

In the field of cooperative AI, an environment called Overcooked AI was created, based on the popular Overcooked game. Originally the environment was used to study deep reinforcement learning; however, it also allows for cooperative planning methods, of which the paper will ...