Decreasing the number of demonstrations required for Inverse Reinforcement Learning by integrating human feedback
Abstract
The core idea of reinforcement learning is that an agent takes actions and is rewarded or punished for them. In real-world tasks, however, the reward structure can be quite complicated, and the contribution of different factors to the reward function is often unknown. This problem gives rise to reward learning: the process of learning the reward function of an environment. Several techniques exist for reward learning, and they fall into two high-level categories: learning from demonstrations and learning from feedback. Inverse Reinforcement Learning (IRL) learns from demonstrations, whereas Reinforcement Learning from Human Feedback (RLHF) learns from feedback.
In this paper, we propose training a reward learning agent first with IRL and then with RLHF. IRL can learn a reward function quickly, but it suffers when the expert demonstrations are sub-optimal. RLHF, in contrast, is slower at learning a reward function from scratch. We therefore integrate RLHF as a way to fine-tune the initial reward function learned by IRL, aiming to alleviate the negative effect of sub-optimal expert demonstrations on IRL.
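To illustrate the fine-tuning stage, the sketch below shows one common way such an RLHF step can be implemented: a reward network already initialised by an IRL stage is further trained on pairwise human preferences over trajectory segments using a Bradley-Terry style cross-entropy loss. This is a hedged illustration, not the thesis's exact implementation; the names `RewardNet`, `segment_return`, `rlhf_finetune`, and the shape of the preference data are assumptions made for the example.

```python
# Illustrative sketch (not the paper's exact code): fine-tuning an
# IRL-initialised reward network with pairwise human preferences.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Small state-action reward model; stands in for the IRL-learned reward."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def segment_return(reward_net: RewardNet, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
    # Sum predicted per-step rewards over the time dimension of a segment.
    return reward_net(obs, act).sum(dim=-1)


def rlhf_finetune(reward_net: RewardNet, preference_batches, epochs: int = 5, lr: float = 1e-4):
    """Fine-tune a pre-trained reward net on pairwise preference labels."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(epochs):
        for (obs_a, act_a), (obs_b, act_b), pref_b in preference_batches:
            # pref_b is 1.0 if the human preferred segment B over segment A, else 0.0.
            r_a = segment_return(reward_net, obs_a, act_a)
            r_b = segment_return(reward_net, obs_b, act_b)
            # Bradley-Terry model: P(B preferred) = sigmoid(R_B - R_A).
            logits = r_b - r_a
            loss = nn.functional.binary_cross_entropy_with_logits(logits, pref_b)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reward_net
```

In this sketch, a preference batch is a pair of trajectory segments with tensors of shape (batch, time, dim) plus a binary label per pair; in the actual pipeline, those labels would come from a human (or simulated) annotator comparing rollouts generated under the current policy.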
We evaluate our methodology on the cart-pole environment from the seals library and compare it against reward learning from expert demonstrations alone, without human feedback (i.e. IRL only). The results suggest that RLHF may in fact not be a good complement to IRL, particularly when the expert demonstrations are sub-optimal: applying RLHF on top of IRL can even degrade the performance of the resulting reward function, which challenges our initial hypothesis about the complementarity of the two methods.
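One way such a comparison can be carried out is sketched below: policies trained under the IRL-only reward and under the IRL+RLHF reward are each rolled out in the seals cart-pole environment, and their ground-truth returns are averaged. This is an assumption-laden sketch, not the thesis's evaluation code; it assumes a seals release that registers its environments with Gymnasium under the id "seals/CartPole-v0", and the policies `policy_irl` and `policy_irl_rlhf` are hypothetical placeholders.

```python
# Illustrative evaluation sketch: compare policies trained on the IRL-only
# reward vs. the IRL+RLHF reward by their ground-truth return on seals CartPole.
import gymnasium as gym
import numpy as np
import seals  # noqa: F401  (importing seals registers the "seals/*" environment ids)


def true_return(policy, env_id: str = "seals/CartPole-v0", episodes: int = 20) -> float:
    """Average ground-truth episodic return of `policy` (a callable obs -> action)."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_ret += reward
            done = terminated or truncated
        returns.append(ep_ret)
    env.close()
    return float(np.mean(returns))


# Example comparison (the two policies are placeholders for illustration):
# print("IRL only  :", true_return(policy_irl))
# print("IRL + RLHF:", true_return(policy_irl_rlhf))
```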