Reinforcement Learning from Human Feedback (RLHF) offers a powerful approach to training agents in environments where defining an explicit reward function is challenging, by learning instead from human feedback provided in various forms. This research evaluates three common feedback types within RLHF: scalar feedback, binary comparison feedback, and binary comparison with a preference-strength margin. Synthetic feedback is used in place of real human feedback to address cost and time constraints. Simplified RLHF setups using Q-learning are first implemented in a grid environment to validate the methods; subsequent experiments are conducted in more complex environments using the Imitation library and PPO from Stable Baselines3. Our findings demonstrate the efficacy of the different feedback types and highlight the trade-off between ease of use for human feedback providers and the amount of information conveyed. This comparative analysis provides insights into optimizing RLHF systems for improved agent performance. Full code is available online as supplementary material at https://github.com/navimakarov/rlhf-feedback-variety.
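As a rough illustration of the three feedback types compared in this work, the sketch below generates synthetic feedback from ground-truth segment returns. The function names, the Bradley-Terry-style noise model, and the margin thresholds are illustrative assumptions for exposition, not the exact implementation in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def scalar_feedback(segment_return, low, high):
    """Scalar feedback: the synthetic rater reports a bounded rating
    proportional to the segment's ground-truth return."""
    return np.clip((segment_return - low) / (high - low), 0.0, 1.0)

def binary_comparison(return_a, return_b, beta=1.0):
    """Binary comparison: sample the preferred segment (0 = A, 1 = B)
    from a Bradley-Terry model over the ground-truth returns."""
    p_a = 1.0 / (1.0 + np.exp(-beta * (return_a - return_b)))
    return 0 if rng.random() < p_a else 1

def comparison_with_margin(return_a, return_b, weak=1.0, strong=5.0):
    """Binary comparison with a preference-strength margin: report the
    preferred segment plus how strongly it is preferred, based on the
    return gap (thresholds here are assumed values)."""
    preferred = 0 if return_a >= return_b else 1
    gap = abs(return_a - return_b)
    if gap < weak:
        strength = "weak"
    elif gap < strong:
        strength = "moderate"
    else:
        strength = "strong"
    return preferred, strength

# Example: two trajectory segments with known returns.
ret_a, ret_b = 7.5, 3.0
print(scalar_feedback(ret_a, low=0.0, high=10.0))   # 0.75
print(binary_comparison(ret_a, ret_b))              # usually 0 (A preferred)
print(comparison_with_margin(ret_a, ret_b))         # (0, 'moderate')
```

The sketch makes the trade-off concrete: scalar feedback carries the most information per query but demands a calibrated rating from the provider, plain comparisons are the easiest to give but convey only one bit, and the margin variant sits in between.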