Many recent robot learning problems, both real and simulated, have been addressed using deep reinforcement learning. The resulting policies can deal with high-dimensional, continuous state and action spaces, and can also incorporate machine-generated or human demonstration data. A great number of these methods, especially those in the actor-critic framework, depend on state-action value estimates. Deriving unbiased estimates for these values is still an open research question, mostly because the connection between accurate value estimates and system performance is not yet well understood. This thesis makes three main research contributions. Firstly, it analyzes the connection between value estimates and performance for the TD3 algorithm. Secondly, it derives theoretical bounds on the true value function for environments where a reward is given only upon successful completion of a task (sparse/binary reward). Lastly, a deliberate underestimation objective is added to the TD3 algorithm, together with the theoretical bounds, to improve performance when using human demonstration data that covers only a specific part of the state and action space. All algorithms are tested and evaluated on simulated robot manipulation tasks in the robosuite environment, where the robot is first trained on the demonstration data and can then gather additional experience in simulation. Results show that the deliberate underestimation, together with the value bounds, enables the robot to learn from human demonstrations, which was not possible with standard TD3. Additionally, applying the value bounds alone speeds up learning when using machine-generated datasets.
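To illustrate the kind of bound referred to above, consider a minimal sketch under the following assumptions (the exact bounds derived in the thesis are not reproduced here): the reward is binary, with $r_t = 1$ received only at the time step in which the task is completed and $r_t = 0$ otherwise, and the discount factor satisfies $\gamma \in (0, 1)$. The discounted return then contains at most one non-zero term, so the true state-action value is confined to

\[
0 \le Q^{\pi}(s, a) \le \gamma^{k-1} \le 1,
\]

where $k \ge 1$ is taken to be the minimum number of actions, including $a$, needed to complete the task from state $s$. Interval constraints of this form can be used, for example, to clip overestimated target values during training.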