Visio-Verbal Teleimpedance
A Gaze and Speech-Driven VLM Interface for Human-Centric Semi-Autonomous Robot Stiffness Control
Abstract
Three-year-old toddlers can effortlessly guide a toy train along a wooden track, whereas the same slide-in-the-groove position tracking task demands a skilled operator when performed with a teleoperated robot arm, owing to the lack of direct contact and force feedback. Although an autonomous robot can perform this task in a fixed setup, telerobotics remains essential in unknown environments, where human operators provide the adaptability needed to handle unpredictable conditions. The introduction of torque-controlled motors and haptic devices has enhanced teleoperation by improving telepresence and immersion. Operators perceive interaction forces through the primary position control input on the haptic device, while a secondary control input lets them adjust the robot arm's impedance. This ability, known as teleimpedance, allows operators to tune the robot's physical interaction to the environmental context. A toddler naturally stays relaxed in the plane perpendicular to the train's forward direction, where gravity and the groove sides provide stability, preventing derailment and wheel damage, while keeping the arm firm along the track for smooth forward movement. Teleimpedance enables the operator to achieve a similar balance, adapting low and high impedance along different Cartesian axes to match task demands.

Current impedance control interfaces rely on complex muscle activity measurements that require long calibration procedures to map the operator's arm stiffness to the robot arm. Other interfaces use hand-controlled input devices that must be operated in addition to the haptic device, reducing the operator's cognitive bandwidth for the position tracking task. Existing interfaces typically provide only partial stiffness control or introduce visual distractions. In contrast, we propose a novel visio-verbal interface that leverages gaze and speech, natural modes of interaction, to enable hands-free, semi-autonomous control of translational stiffness in all three dimensions while the operator keeps visual attention on the position tracking task. The interface's vision-language model (VLM) determines the three-dimensional robot endpoint stiffness by combining the operator's verbal intent with gaze estimates from a mobile eye tracker.

We demonstrate a proof of concept of this approach. The hardware comprises a Tobii Pro Glasses 2 mobile eye tracker, a Force Dimension sigma7 haptic position input interface, and a KUKA LBR iiwa collaborative robot arm fitted with a custom-built endpoint mount holding an Intel RealSense D455 camera and a 3D-printed peg. The interface is evaluated with a 3D-printed U-shaped slot in a slide-in-the-groove task similar to guiding a toy train.
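To make the stiffness pipeline concrete, the sketch below shows one way a gaze point and a transcribed utterance could be combined into a diagonal Cartesian stiffness matrix via a VLM query. It is a minimal illustration, not the thesis implementation: the prompt format, the JSON schema, the stiffness bounds, and the helper names (build_prompt, query_vlm, stiffness_matrix) are assumptions, and the VLM call is stubbed with a canned response so the example runs end to end.

```python
"""Illustrative sketch (not the thesis implementation): turning a gaze
estimate and a spoken command into a diagonal Cartesian stiffness matrix
via a vision-language model. Names and value ranges are assumptions."""

import json
import numpy as np

STIFFNESS_MIN, STIFFNESS_MAX = 50.0, 2000.0  # assumed N/m clamping bounds


def build_prompt(gaze_px: tuple, utterance: str) -> str:
    """Compose the text part of the query; the endpoint-camera image would
    be attached to the VLM request alongside this text."""
    return (
        f"The operator is looking at pixel {gaze_px} in the attached image "
        f"and said: '{utterance}'. Return JSON with translational stiffness "
        f"in N/m as {{\"kx\": ..., \"ky\": ..., \"kz\": ...}}."
    )


def query_vlm(prompt: str) -> str:
    """Stand-in for a real VLM call (e.g. an HTTP request to a hosted model).
    Returns a canned JSON reply so the sketch is self-contained."""
    return '{"kx": 1200, "ky": 150, "kz": 150}'


def stiffness_matrix(response_text: str) -> np.ndarray:
    """Parse the VLM reply and build a clamped diagonal stiffness matrix."""
    gains = json.loads(response_text)
    k = np.clip([gains["kx"], gains["ky"], gains["kz"]],
                STIFFNESS_MIN, STIFFNESS_MAX)
    return np.diag(k)


if __name__ == "__main__":
    prompt = build_prompt((640, 360), "stiff along the groove, soft sideways")
    K = stiffness_matrix(query_vlm(prompt))
    print(K)  # 3x3 diagonal matrix passed to a Cartesian impedance controller
```

In such a scheme, the gaze point grounds the verbal command in the scene (which groove, which direction), and the resulting diagonal matrix would be sent to the robot's Cartesian impedance controller, for example firm along the track axis and compliant in the perpendicular plane.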