Multi-class road user detection with next-generation 3+1D radars (range, azimuth, elevation, and Doppler) has been shown to be feasible, thanks to the increased density of their point clouds and the inclusion of elevation information. However, object detection networks that use 64-layer LiDAR point clouds still dominate the performance metrics. In this work, we explore the potential of fusing a 3+1D radar point cloud with a monocular image to further close this performance gap in 3D object detection. We propose a generic and modular fusion architecture that extracts both spatial and semantic cues from an RGB image to complement the radar point cloud. In a two-stage approach, we first generate a 3D point cloud representation of the input monocular image, appended with semantic information, through our proposed RAID (RAdar guided Instance-aware Depth) network. RAID takes as input a radar depth map together with a monocular depth map and panoptic masks, both of which can be predicted by any pre-trained state-of-the-art network. We then append the resulting point cloud to the 3+1D radar point cloud in a straightforward fusion scheme and train a point cloud-based object detection network. Results on the View-of-Delft dataset [1] show that our fusion approach significantly outperforms multiple state-of-the-art radar-camera fusion methods (proposed fusion vs. best baseline: 53.6 mAP vs. 50.8 mAP), and yields performance comparable to a network trained on LiDAR input when evaluated in the safety-critical driving corridor (80.5 mAP vs. 81.6 mAP).
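
To make the second stage concrete, the following is a minimal sketch of how a dense depth map with per-pixel class labels could be lifted to a pseudo point cloud and appended to a radar point cloud. It is not the RAID network: it assumes a refined depth map is already available, uses a plain pinhole back-projection, and assumes both point clouds are already expressed in the same coordinate frame. The function names, the intrinsic matrix `K`, and the fused feature layout `[x, y, z, doppler, class_id, is_radar]` are hypothetical choices made for illustration only.

```python
import numpy as np


def backproject_depth(depth, K, class_ids):
    """Lift a depth map (H, W) to 3D points with a class id per point.

    Assumes a pinhole camera with intrinsics K (3x3) and depth along the
    optical axis. Pixels with non-positive depth are treated as invalid.
    Returns an array of shape (M, 4): x, y, z, class_id.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns and rows
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z, class_ids[valid].astype(np.float64)], axis=1)


def fuse_point_clouds(radar_xyzd, pseudo_xyzc):
    """Append image-derived points to radar points in one feature layout.

    radar_xyzd:  (N, 4) array of x, y, z, doppler (hypothetical layout).
    pseudo_xyzc: (M, 4) array of x, y, z, class_id.
    Returns (N + M, 6): x, y, z, doppler, class_id, is_radar.
    Missing features are padded: radar points get class_id = -1,
    image-derived points get doppler = 0 (both are assumptions).
    """
    n_r, n_p = len(radar_xyzd), len(pseudo_xyzc)
    radar = np.column_stack([radar_xyzd, np.full(n_r, -1.0), np.ones(n_r)])
    pseudo = np.column_stack(
        [pseudo_xyzc[:, :3], np.zeros(n_p), pseudo_xyzc[:, 3], np.zeros(n_p)]
    )
    return np.vstack([radar, pseudo])


# Toy example: a 2x2 depth map with one invalid pixel and two radar points.
K = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
depth = np.array([[1.0, 0.0], [2.0, 2.0]])
class_ids = np.array([[3, 0], [1, 1]])
radar = np.array([[5.0, 0.0, 0.5, 1.2], [8.0, -1.0, 0.4, -0.3]])

pseudo = backproject_depth(depth, K, class_ids)
fused = fuse_point_clouds(radar, pseudo)
print(fused.shape)
```

The fused array can then be passed to any point cloud-based detector that accepts per-point features. The `is_radar` flag is one simple way to let such a network distinguish the two sources, and the padding values are placeholders rather than choices taken from the paper.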