MultiViT: 2D to 3D transfer learning using a jointly optimized Vision Transformer without the need for image labels


Abstract

2D to 3D transfer learning has proven to be an effective method to transfer strong 2D representations into a 3D network. Recent advancements show that casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, such as through range projection, enables the processing of 3D point clouds with a Vision Transformer (ViT). For autonomous vehicles, collecting camera data alongside point cloud data is inexpensive. In this work, we investigate how we can best use these available 2D images, by exploring the benefit of camera pretraining to LiDAR semantic segmentation by using a shared ViT backbone. We propose a label generation method that generates pixel-wise pseudo labels for the camera images from the LiDAR annotations. We show that we can effectively use these for pretraining before finetuning on point cloud semantic segmentation. Besides, we compare jointly optimizing a shared ViT encoder for image and point cloud semantic segmentation simultaneously, with finetuning on point cloud semantic segmentation after pretraining on the camera domain. We show that joint optimization can lead to improved performance compared to pretraining when training from scratch. By using a shared encoder for the camera and LiDAR domain we can investigate the joint Camera-LiDAR feature space and find it is possible to create a shared feature space where LiDAR and camera features from the same class are mapped to the same location in the feature space. However, this does not contribute to a better performance on LiDAR semantic segmentation. These experiments provide valuable insight into 2D to 3D transfer learning and the creation of a shared Camera and LiDAR feature space.