This paper presents an encoder-decoder-style convolutional neural network (CNN) for improving stereo depth estimation (SDE) estimates by combining them with the corresponding monocular estimates through a fusion network, assisted by prior information that provides context for the fusion. Video cameras are commonly used for depth perception in robotics, especially in weight-sensitive applications such as Micro Aerial Vehicles (MAVs). The two primary paradigms for vision-based depth perception are monocular and stereo depth or disparity estimation, each with its own strengths and weaknesses. These strengths and weaknesses appear to be complementary, so a fusion of the two may yield more accurate predictions. In this paper, we investigate this fusion by training a CNN that combines stereo and monocular depth or disparity estimates. The fusion network is agnostic to the choice of input networks, which provides great flexibility. We find that such a fusion network, while increasing the computational complexity of the depth perception pipeline, does improve the accuracy of the estimates. The number of outlier predictions is significantly reduced, and some fundamental limitations of both stereo and monocular methods, such as errors arising from occluded regions, are mitigated.