This work addresses visual localization of intelligent vehicles as an alternative to traditional GPS- or HD map-based localization. Specifically, we explore the problem of Cross-View Pose Estimation (CVPE), which involves estimating the vehicle pose within an encompassing aerial patch, given a ground image from the on-board camera feed. The aerial patch containing the ground-truth pose can be obtained through a rough localization prior, such as GPS. We find that existing CVPE methods start with a location prior that is too coarse given both the performance of GPS and the required localization error. Therefore, we define a fine-grained localization setting and propose three approaches, targeting performance, interpretability, and data efficiency. Furthermore, the approaches are uniquely capable of predicting a 6-DoF camera pose. Two approaches match point-level local features in 3D space using a novel point cross-attention, while the third tailors an existing dense feature matching method to the fine-grained setting. Although the local feature matching approaches perform quantitatively worse than the state of the art, the improved dense-feature baseline establishes a new state of the art in the fine-grained setting. We also show the key limitations of local feature matching, namely the influence of “unmatchable” queries. Furthermore, using a 6-DoF projective transformation, we discover severe issues with the ground-truth quality of the KITTI dataset, which is commonly used in the CVPE literature; these issues may account to a large degree for the substandard performance of most available CVPE methods. Finally, our local feature matching methods can predict the pitch and roll angles of the camera, which has not yet been attempted in CVPE.