SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation

Lentsch, Ted; Xia, Z.; Caesar, Holger; Kooij, J. F. P.

doi:10.1109/CVPR52729.2023.01652

Conference paper (2023)

Authors

Ted Lentsch (Intelligent Vehicles)

Z. Xia (Intelligent Vehicles)

Holger Caesar (Intelligent Vehicles)

J. F. P. Kooij (Intelligent Vehicles)

Research Group

Intelligent Vehicles

DOI: https://doi.org/10.1109/CVPR52729.2023.01652

To reference this document use:

http://resolver.tudelft.nl/uuid:735cc5c8-c107-40cc-911a-e82cc62b18a9

Published Date

2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This work addresses cross-view camera pose estimation, i.e., determining the 3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial image of the local area. We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images. Given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-view-guided aerial feature selection and uses the geometric projection of the ground camera's viewing frustum on the aerial image to pool features. The efficient construction of aerial descriptors is achieved using precomputed masks. SliceMatch is trained using contrastive learning, and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors. Compared to the state of the art, SliceMatch achieves a 19% lower median localization error on the VIGOR benchmark using the same VGG16 backbone at 150 frames per second, and a 50% lower error when using a ResNet50 backbone.
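
The matching step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one precomputed mask per candidate pose, plain average pooling inside each mask, cosine similarity as the comparison, and an InfoNCE-style loss as one possible form of contrastive learning. The feature extractors and the cross-view attention module are omitted, and all function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def masked_average_pool(features, masks):
    """Pool an aerial feature map into one descriptor per candidate pose.

    features: (H, W, C) dense aerial features.
    masks:    (P, H, W) precomputed masks, one per candidate pose
              (assumption: each mask marks the projected viewing frustum).
    Returns:  (P, C) pose-dependent aerial descriptors.
    """
    weights = masks.reshape(masks.shape[0], -1).astype(np.float64)  # (P, H*W)
    flat = features.reshape(-1, features.shape[-1])                 # (H*W, C)
    summed = weights @ flat                                         # (P, C)
    counts = weights.sum(axis=1, keepdims=True)                     # (P, 1)
    return summed / np.maximum(counts, 1e-8)


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, 1e-8)


def select_pose(ground_descriptor, aerial_descriptors, candidate_poses):
    """Return the candidate pose whose aerial descriptor is most similar.

    ground_descriptor:  (C,) single ground descriptor.
    aerial_descriptors: (P, C) one descriptor per candidate pose.
    candidate_poses:    (P, 3) rows of (x, y, orientation).
    """
    sims = l2_normalize(aerial_descriptors) @ l2_normalize(ground_descriptor)
    best = int(np.argmax(sims))
    return candidate_poses[best], sims


def infonce_loss(similarities, positive_index, temperature=0.1):
    """InfoNCE-style contrastive loss over the candidate poses (assumed form).

    The candidate at positive_index is treated as the positive sample and
    every other candidate as a negative.
    """
    logits = similarities / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive_index])
```

Because the masks depend only on the candidate poses and not on the image content, they can be computed once and reused, so pooling all candidates reduces to a single matrix product in this sketch.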

Files

SliceMatch_Geometry_Guided_Agg... (pdf)

(pdf | 1.26 MB)

- Embargo expired on 22-02-2024

Unknown license