
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

⭐CVPR 2026 Highlight⭐
¹Urban Robotics Lab, School of Electrical Engineering, KAIST    ²Hanwha Aerospace
TL;DR: VIRD explicitly resolves the significant viewpoint gap between ground and satellite views through a dual-axis transformation and a view-reconstruction loss, achieving state-of-the-art cross-view pose estimation without orientation priors.
VIRD qualitative results
Qualitative results of VIRD. Context-enhanced positional attention (CEPA) weights (second column) consistently highlight structures along the vertical image axis, such as building rooftops and road surfaces. The view-reconstruction outputs (third and fourth columns) show that the learned descriptors successfully capture cross-view consistent layouts. The final column shows pose estimation results overlaid on the satellite view.

Abstract

Accurate global localization is critical for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF pose of a ground-view camera with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views, mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to facilitate horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to mitigate vertical misalignment, explicitly bridging the viewpoint gap. To further strengthen view invariance, we introduce a view-reconstruction loss that encourages the derived representations to reconstruct both the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and by 18.0% and 46.8% on VIGOR, respectively.

Method

VIRD bridges the viewpoint gap between ground and satellite views by constructing view-invariant descriptors through three jointly optimized components.

(1) Dual-axis transformation. The viewpoint gap spans both the horizontal and vertical axes of the image plane. For the horizontal axis, VIRD applies a polar transformation to the satellite view, mapping the azimuth direction onto the horizontal axis to establish consistent cross-view correspondence. The remaining vertical discrepancy is addressed by context-enhanced positional attention (CEPA), which aligns the ground and polar-transformed satellite features along the vertical axis. Unlike prior geometry-based approaches, CEPA learns a view-consistent vertical transformation without camera parameters and adaptively captures vertical structures in the ground view by leveraging contextual cues.
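
The polar step can be made concrete with a short sketch. The code below is a minimal illustration in PyTorch, not the authors' implementation: it assumes the camera sits at the satellite-image center and that azimuth zero points north; polar_transform and its output-size arguments are hypothetical names.

import math
import torch
import torch.nn.functional as F

def polar_transform(sat, out_h, out_w):
    # sat: (B, C, S, S) satellite image, camera assumed at the image center.
    # Output rows run from far (top) to near (bottom) so the horizon sits
    # at the top, as in a ground view; columns sweep azimuth in [0, 2*pi).
    B = sat.shape[0]
    device = sat.device
    theta = torch.linspace(0.0, 2.0 * math.pi, out_w, device=device)
    radius = torch.linspace(1.0, 0.0, out_h, device=device)  # normalized
    rr, tt = torch.meshgrid(radius, theta, indexing="ij")
    # Source coordinates in [-1, 1] for grid_sample (x right, y down);
    # azimuth 0 maps to north, i.e., "up" in the satellite image.
    x = rr * torch.sin(tt)
    y = -rr * torch.cos(tt)
    grid = torch.stack([x, y], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(sat, grid, align_corners=True)

After this warp, a yaw rotation of the camera corresponds to a horizontal shift of the polar image, which is what makes horizontal correspondence with the ground view tractable; the vertical axis still mixes radial distance with perspective effects, and this is the residual misalignment CEPA is trained to absorb.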

(2) View-reconstruction loss. To further strengthen view invariance, the descriptors are trained to reconstruct both original and cross-view images. This reconstruction loss encourages the representations to focus on spatial structures shared across views, rather than view-specific appearance.
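
As a rough sketch of this objective (not the paper's exact formulation), assume a small decoder maps each view's descriptor to an image of either view; the L1 photometric error and the tensor names below are assumptions.

import torch.nn.functional as F

def view_reconstruction_loss(g2g, g2s, s2s, s2g, ground_img, sat_img):
    # Each descriptor is decoded into its own view and the opposite view;
    # penalizing both terms pushes the descriptors toward view-shared
    # layout rather than view-specific appearance.
    return (F.l1_loss(g2g, ground_img) + F.l1_loss(s2g, ground_img) +
            F.l1_loss(s2s, sat_img) + F.l1_loss(g2s, sat_img))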

(3) Matching and regression. A matching module computes cosine similarity between ground and candidate satellite descriptors to estimate a coarse pose. A regression module then refines this estimate by predicting residual pose offsets, yielding the final 3-DoF camera pose.
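
A schematic of this two-stage readout is sketched below, under assumed descriptor shapes; pooling each candidate into a single vector and the MLP head are illustrative choices, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_match(ground_desc, cand_descs):
    # ground_desc: (B, D); cand_descs: (B, K, D), one per candidate pose.
    # Cosine similarity = dot product of L2-normalized descriptors.
    sim = torch.einsum("bkd,bd->bk",
                       F.normalize(cand_descs, dim=-1),
                       F.normalize(ground_desc, dim=-1))
    return sim.argmax(dim=-1)  # index of the best candidate per sample

class ResidualPoseHead(nn.Module):
    # Regresses a residual (dx, dy, dyaw) from the concatenated ground and
    # best-matching satellite descriptors to refine the coarse pose.
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, 3))

    def forward(self, ground_desc, matched_desc):
        return self.mlp(torch.cat([ground_desc, matched_desc], dim=-1))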

VIRD motivation
Ground- and satellite-view images exhibit a large viewpoint gap along both horizontal and vertical axes. Previous geometry-based methods only partially resolve the horizontal discrepancy and struggle with vertical structures. VIRD overcomes both through dual-axis transformation.
VIRD pipeline
Overview of VIRD. The framework consists of (1) dual-axis descriptor construction with vertical directional encoding, (2) view-reconstruction loss, and (3) matching and regression for final 3-DoF pose estimation.

Results

KITTI: median position error ↓ 50.7% · median orientation error ↓ 76.5%
VIGOR: median position error ↓ 18.0% · median orientation error ↓ 46.8%

Reduction in median errors vs. state-of-the-art without orientation priors (EfficientNet-B0, cross-area setting).

KITTI results table
Table 1. Position and orientation estimation errors on the KITTI dataset without orientation priors.
VIGOR results table
Table 2. Position and orientation estimation errors on the VIGOR dataset.

Video

BibTeX

@inproceedings{park2026vird,
  title     = {VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation},
  author    = {Park, Juhye and Lee, Wooju and Hong, Dasol and Sung, Changki and Seo, Youngwoo and Kang, Dongwan and Myung, Hyun},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}