Accurate global localization is critical for autonomous driving and robotics, but GNSS-based approaches often degrade under occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF pose of a ground-view camera with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views, mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through a dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to facilitate horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to mitigate vertical misalignment, explicitly bridging the viewpoint gap. To further strengthen view invariance, we introduce a view-reconstruction loss that encourages the derived representations to reconstruct both the original and the cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and by 18.0% and 46.8% on VIGOR, respectively.
VIRD bridges the viewpoint gap between ground and satellite views by constructing view-invariant representations through three jointly optimized components.
(1) Dual-axis transformation. The viewpoint gap spans both the horizontal and vertical axes of the image plane. Along the horizontal axis, VIRD applies a polar transformation to the satellite view, mapping the azimuth direction onto the horizontal axis to establish consistent cross-view correspondence. The remaining vertical discrepancy is handled by context-enhanced positional attention (CEPA), which aligns ground and polar-transformed satellite features along the vertical axis via positional attention. Unlike prior geometry-based approaches, CEPA learns a view-consistent vertical transformation without requiring camera parameters and adaptively captures vertical structures in the ground view by leveraging contextual cues.
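The snippet below is a minimal sketch of the polar transformation, assuming a square satellite image centered on the coarse query location; the output resolution and nearest-neighbor sampling are illustrative choices rather than the paper's exact configuration, and the learned CEPA step that follows it is omitted.

```python
# Minimal sketch: polar transformation of a satellite image so that the
# horizontal axis of the output indexes azimuth, mimicking the column
# layout of a ground-view panorama. Output size is an assumed choice.
import numpy as np

def polar_transform(sat: np.ndarray, out_h: int = 128, out_w: int = 512) -> np.ndarray:
    """Map a (H, W, C) satellite image to polar coordinates around its center."""
    h, w, _ = sat.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cy, cx)

    # Rows index radius (near the camera at the bottom, as in a panorama),
    # columns index azimuth over the full 360 degrees.
    rows = np.arange(out_h)
    cols = np.arange(out_w)
    r = (out_h - 1 - rows) / (out_h - 1) * max_r          # (out_h,)
    theta = cols / out_w * 2.0 * np.pi                    # (out_w,)

    # Nearest-neighbor sampling of the source pixel for every (r, theta).
    ys = np.clip(np.round(cy - r[:, None] * np.cos(theta)[None, :]), 0, h - 1).astype(int)
    xs = np.clip(np.round(cx + r[:, None] * np.sin(theta)[None, :]), 0, w - 1).astype(int)
    return sat[ys, xs]
```

Because columns of the polar-transformed image and of a ground-view panorama both index azimuth, horizontal correspondence reduces to column alignment, leaving only the vertical distortion for CEPA to resolve.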
(2) View-reconstruction loss. To further strengthen view invariance, the representations are trained to reconstruct both the original and the cross-view images. This reconstruction loss encourages the representations to focus on spatial structures shared across views rather than on view-specific appearance.
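A minimal sketch of this objective is given below, assuming hypothetical decoders `dec_g` and `dec_s` that map a shared representation back to ground and satellite image space; the actual decoder architecture, reconstruction metric, and loss weighting are not specified by this summary.

```python
import torch.nn.functional as F

def view_reconstruction_loss(z_g, z_s, img_g, img_s, dec_g, dec_s):
    """Encourage representations from either view to reconstruct both
    the original image and its cross-view counterpart."""
    # Same-view reconstruction: each representation recovers its own image.
    loss = F.l1_loss(dec_g(z_g), img_g) + F.l1_loss(dec_s(z_s), img_s)
    # Cross-view reconstruction: representations swap decoders, so only
    # structure shared across both views can drive the reconstruction.
    loss = loss + F.l1_loss(dec_g(z_s), img_g) + F.l1_loss(dec_s(z_g), img_s)
    return loss
```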
(3) Matching and regression. A matching module computes the cosine similarity between the ground representation and each candidate satellite representation to estimate a coarse pose. A regression module then refines this estimate by predicting a residual pose offset, yielding the final 3-DoF camera pose.
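The sketch below illustrates this coarse-to-fine scheme, assuming representations already pooled into vectors and a hypothetical residual regressor `refine_net`; the candidate pose grid and the form of the regressor are illustrative, not the paper's exact modules.

```python
import torch
import torch.nn.functional as F

def estimate_pose(g_repr, sat_reprs, candidate_poses, refine_net):
    """g_repr:          (D,)   ground-view representation
    sat_reprs:          (N, D) representations of N candidate satellite poses
    candidate_poses:    (N, 3) (x, y, yaw) hypotheses on the satellite map
    refine_net:         module mapping the matched pair to (dx, dy, dyaw)"""
    # Coarse stage: pick the candidate with the highest cosine similarity.
    sims = F.cosine_similarity(g_repr[None, :], sat_reprs, dim=1)  # (N,)
    best = sims.argmax()
    coarse = candidate_poses[best]

    # Fine stage: regress a residual offset from the matched pair.
    residual = refine_net(torch.cat([g_repr, sat_reprs[best]], dim=0))
    return coarse + residual  # final 3-DoF pose (x, y, yaw)
```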
Reduction in median errors vs. state-of-the-art without orientation priors (EfficientNet-B0, cross-area setting).
@inproceedings{park2026vird,
title = {VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation},
author = {Park, Juhye and Lee, Wooju and Hong, Dasol and Sung, Changki and Seo, Youngwoo and Kang, Dongwan and Myung, Hyun},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}