<div class="section abstract"><div class="htmlview paragraph">Many learning-based methods estimate ego-motion from visual sensors. However,
visual sensors are sensitive to intense lighting variations and degrade in
textureless scenes. 4D radar, an emerging automotive sensor, complements visual sensors
effectively due to its robustness in adverse weather and lighting conditions.
This paper presents an end-to-end 4D radar-visual odometry (4DRVO) approach that
combines sparse point cloud data from 4D radar with image information from
cameras. Using the Feature Pyramid, Pose Warping, and Cost Volume (PWC) network
architecture, we extract 4D radar point features and image features at multiple
scales. We then employ a hierarchical iterative refinement approach to supervise
the estimated pose. We propose a novel Cross-Modal Transformer (CMT) module to
effectively fuse the 4D radar point modality, image modality, and 4D radar
point-image connection modality at multiple scales, achieving cross-modal
feature interaction and multi-modal feature fusion. Additionally, we design a
point confidence estimation module to mitigate the impact of dynamic objects on
odometry estimation. Extensive experiments on the View-of-Delft (VoD) dataset
demonstrate the strong performance and effectiveness of the proposed 4D
radar-visual odometry method.</div></div>