“…It is aimed at establishing correspondences, either spatial or semantic, between instances from different modalities [3]. To achieve this, many methods use camera intrinsics to establish the spatial correspondence between pixels and points, and then either align the per pixel-point features or fuse the raw data [25,62,65,71,77,80]. Some works utilize depth information to project image features into 3D space and then fuse them with point-wise features [20,42,73,75,76].…”
[Figure labels: Image/Video, 2D Heatmap Grounding; Point Cloud, 3D Heatmap Grounding; figure citations [13,34,47], [11,44,69]]
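
The two alignment strategies described in the excerpt can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with no lens distortion and points already expressed in the camera coordinate frame; the function names (`project_point`, `backproject_pixel`, `pair_features`) and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical and chosen for illustration, not taken from any of the cited works. Concatenation stands in for whatever fusion operator a given method actually uses.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]  # (column, row)


def project_point(p: Point3D, fx: float, fy: float, cx: float, cy: float,
                  width: int, height: int) -> Optional[Pixel]:
    """Project a camera-frame 3D point to a pixel with the pinhole model.

    Returns None if the point is behind the camera or outside the image.
    """
    x, y, z = p
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    col, row = int(round(u)), int(round(v))  # nearest-pixel lookup
    if 0 <= col < width and 0 <= row < height:
        return (col, row)
    return None


def backproject_pixel(col: float, row: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift a pixel with known depth into 3D (inverse of the projection)."""
    x = (col - cx) * depth / fx
    y = (row - cy) * depth / fy
    return (x, y, depth)


def pair_features(points: Sequence[Point3D],
                  point_feats: Sequence[Sequence[float]],
                  image_feats: Sequence[Sequence[Sequence[float]]],
                  fx: float, fy: float, cx: float, cy: float) -> List[List[float]]:
    """Concatenate each visible point's feature with the feature of the
    pixel it projects to. image_feats is indexed as [row][col][channel].
    Points that do not land inside the image are skipped.
    """
    height, width = len(image_feats), len(image_feats[0])
    fused = []
    for p, pf in zip(points, point_feats):
        pix = project_point(p, fx, fy, cx, cy, width, height)
        if pix is None:
            continue
        col, row = pix
        fused.append(list(pf) + list(image_feats[row][col]))
    return fused
```

Here `project_point` corresponds to the pixel-point correspondence via intrinsics, while `backproject_pixel` corresponds to the depth-based route, where image features are placed in 3D space before being fused with point-wise features. A real pipeline would also apply the sensor-to-camera extrinsic transform before projecting.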