“…Additionally, the proximity of the referent to adjacent objects in the scene can lead to localization errors [2,1,4,40], and view-dependent descriptions can degrade the localization of referents described with spatial terms [15,36,40,14,13,39,12]. Localization errors also arise when a unique referent must be identified among multiple visually similar objects [2,21,15,36,40,14,13,39,12,19,1,4]. Our approach introduces a new task of 3D visual grounding in a human-in-the-loop scenario, where body gestures are integrated into the scene to mitigate localization errors resulting from sparse, noisy, and semantically limited point clouds, object proximity, difficulty in distinguishing a unique referent among visually similar objects, and view-dependent descriptions.…”