“…Existing deformable object manipulation approaches typically use a single modality (mostly vision) and either rely on finite-element or particle-based techniques [6,7,8,9,10,11,12,13] or leverage deep learning for visual affordance or latent dynamics learning [14,15,16,17,18,19]. The former methods typically rely on privileged knowledge (e.g., boundary conditions that are occluded or unknown) and stop at system identification, limiting their ability to refine the underlying physics model by learning from data.…”