“…As manual affordance annotations are often costly to acquire, much subsequent research has shifted toward weak supervision such as keypoints [16,53,54] or image-level labels [36,43]. Recent work has explored a new perspective, grounding affordances from human-object interaction images [29,36,64] or human action videos [9,19,31,43]. In robotics, affordance learning enables robots to interact effectively and intelligently with complex, dynamic environments [2,63].…”