“…Learning Actionable Visual Representations aims to learn visual representations that are strongly aware of downstream robotic manipulation tasks and directly indicative of action probabilities for robotic execution, in contrast to predicting standardized visual semantics, such as category labels [48,49], segmentation masks [50,22], and object poses [51,52], which are usually defined independently of any specific robotic manipulation task. Grasping [53,54,55,56,57,58,59,60,61] and manipulation affordances [62,13,63,64,14,65,66,67,15] constitute one major family of actionable visual representations, while many other types have also been explored recently (e.g., spatial maps [68,69], keypoints [70,71], and contact points [72]). Following the recent work Where2Act [15], we employ dense affordance maps as the actionable visual representation, indicating the action possibility at every point on a 3D articulated object; a minimal illustrative sketch is given below.…”
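To make the notion of a dense affordance map concrete, the sketch below shows the input/output structure of a per-point affordance predictor: given a point cloud of an articulated object, it returns one actionability score per point. This is an assumption-laden illustration, not the method described above; the class name `DenseAffordanceNet` is hypothetical, and the shared per-point MLP is only a placeholder for a proper point-cloud backbone (Where2Act itself builds on a PointNet++ segmentation network).

```python
# Minimal sketch of a dense affordance-map predictor over a point cloud.
# Illustrative only: the shared per-point MLP stands in for a real
# point-cloud backbone (e.g., a PointNet++ segmentation network); the goal
# is simply to show that the output is one actionability score per point.
import torch
import torch.nn as nn


class DenseAffordanceNet(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, in_dim: int = 3, hidden_dim: int = 128):
        super().__init__()
        # Shared MLP applied independently to every point.
        self.per_point_mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one actionability logit per point
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        """points: (B, N, 3) xyz coordinates of an articulated object.
        Returns:   (B, N) per-point actionability scores in [0, 1]."""
        logits = self.per_point_mlp(points).squeeze(-1)
        return torch.sigmoid(logits)


# Usage: score every point of a batched (partial) point cloud.
if __name__ == "__main__":
    net = DenseAffordanceNet()
    cloud = torch.rand(2, 2048, 3)        # two clouds of 2048 points each
    affordance_map = net(cloud)           # shape (2, 2048), values in [0, 1]
    best = affordance_map.argmax(dim=1)   # most actionable point per object
    print(affordance_map.shape, best)
```

In this reading, the "dense affordance map" is simply the per-point score tensor, which downstream modules can query to decide where to interact with the object.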