“…Researchers try to understand the 3D shapes, axes, movable parts and affordances from synthetic data [42,64,43,25,62,32,60], videos [47,21,20,44] or point clouds [26]. Our work is most closely related to [47,21,20], since they work on real images, but it differs from them in two respects. First, they need video or multi-view inputs, whereas our input is only a single image. Second, their approaches recover objects that are being interacted with, while our approach understands potential interactions before any interaction happens.…”