“…4) Transferability: Although third-person videos are more accessible, in settings such as robot manipulation, where the agent observes the environment and objects from a first-person perspective, knowledge must be transferred from third-person videos to first-person scenarios. However, transferring affordance grounding knowledge between perspectives [78] remains under-explored despite its practical importance.…”