“…An emerging thread of work aims to apply transformers to vision tasks such as object detection [5], semantic segmentation [115,99], 3D reconstruction [72], pose estimation [107], generative modeling [14], image retrieval [27], medical image segmentation [13,97,111], point clouds [40], video instance segmentation [103], object re-identification [47], video retrieval [33], video dialogue [64], video object detection [110] and multi-modal tasks [73,23,80,53,108]. A separate line of work attempts to model visual data with learnt discretized token sequences [104,83,14,109,18].…”