“…Thus, transformers are especially good at modelling long-range dependencies between elements of a sequence. Since then, there have been several attempts to adapt transformers to vision tasks, including object detection [2,56], image classification [8,41,52,46,14], segmentation [44], multiple object tracking [37,29], human pose estimation [50,55], point cloud processing [12,54], video processing [10,31,38], image super-resolution [30,49,3], and image synthesis [9]. An extensive review is beyond the scope of this paper.…”