“…Our encoder–decoder model is based on a vision transformer and generates heatmaps to locate the pitch keypoints. Previous sports field registration methods use models based on convolutions [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], which are limited by their receptive fields. The attention mechanisms of our vision transformer encoder [14] can capture characteristic pitch features globally in the frames.…”
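
The excerpt does not say how keypoint locations are read off the predicted heatmaps. A minimal sketch of one common approach is shown here, assuming one heatmap per pitch keypoint and simple peak picking with a confidence cutoff. The function name `decode_keypoints`, the `threshold` parameter, and the toy heatmaps are hypothetical and are not taken from the quoted paper.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]          # rows of per-pixel scores, indexed [y][x]
Keypoint = Tuple[int, int, float]    # (x, y, score)


def decode_keypoints(
    heatmaps: List[Heatmap], threshold: float = 0.5
) -> List[Optional[Keypoint]]:
    """Pick the highest-scoring pixel of each heatmap as the keypoint location.

    Assumption: a keypoint whose peak score is below `threshold` is treated
    as not visible in the frame and reported as None.
    """
    results: List[Optional[Keypoint]] = []
    for hm in heatmaps:
        best_score = float("-inf")
        best_xy = (0, 0)
        # Exhaustive scan for the maximum (an argmax over the 2D grid).
        for y, row in enumerate(hm):
            for x, value in enumerate(row):
                if value > best_score:
                    best_score = value
                    best_xy = (x, y)
        if best_score >= threshold:
            results.append((best_xy[0], best_xy[1], best_score))
        else:
            results.append(None)
    return results


if __name__ == "__main__":
    # Two toy heatmaps: the first has a clear peak, the second does not.
    toy = [
        [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1]],
        [[0.1, 0.1], [0.1, 0.2]],
    ]
    print(decode_keypoints(toy))
```

The detected pixel coordinates would then be matched to their known positions on the pitch model to estimate the registration; that step is outside this sketch.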