“…For example, the widely used RGB-based tracking datasets GOT-10k [13], TrackingNet [36], and LaSOT [9] contain 9.3K, 30.1K, and 1.1K training sequences, corresponding to 1.4M, 14M, and 2.8M frames. In contrast, the largest training datasets in multi-modal tracking, DepthTrack [47], LasHeR [25], and VisEvent [43], contain only 150, 979, and 500 training sequences, corresponding to 0.22M, 0.51M, and 0.21M annotated frame pairs, at least an order of magnitude fewer than the RGB-based datasets. To cope with this limitation, multi-modal tracking methods [43,47,61] typically start from pre-trained RGB-based trackers and fine-tune them on their task-oriented training sets (as shown in Figure 1 (a)→(b)).…”
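To make the pre-train-then-fine-tune paradigm described above concrete, the following is a minimal PyTorch sketch. It is an illustrative assumption, not the architecture of any cited tracker: the class names (`RGBTracker`, `MultiModalTracker`), the single-conv backbones, and the late concatenation fusion are all hypothetical stand-ins for a real RGB tracker whose backbone weights are reused while a new auxiliary-modality branch is trained on the far smaller multi-modal data.

```python
# Hypothetical sketch of the common multi-modal tracking recipe:
# reuse a pre-trained RGB backbone, add an auxiliary-modality branch,
# and fine-tune on (RGB, depth/thermal/event) frame pairs.
import torch
import torch.nn as nn


class RGBTracker(nn.Module):
    """Toy stand-in for an RGB tracker pre-trained on large RGB datasets."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, 4)  # predicts a box (x, y, w, h)

    def forward(self, rgb):
        return self.head(self.backbone(rgb))


class MultiModalTracker(nn.Module):
    """Reuses the pre-trained RGB backbone and adds an auxiliary branch;
    features from both modalities are fused before the box head."""

    def __init__(self, pretrained_rgb: RGBTracker):
        super().__init__()
        self.rgb_backbone = pretrained_rgb.backbone  # pre-trained weights reused
        self.aux_backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 4)

    def forward(self, rgb, aux):
        feat = torch.cat([self.rgb_backbone(rgb), self.aux_backbone(aux)], dim=1)
        return self.head(feat)


# Fine-tuning loop on annotated frame pairs (dummy tensors stand in for
# DepthTrack/LasHeR/VisEvent samples).
rgb_tracker = RGBTracker()  # in practice: load weights trained on GOT-10k etc.
model = MultiModalTracker(rgb_tracker)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.SmoothL1Loss()

for _ in range(2):
    rgb = torch.randn(4, 3, 128, 128)
    aux = torch.randn(4, 1, 128, 128)  # e.g. an aligned depth map
    target_box = torch.rand(4, 4)
    loss = criterion(model(rgb, aux), target_box)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch the entire model is fine-tuned end to end; in practice one might instead freeze the RGB backbone or use a reduced learning rate for it, precisely because the multi-modal training sets are an order of magnitude smaller.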