“…Afterwards, Transformer layers are trained for a downstream task on those features. With this approach, many works [52], [67], [78], [81], [86], [95], [101], [136], [141], [171] are still able to train the Transformer on small datasets (<10k training samples). Nevertheless, medium to large datasets remain common, as in [53], [54], [56], [57], [58], [59], [66], [93], [172].…”
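
To make the described setup concrete, the following is a minimal sketch of training Transformer layers on pre-extracted features for a downstream classification task. It assumes PyTorch and cached feature sequences of shape [batch, seq_len, feat_dim] (e.g., produced by a frozen backbone); all class names, dimensions, and hyperparameters here are illustrative, not taken from any of the cited works.

```python
import torch
import torch.nn as nn

class TransformerHead(nn.Module):
    """Trainable Transformer layers applied on top of frozen, pre-extracted features."""
    def __init__(self, feat_dim=512, d_model=256, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # map backbone features to model width
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.cls = nn.Linear(d_model, n_classes)            # downstream task head

    def forward(self, feats):                               # feats: [B, T, feat_dim]
        x = self.encoder(self.proj(feats))                  # only these layers are trained
        return self.cls(x.mean(dim=1))                      # pooled prediction

# One training step over a batch of cached features (dummy data for illustration).
model = TransformerHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
feats = torch.randn(32, 16, 512)                            # pre-extracted feature sequences
labels = torch.randint(0, 10, (32,))
opt.zero_grad()
loss = loss_fn(model(feats), labels)
loss.backward()
opt.step()
```

Because only the lightweight Transformer head is optimized while the feature extractor stays fixed, the number of trainable parameters is small, which is one reason such heads can be trained even on datasets with fewer than 10k samples.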