“…Nowadays, transformer architectures (e.g., [3, 11, 12, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110]) are regarded as the state of the art for the kind of deep learning considered here. Following the contrastive learning framework described in [61, 66], we add an extra autoencoder whose encoder serves as the projection head. The outputs of the transformer encoder, which we regard as the representations, are of a higher dimension than the projections on which the contrastive loss is computed.…”
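
The following is a minimal sketch, not the quoted authors' implementation, of the setup the excerpt describes: a transformer encoder produces high-dimensional representations, and the encoder half of a small autoencoder acts as the projection head that maps them into the lower-dimensional space where a contrastive loss is computed. The SimCLR-style NT-Xent loss, the mean pooling, the reconstruction term, and all layer sizes and names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionAutoencoder(nn.Module):
    """Autoencoder whose encoder doubles as the contrastive projection head (illustrative)."""
    def __init__(self, repr_dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(              # projection head: repr_dim -> proj_dim
            nn.Linear(repr_dim, repr_dim), nn.ReLU(),
            nn.Linear(repr_dim, proj_dim),
        )
        self.decoder = nn.Sequential(              # reconstructs the representation (assumed objective)
            nn.Linear(proj_dim, repr_dim), nn.ReLU(),
            nn.Linear(repr_dim, repr_dim),
        )

    def forward(self, h: torch.Tensor):
        z = self.encoder(h)                        # low-dimensional projection for the contrastive loss
        h_rec = self.decoder(z)                    # reconstruction of the representation
        return z, h_rec

class ContrastiveTransformer(nn.Module):
    """Transformer backbone plus the extra autoencoder described in the excerpt (sketch)."""
    def __init__(self, input_dim: int = 64, repr_dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(input_dim, repr_dim)
        layer = nn.TransformerEncoderLayer(d_model=repr_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.autoencoder = ProjectionAutoencoder(repr_dim, proj_dim)

    def forward(self, x: torch.Tensor):
        tokens = self.backbone(self.embed(x))      # (batch, seq, repr_dim)
        h = tokens.mean(dim=1)                     # pooled high-dimensional representation (pooling choice assumed)
        z, h_rec = self.autoencoder(h)
        return h, z, h_rec

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent contrastive loss over the projections of two views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature                  # cosine similarities scaled by temperature
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)           # each view's positive is its counterpart

# Usage: two augmented views of the same batch share the backbone and projection head.
model = ContrastiveTransformer()
x1, x2 = torch.randn(8, 16, 64), torch.randn(8, 16, 64)   # stand-in augmented views
h1, z1, h1_rec = model(x1)
h2, z2, h2_rec = model(x2)
loss = nt_xent(z1, z2) + F.mse_loss(h1_rec, h1) + F.mse_loss(h2_rec, h2)
loss.backward()
```

Whether the reconstruction term is actually part of the training objective, or the decoder serves another purpose, is not stated in the excerpt; it is included here only to show how the autoencoder structure could be used alongside the contrastive loss.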