Finally, to compare BERT with other models available in the HuggingFace Transformers library (RQ3), we experiment with two recent Transformer-based architectures: (1) DeBERTa [15], a model that improves BERT with a disentangled attention mechanism, in which each word is encoded using two vectors (a vector for content and a vector for position); (2) ALBERT [27], a model that improves BERT via separating the size of the hidden state of the vocabulary embedding from t…
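As a minimal sketch of how these backbones can be instantiated from the HuggingFace Transformers library, the snippet below builds BERT, DeBERTa, and ALBERT models for masked language modelling. The item-vocabulary size, sequence length, and all hyperparameter values are illustrative placeholders (they are not the configurations used in the paper), and the assumption that item IDs are treated as tokens reflects a typical sequential-recommendation setup rather than anything stated in this excerpt.

```python
# Illustrative sketch only: hyperparameters below are placeholders,
# not the configurations reported in the paper.
from transformers import (
    BertConfig, BertForMaskedLM,
    DebertaConfig, DebertaForMaskedLM,
    AlbertConfig, AlbertForMaskedLM,
)

NUM_ITEMS = 40_000   # hypothetical item-vocabulary size (items treated as tokens)
MAX_SEQ_LEN = 200    # hypothetical maximum interaction-sequence length

# Plain BERT backbone (baseline).
bert = BertForMaskedLM(BertConfig(
    vocab_size=NUM_ITEMS,
    hidden_size=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=MAX_SEQ_LEN,
))

# DeBERTa: disentangled attention, with separate content and position vectors
# combined through relative-position attention terms.
deberta = DebertaForMaskedLM(DebertaConfig(
    vocab_size=NUM_ITEMS,
    hidden_size=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=MAX_SEQ_LEN,
    relative_attention=True,       # enable the disentangled attention mechanism
    pos_att_type=["c2p", "p2c"],   # content-to-position and position-to-content terms
))

# ALBERT: factorised embeddings -- the vocabulary-embedding size is decoupled
# from the hidden size (ALBERT also shares parameters across layers).
albert = AlbertForMaskedLM(AlbertConfig(
    vocab_size=NUM_ITEMS,
    embedding_size=64,             # small embedding dimension...
    hidden_size=256,               # ...independent of the hidden dimension
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=MAX_SEQ_LEN,
))

# Compare parameter counts of the three backbones.
for name, model in [("BERT", bert), ("DeBERTa", deberta), ("ALBERT", albert)]:
    print(name, sum(p.numel() for p in model.parameters()), "parameters")
```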