“…Transformers have become the foundation for many advanced language models, such as BERT, ChatGPT [23], and T5, and have significantly advanced language understanding and generation systems. Vision transformers (ViTs) [24] adapt the classical transformer architecture by applying self-attention to image data [25], which makes them powerful models for computer vision tasks and shows that the effectiveness of transformers extends beyond NLP. Figure 1 shows the relationship between AI, ML, DL, and Transformers.…”
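
To make the idea of applying self-attention to image data concrete, the following is a minimal NumPy sketch, not the implementation from [24] or [25]. It assumes the common patch-based formulation: the image is cut into non-overlapping patches, each patch is linearly projected to a token, and single-head scaled dot-product self-attention is applied across the tokens. The function names (`image_to_patches`, `self_attention`), the image and patch sizes, and the randomly initialized weights are illustrative assumptions; positional embeddings, a class token, multiple heads, and training are omitted.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (num_patches, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                    # toy RGB image
patches = image_to_patches(image, patch_size=8)    # 16 patches of 8*8*3 values

d_model = 64
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
tokens = patches @ w_embed                         # one token per patch

# Random (untrained) projection matrices stand in for learned weights.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `attn` holds one patch's attention weights over all patches, so every output token is a weighted mixture of information from the whole image rather than from a local neighborhood only.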