…13,14,15 Recent studies use attention-based models to construct slide-level representations by aggregating weighted tile-level features,12,16,17,18,19 e.g., via multi-head attention (MHA), hierarchical attention, dual attention, or convolutional block attention modules.12,16,20,21,22,23,24 Other important approaches include multiscale attention and vision transformer models, which exploit correlations across tiles to improve slide-level representations.20,25,26,27,28 Fully CNN-based approaches have also been used for attention-like aggregation, e.g., encoding tiles with a CNN and then applying an additional deep CNN over the resulting feature map,29 possibly with multi-scale tiling.…
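
To make the weighted aggregation concrete, below is a minimal sketch of gated attention-based MIL pooling in PyTorch. The dimensions, module names, and the gated form are illustrative assumptions in the spirit of the attention-based aggregators cited above, not a reproduction of any single cited architecture.

```python
# Minimal sketch: gated attention pooling of tile features into one
# slide-level vector. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Aggregates N tile embeddings (N x dim) into one slide embedding (dim)."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)  # scalar attention score per tile

    def forward(self, tiles: torch.Tensor):
        # tiles: (num_tiles, dim) features from a pretrained tile encoder
        scores = self.attn_w(self.attn_v(tiles) * self.attn_u(tiles))  # (N, 1)
        weights = torch.softmax(scores, dim=0)        # normalize over tiles
        slide = (weights * tiles).sum(dim=0)          # weighted sum -> (dim,)
        return slide, weights.squeeze(-1)

# Usage: pool 1,000 tile features into a single slide-level vector.
tiles = torch.randn(1000, 512)
slide_vec, attn = AttentionMILPooling()(tiles)
print(slide_vec.shape, attn.shape)  # torch.Size([512]) torch.Size([1000])
```

The returned per-tile weights are what make such models interpretable: high-weight tiles indicate the regions driving the slide-level prediction.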
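
The transformer-style models differ in that tiles interact with each other through self-attention before pooling. The sketch below assumes a learnable [CLS] token read out as the slide representation; layer counts and sizes are illustrative, and positional encodings are omitted (tiles are treated as an unordered set here).

```python
# Minimal sketch: transformer encoder over tile features, modeling
# cross-tile correlations; the [CLS] output is the slide embedding.
import torch
import torch.nn as nn

class TileTransformer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, num_tiles, dim); every tile attends to every
        # other tile before the [CLS] token is read out.
        x = torch.cat([self.cls.expand(tiles.size(0), -1, -1), tiles], dim=1)
        return self.encoder(x)[:, 0]  # slide-level embedding, (batch, dim)

slide_vec = TileTransformer()(torch.randn(1, 1000, 512))
print(slide_vec.shape)  # torch.Size([1, 512])
```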
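
For the fully CNN-based variant, one way to realize "a CNN followed by an additional deep CNN" is to place tile embeddings back on the slide's 2-D grid and convolve over it; the grid shape and channel sizes below are illustrative assumptions.

```python
# Minimal sketch: tile embeddings arranged on the slide grid, then
# aggregated by a small CNN instead of an attention module.
import torch
import torch.nn as nn

class GridCNNAggregator(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial grid
        )

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (batch, dim, H, W) tile features at their slide positions
        return self.cnn(grid).flatten(1)  # (batch, 256)

slide_vec = GridCNNAggregator()(torch.randn(1, 512, 32, 32))
print(slide_vec.shape)  # torch.Size([1, 256])
```

Unlike the attention pooling above, this aggregator preserves the spatial arrangement of tiles, which is also what makes multi-scale tiling straightforward to incorporate as additional input channels or grids.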