Convolutional Neural Networks (CNN) have received a large share of research in mammography image analysis due to their capability of extracting hierarchical features directly from raw data. Recently, Vision Transformers are emerging as viable alternative to CNNs in medical imaging, in some cases performing on par or better than their convolutional counterparts. In this work, we conduct an extensive experimental study to compare the most recent CNN and Vision Transformer architectures for whole mammograms classification. We selected, trained and tested 33 different models, 19 convolutional- and 14 transformer-based, on the largest publicly available mammography image database OMI-DB. We also performed an analysis of the performance at eight different image resolutions and considering all the individual lesion categories in isolation (masses, calcifications, focal asymmetries, architectural distortions). Our findings confirm the potential of visual transformers, which performed on par with traditional CNNs like ResNet, but at the same time show a superiority of modern convolutional networks like EfficientNet.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.