Attention mechanisms have been explored with CNNs across the spatial and channel dimensions. However, existing methods devote their attention modules to capturing local interactions at a single scale. This paper tackles the following question: can one consolidate multi-scale aggregation while learning channel attention more efficiently? To this end, we apply channel-wise attention over multiple feature scales and show empirically that it can replace the more limited local, single-scale attention modules. Our proposed module, EMCA, is lightweight, efficiently models global context, integrates easily into any feed-forward CNN architecture, and is trained end-to-end. We validate our architecture through comprehensive experiments on image classification, object detection, and instance segmentation with different backbones. Our experiments show consistent performance gains over the baseline counterparts, and EMCA outperforms other channel-attention techniques in the accuracy/latency trade-off. More specifically, compared to SENet, we boost accuracy by 0.8%, 0.6%, and 1% on the ImageNet benchmark for ResNet-18, ResNet-34, and ResNet-50, respectively. For detection and instance segmentation we benchmark on MS-COCO, where our EMCA module boosts accuracy by 0.5% and 0.3%, respectively. We also conduct experiments that probe the robustness of the learned representations. Our code will be published once the paper is accepted.
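For readers unfamiliar with channel attention, the following is a minimal NumPy sketch of SE-style channel attention extended with multi-scale spatial pooling. All names, the pooling scales, and the descriptor-averaging scheme here are illustrative assumptions for exposition, not the paper's actual EMCA design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_scale_channel_attention(x, w1, w2, scales=(1, 2, 4)):
    """Illustrative SE-style channel attention with multi-scale pooling.

    x  : feature map of shape (C, H, W)
    w1 : weight matrix (C, C // r) for the squeeze (bottleneck) layer
    w2 : weight matrix (C // r, C) for the excitation layer
    NOTE: a hypothetical sketch, not the paper's EMCA module.
    """
    C, H, W = x.shape
    descriptors = []
    for s in scales:
        # Block-average-pool the map to an s x s grid ...
        cropped = x[:, : (H // s) * s, : (W // s) * s]
        grid = cropped.reshape(C, s, H // s, s, W // s).mean(axis=(2, 4))
        # ... then collapse the grid to one descriptor value per channel.
        descriptors.append(grid.mean(axis=(1, 2)))
    # Aggregate the per-scale channel descriptors (here: simple average).
    z = np.mean(descriptors, axis=0)                # shape (C,)
    # Two-layer bottleneck MLP followed by a sigmoid gate.
    gate = sigmoid(np.maximum(z @ w1, 0.0) @ w2)    # shape (C,)
    return x * gate[:, None, None]                  # re-weight each channel
```

With zero weights the gate is sigmoid(0) = 0.5 for every channel, so the output is simply the input halved; with learned weights, each channel is scaled by a data-dependent factor in (0, 1).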