2021
DOI: 10.48550/arxiv.2106.06899
Preprint
Memory-efficient Transformers via Top-$k$ Attention

Abstract: Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the qu…
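As a rough illustration of the idea sketched in the abstract, below is a minimal, hypothetical PyTorch sketch of top-k attention: for each query, only the k largest query-key scores are kept before the softmax and all other positions are masked out. The function name and the `topk` parameter are my own assumptions, and for clarity the sketch still materialises the full score matrix, which a memory-efficient implementation would avoid.

```python
import torch

def topk_attention(q, k, v, topk=64):
    # q, k, v: (batch, heads, seq_len, head_dim) -- illustrative shapes only
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (b, h, Lq, Lk)

    # Keep only the top-k scores per query row; everything else is set to
    # -inf so it receives zero weight after the softmax.
    kth = min(topk, scores.shape[-1])
    topk_vals, _ = scores.topk(kth, dim=-1)
    threshold = topk_vals[..., -1:]                          # k-th largest score
    scores = scores.masked_fill(scores < threshold, float("-inf"))

    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```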

Cited by 1 publication (2 citation statements)
References 18 publications (34 reference statements)
“…It achieves sparsification by zeroing out input entries smaller than average and provides a training-time modification strategy to enable gradient-based training. This is indeed similar to the broadly adopted top-k selection of SoftMax output, e.g., in attention layers of vision (Wang et al., 2022b; Zhao et al., 2019) and language (Gupta et al., 2021) transformers. In contrast, our MultiMax achieves sparsity and improved multi-modality at the same time without extra hyperparameters.…”
Section: Related Work (supporting)
confidence: 57%
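A hypothetical sketch of the "top-k selection of SoftMax output" pattern mentioned in the statement above, assuming the common formulation in which all but the k largest probabilities are zeroed out and the remainder renormalised; the function name and default `k` are illustrative, not taken from any of the cited works.

```python
import torch

def topk_softmax(logits, k=8, dim=-1):
    probs = torch.softmax(logits, dim=dim)
    kth = min(k, probs.shape[dim])
    # Scatter the k largest probabilities back into a zero tensor.
    topk_vals, topk_idx = probs.topk(kth, dim=dim)
    sparse = torch.zeros_like(probs).scatter(dim, topk_idx, topk_vals)
    # Renormalise so the surviving entries sum to one again.
    return sparse / sparse.sum(dim=dim, keepdim=True)
```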
“…To overcome the issue, previous works have proposed sparse SoftMax alternatives, which make it possible to completely ignore small entries below a threshold. These sparse SoftMax variants have been studied in diverse contexts, e.g., generative modeling (Chen et al., 2021), output activations of multiclass classifiers, and/or attention mechanisms (Peters et al., 2019; Martins & Astudillo, 2016; Gupta et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
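For context, here is a minimal forward-pass sketch of sparsemax (Martins & Astudillo, 2016), one of the sparse SoftMax alternatives cited in the statement above: it projects the logits onto the probability simplex, driving small entries exactly to zero. The implementation details are my own illustration, not code from the cited work.

```python
import torch

def sparsemax(z, dim=-1):
    # Sort logits in descending order along `dim`.
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.shape[dim] + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)                       # broadcastable position index 1..K
    z_cumsum = z_sorted.cumsum(dim)
    # An entry is in the support if 1 + k * z_(k) > cumulative sum up to k.
    support = 1 + k * z_sorted > z_cumsum
    k_z = support.sum(dim=dim, keepdim=True)
    # Threshold tau chosen so that the surviving entries sum to one.
    tau = (z_cumsum.gather(dim, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)
```

For example, `sparsemax(torch.tensor([2.0, 1.0, 0.1]))` returns `[1.0, 0.0, 0.0]`, whereas a standard softmax would assign nonzero probability to every entry.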