“…We evaluated our approach on a range of tasks: long-sequence ListOps (Nangia and Bowman, 2018), byte-level text classification (Maas et al., 2011), document retrieval on the ACL Anthology Network (Radev et al., 2013), and Pathfinder (Linsley et al., 2018). We compared our Kerformer model with Local Attention (Tay et al., 2020), Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), Longformer (Beltagy et al., 2020), Transformer (Vaswani et al., 2017), BigBird (Zaheer et al., 2020), and DCT-Former (Scribano et al., 2023); the results against these seven baselines are shown in Table 5. As shown in Table 5, Kerformer achieved the best performance on ListOps and Document Retrieval, competitive results on the other two tasks, and the second-best overall average accuracy across all tasks.…”