Transformers have revolutionized Natural Language Processing and Computer Vision, largely owing to their key innovation, the attention mechanism, which captures long-range dependencies. Despite the success of these models, their growing complexity demands ever more processing power, limiting their practical application. In recent years, tensor-decomposition-based parameter-efficient fine-tuning techniques have emerged as a promising way around this computational bottleneck. In this work, we investigate a modified version of Factor-Tuning that lessens the inter-layer associations created by the original Factor-Tuning and focuses exclusively on attention mechanisms; we refer to this method as Self-Attention Factor-Tuning. To evaluate the effectiveness of our approach, we conduct image-classification experiments with Vision Transformers on all 19 datasets of the VTAB-1k benchmark. The results demonstrate that the proposed framework effectively reduces the number of parameters required to fine-tune a transformer, achieving new state-of-the-art performance on three of the 19 datasets in the benchmark and outperforming the original Factor-Tuning paradigm as well as various other competitive techniques, whilst using significantly fewer parameters.
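As a rough illustration of the idea described above (a sketch, not the authors' implementation), the snippet below adds an independent low-rank factorized update to the frozen query, key, and value projections of a single ViT-style self-attention block, with no factors shared across layers. The rank, the scaling constant, and the names (`LowRankDelta`, `SAFTSelfAttention`, `rank`, `scale`) are assumptions made for illustration; the exact tensor factorization used in the paper may differ.

```python
# Hedged sketch of per-layer factorized tuning restricted to self-attention.
# Assumption: each frozen projection W receives a trainable rank-r update
# scale * (U @ V), i.e. the effective weight is W + scale * U @ V.
import math
import torch
import torch.nn as nn


class LowRankDelta(nn.Module):
    """Rank-r additive update: delta(x) = scale * x @ V^T @ U^T."""

    def __init__(self, dim: int, rank: int = 8, scale: float = 0.1):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(dim, rank))      # zero init: starts as a no-op
        self.V = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.V.t() @ self.U.t())


class SAFTSelfAttention(nn.Module):
    """Frozen self-attention block whose query/key/value projections each get
    an independent (per-layer, unshared) factorized update."""

    def __init__(self, dim: int, num_heads: int, rank: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q, self.k, self.v, self.o = (nn.Linear(dim, dim) for _ in range(4))
        for p in self.parameters():                        # freeze the pretrained weights
            p.requires_grad = False
        self.dq = LowRankDelta(dim, rank)                  # trainable factors, one set
        self.dk = LowRankDelta(dim, rank)                  # per projection and per layer
        self.dv = LowRankDelta(dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # frozen projection + trainable low-rank delta == (W + scale*U@V) x + b
        q = (self.q(x) + self.dq(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = (self.k(x) + self.dk(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = (self.v(x) + self.dv(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.o(out)


if __name__ == "__main__":
    layer = SAFTSelfAttention(dim=768, num_heads=12, rank=8)
    tokens = torch.randn(2, 197, 768)                      # e.g. a ViT-B/16 token sequence
    print(layer(tokens).shape)                             # torch.Size([2, 197, 768])
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print("trainable params in this layer:", trainable)    # only the factor matrices
```

In this sketch only the factor matrices are trainable, so the per-layer overhead is 2·r·d parameters per projection while the pretrained backbone stays frozen, which is where the parameter savings relative to full fine-tuning come from.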