Transformers have revolutionized Natural Language Processing and Computer Vision, largely owing to their key innovation, the attention mechanism, which captures long-range dependencies. Despite the success of these models, their growing complexity demands ever more processing power, limiting their practical application. In recent years, tensor-decomposition-based parameter-efficient fine-tuning techniques have emerged as a promising way around this computational bottleneck. In this work, we investigate a modified version of Factor-Tuning that lessens the inter-layer associations created by the original Factor-Tuning and focuses exclusively on attention mechanisms; we refer to this method as Self-Attention Factor-Tuning. To evaluate the effectiveness of our approach, we conduct image-classification experiments with Vision Transformers on all 19 datasets of the VTAB-1k benchmark. The results demonstrate that the proposed framework effectively reduces the number of parameters required to fine-tune a transformer, achieving new state-of-the-art performance on three of the 19 datasets in the benchmark and outperforming the original Factor-Tuning paradigm as well as various other competitive techniques, whilst using significantly fewer parameters.
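As a rough illustration of the idea described above (a sketch, not the authors' implementation), the snippet below adds an independent low-rank factorized update to the frozen query, key, and value projections of a single ViT-style self-attention block, with no factors shared across layers. The rank, the scaling constant, and the names (`LowRankDelta`, `SAFTSelfAttention`, `rank`, `scale`) are assumptions made for illustration; the exact tensor factorization used in the paper may differ.

```python
# Hedged sketch of per-layer factorized tuning restricted to self-attention.
# Assumption: each frozen projection W receives a trainable rank-r update
# scale * (U @ V), i.e. the effective weight is W + scale * U @ V.
import math
import torch
import torch.nn as nn


class LowRankDelta(nn.Module):
    """Rank-r additive update: delta(x) = scale * x @ V^T @ U^T."""

    def __init__(self, dim: int, rank: int = 8, scale: float = 0.1):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(dim, rank))      # zero init: starts as a no-op
        self.V = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.V.t() @ self.U.t())


class SAFTSelfAttention(nn.Module):
    """Frozen self-attention block whose query/key/value projections each get
    an independent (per-layer, unshared) factorized update."""

    def __init__(self, dim: int, num_heads: int, rank: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q, self.k, self.v, self.o = (nn.Linear(dim, dim) for _ in range(4))
        for p in self.parameters():                        # freeze the pretrained weights
            p.requires_grad = False
        self.dq = LowRankDelta(dim, rank)                  # trainable factors, one set
        self.dk = LowRankDelta(dim, rank)                  # per projection and per layer
        self.dv = LowRankDelta(dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # frozen projection + trainable low-rank delta == (W + scale*U@V) x + b
        q = (self.q(x) + self.dq(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = (self.k(x) + self.dk(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = (self.v(x) + self.dv(x)).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.o(out)


if __name__ == "__main__":
    layer = SAFTSelfAttention(dim=768, num_heads=12, rank=8)
    tokens = torch.randn(2, 197, 768)                      # e.g. a ViT-B/16 token sequence
    print(layer(tokens).shape)                             # torch.Size([2, 197, 768])
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print("trainable params in this layer:", trainable)    # only the factor matrices
```

In this sketch only the factor matrices are trainable, so the per-layer overhead is 2·r·d parameters per projection while the pretrained backbone stays frozen, which is where the parameter savings relative to full fine-tuning come from.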