Heart rate estimation from face videos is an emerging technology with numerous potential applications in healthcare and human–computer interaction. However, most existing approaches overlook long-range spatiotemporal dependencies, which are essential for robust heart rate prediction. Additionally, they rely on extensive pre-processing steps to improve prediction accuracy, resulting in high computational complexity. In this paper, we propose LGTransPPG, an end-to-end transformer-based framework that eliminates the need for pre-processing while achieving improved efficiency and accuracy. LGTransPPG incorporates local and global aggregation techniques to capture fine-grained facial features and contextual information. By leveraging the power of transformers, our framework effectively models long-range dependencies and temporal dynamics, enhancing heart rate prediction. We evaluate the proposed approach on three publicly available datasets, demonstrating its robustness and generalizability. Furthermore, it achieves a high Pearson correlation coefficient (PCC) of 0.88 between the predicted and ground-truth heart rate values, indicating superior accuracy and efficiency.