Biometric authentication is a widely used method for verifying individuals’ identities using photoplethysmography (PPG) cardiac signals. The PPG signal is a non-invasive optical technique that measures the heart rate, which can vary from person to person. However, these signals can also be changed due to factors like stress, physical activity, illness, or medication. Ensuring the system can accurately identify and authenticate the user despite these variations is a significant challenge. To address these issues, the PPG signals were preprocessed and transformed into a 2-D image that visually represents the time-varying frequency content of multiple PPG signals from the same human using the scalogram technique. Afterward, the features fusion approach is developed by combining features from the hybrid convolution vision transformer (CVT) and convolutional mixer (ConvMixer), known as the CVT-ConvMixer classifier, and employing attention mechanisms for the classification of human identity. This hybrid model has the potential to provide more accurate and reliable authentication results in real-world scenarios. The sensitivity (SE), specificity (SP), F1-score, and area under the receiver operating curve (AUC) metrics are utilized to assess the model’s performance in accurately distinguishing genuine individuals. The results of extensive experiments on the three PPG datasets were calculated, and the proposed method achieved ACCs of 95%, SEs of 97%, SPs of 95%, and an AUC of 0.96, which indicate the effectiveness of the CVT-ConvMixer system. These results suggest that the proposed method performs well in accurately classifying or identifying patterns within the PPG signals to perform continuous human authentication.