Multimodal emotion recognition has recently gained considerable attention, since it can leverage the diverse and complementary relationships across multiple modalities, such as audio, visual, and biosignals. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on complementary relationships to extract salient features across the A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages inter-modal relationships while reducing the heterogeneity between the features. In particular, it computes cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. By deploying a joint A-V feature representation in the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) for valence and 0.363 (0.584) for arousal on the test (validation) set. This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-wild (ABAW3) challenge, which obtains a CCC of 0.180 (0.310) for valence and 0.170 (0.170) for arousal.
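To make the fusion mechanism concrete, the following is a minimal PyTorch-style sketch of a joint cross-attention block in the spirit of the description above: each modality is correlated with the concatenated (joint) A-V representation, and the resulting attention maps re-weight the per-modality features before fusion and regression of valence/arousal. The module name, feature dimensions, softmax normalization, and residual connections are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Sketch of joint cross-attention A-V fusion (illustrative, not the official code)."""

    def __init__(self, d_a: int = 512, d_v: int = 512):
        super().__init__()
        d_j = d_a + d_v
        # Projections used to correlate each modality with the joint representation.
        self.w_ja = nn.Linear(d_j, d_a, bias=False)
        self.w_jv = nn.Linear(d_j, d_v, bias=False)
        # Map the attended features back to each modality's feature space.
        self.w_a = nn.Linear(d_a, d_a)
        self.w_v = nn.Linear(d_v, d_v)
        # Regression head predicting per-frame valence and arousal.
        self.head = nn.Linear(d_a + d_v, 2)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (B, T, d_a) audio features; x_v: (B, T, d_v) visual features.
        j = torch.cat([x_a, x_v], dim=-1)  # joint A-V representation, (B, T, d_a + d_v)

        # Cross-correlation of each modality with the joint representation, scaled by sqrt(d).
        c_a = torch.tanh(x_a @ self.w_ja(j).transpose(1, 2) / x_a.shape[-1] ** 0.5)  # (B, T, T)
        c_v = torch.tanh(x_v @ self.w_jv(j).transpose(1, 2) / x_v.shape[-1] ** 0.5)  # (B, T, T)

        # Cross-attention weights derived from the joint correlation maps.
        att_a = torch.softmax(c_a, dim=-1)
        att_v = torch.softmax(c_v, dim=-1)

        # Re-weight each modality by its joint attention, with a residual connection.
        x_a_att = self.w_a(att_a @ x_a) + x_a
        x_v_att = self.w_v(att_v @ x_v) + x_v

        fused = torch.cat([x_a_att, x_v_att], dim=-1)
        return self.head(fused)  # (B, T, 2): valence and arousal per frame


if __name__ == "__main__":
    # Example: fuse 8-frame clips of 512-D audio and visual features for a batch of 4.
    model = JointCrossAttentionFusion(d_a=512, d_v=512)
    preds = model(torch.randn(4, 8, 512), torch.randn(4, 8, 512))
    print(preds.shape)  # torch.Size([4, 8, 2])
```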