Stress is one of the most pressing concerns of modern life. High levels of stress can lead to various diseases and to a loss of focus and productivity at work. People under stress often fail to recognize their own stress levels, which makes early stress detection essential. Recently, multimodal fusion has improved the performance of stress detection models built on Deep Learning (DL) techniques. The low-, mid-, and high-level features of a Convolutional Neural Network (CNN) are each discriminative, and fusing all three levels yields a more comprehensive feature representation. This study exploits these properties to detect stress through multimodal hierarchical CNN feature fusion. Two physiological signals are used as modalities: Electrodermal Activity (EDA) and Electrocardiogram (ECG). For each modality, we build a hierarchical feature set by concatenating multi-level CNN features; the two hierarchical feature sets are then fused using the Multimodal Transfer Module (MMTM). Experiments are carried out on both raw frequency-domain data and frequency-band features to compare their effectiveness, and the model's performance is evaluated over different combinations of the low-, mid-, and high-level hierarchical features. To verify generalizability, the proposed approach is evaluated on four benchmark datasets: ASCERTAIN, CLAS, MAUS, and WAUC. On frequency-band features, the proposed method outperforms existing models by 1-2%, and the hierarchical feature set combining all three levels outperforms every other combination by 2-4%. These results suggest that hierarchical multimodal fusion is a useful addition to stress detection.
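To make the described architecture concrete, the sketch below is a minimal PyTorch illustration (not the authors' implementation) of the two building blocks named above: a per-modality hierarchical feature set formed by pooling and concatenating low-, mid-, and high-level 1-D CNN feature maps, and an MMTM-style fusion step that re-weights each modality's features from a jointly squeezed representation. All channel counts, kernel sizes, sequence lengths, and the binary classification head are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; sizes and the two-class head are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMTM(nn.Module):
    """Multimodal Transfer Module (Joze et al., 2020): squeeze both
    modality streams, share a joint bottleneck, re-excite each stream."""
    def __init__(self, dim_a, dim_b, ratio=4):
        super().__init__()
        dim = (dim_a + dim_b) // ratio
        self.fc_joint = nn.Linear(dim_a + dim_b, dim)
        self.fc_a = nn.Linear(dim, dim_a)
        self.fc_b = nn.Linear(dim, dim_b)

    def forward(self, a, b):                  # a: (N, Ca, T), b: (N, Cb, T)
        # Squeeze: global average pool over time, then concatenate modalities.
        squeeze = torch.cat([a.mean(dim=2), b.mean(dim=2)], dim=1)
        z = F.relu(self.fc_joint(squeeze))
        # Excitation: 2 * sigmoid gating, as in the original MMTM paper.
        gate_a = 2 * torch.sigmoid(self.fc_a(z)).unsqueeze(2)
        gate_b = 2 * torch.sigmoid(self.fc_b(z)).unsqueeze(2)
        return a * gate_a, b * gate_b

class HierarchicalCNN(nn.Module):
    """1-D CNN whose low-, mid-, and high-level feature maps are pooled
    to a shared length and concatenated channel-wise into one
    hierarchical feature set."""
    def __init__(self, in_ch=1, out_len=32):
        super().__init__()
        self.low = nn.Sequential(nn.Conv1d(in_ch, 16, 7, padding=3), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv1d(16, 32, 5, padding=2), nn.ReLU())
        self.high = nn.Sequential(nn.Conv1d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool1d(out_len)

    def forward(self, x):                     # x: (N, in_ch, T)
        l = self.low(x)
        m = self.mid(l)
        h = self.high(m)
        return torch.cat([self.pool(l), self.pool(m), self.pool(h)], dim=1)

# Hypothetical end-to-end wiring: one hierarchical CNN per modality,
# MMTM fusion, then a binary stress / no-stress classifier.
eda_net, ecg_net = HierarchicalCNN(), HierarchicalCNN()
mmtm = MMTM(dim_a=112, dim_b=112)             # 16 + 32 + 64 channels per stream
head = nn.Linear(2 * 112, 2)

eda, ecg = torch.randn(8, 1, 256), torch.randn(8, 1, 256)
fa, fb = mmtm(eda_net(eda), ecg_net(ecg))
logits = head(torch.cat([fa.mean(2), fb.mean(2)], dim=1))
```

The 2 * sigmoid excitation follows the gating of the original MMTM formulation; note that MMTM is typically inserted at several depths of the two CNN streams, whereas this sketch applies it once, to the concatenated hierarchical feature sets, for brevity.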