Rapid industrialization and technological breakthroughs in Industry 5.0 call for innovative engineering solutions that are efficient, fast, and simple to use. One area receiving attention is the rise in gas leakage accidents in coal mines, at chemical companies, and involving home appliances. To prevent harm to both the environment and human lives, rapid and automated detection and identification of the gas type is necessary. Most previous studies used a single data modality for detection, although multimodal sensor fusion offers more accurate results than a single source. Furthermore, most relied on a single feature extraction approach that captures either spatial or temporal information, but not both. This paper proposes a deep learning (DL)-based pipeline that combines multimodal data acquired via infrared (IR) thermal imaging and an array of seven metal oxide semiconductor (MOX) sensors forming an electronic nose (E-nose). The pipeline uses three convolutional neural network (CNN) models for feature extraction and a bidirectional long short-term memory (Bi-LSTM) network for gas detection. Two multimodal data fusion approaches are used: intermediate fusion and multitask fusion. In intermediate fusion, the discrete wavelet transform (DWT) combines the spatial features extracted from each CNN, providing a spectral–temporal representation. In multitask fusion, by contrast, the discrete cosine transform (DCT) merges all of the features obtained from the three CNNs trained with the multimodal data. The results show that the proposed fusion approaches boost gas detection performance, reaching accuracies of 98.47% and 99.25% for intermediate and multitask fusion, respectively. These results indicate that multitask fusion outperforms intermediate fusion. The proposed system can therefore detect gas leakage accurately and could be used in industrial applications.
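To make the two transform-based fusion ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each of the three CNNs yields a 1-D feature vector, that the vectors are concatenated before the transform, that a one-level Haar wavelet stands in for the DWT, and that only the approximation coefficients (DWT) or the first `keep` coefficients (DCT) are retained. The function names `haar_dwt`, `dct2_ortho`, `fuse_dwt`, and `fuse_dct`, the vector sizes, and the random placeholder features are all hypothetical.

```python
import numpy as np


def haar_dwt(x):
    """One-level Haar DWT of a 1-D vector; returns (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if x.size % 2:
        # Assumption: pad odd-length input by repeating the last sample.
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail


def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D vector, computed from its definition."""
    x = np.asarray(x, dtype=float)
    n = x.size
    k = np.arange(n)[:, None]  # output (frequency) index
    m = np.arange(n)[None, :]  # input (sample) index
    basis = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)


def fuse_dwt(features):
    """Hypothetical DWT fusion: concatenate, keep approximation coefficients."""
    approx, _ = haar_dwt(np.concatenate(features))
    return approx


def fuse_dct(features, keep):
    """Hypothetical DCT fusion: concatenate, keep the first `keep` coefficients."""
    return dct2_ortho(np.concatenate(features))[:keep]


# Placeholder features standing in for the outputs of three CNNs.
rng = np.random.default_rng(0)
cnn_features = [rng.standard_normal(8) for _ in range(3)]

dwt_fused = fuse_dwt(cnn_features)           # half the concatenated length
dct_fused = fuse_dct(cnn_features, keep=10)  # truncated coefficient vector
```

In both cases the transform shortens the concatenated feature vector before it would be passed to a downstream classifier such as a Bi-LSTM. The wavelet family, the decomposition level, and the number of retained coefficients are design choices that this sketch fixes arbitrarily.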