Higher-level autonomous driving demands the reliable execution of critical maneuvers under all conditions. Many of the accidents in recent years involving autonomous vehicles (AVs) launched by leading automobile manufacturers stem from inadequate decision-making rooted in poor perception of environmental information. In today's technology-driven scenarios, AVs use a versatile suite of sensors to collect environmental information. Owing to technical faults and adverse natural conditions, the information these sensors acquire may be incomplete or unclear, so an AV may misinterpret it in the wrong context, leading to inadequate decisions and fatal accidents. Overcoming this drawback makes effective preprocessing of the raw sensory data mandatory. Preprocessing sensory data involves two vital tasks: data cleaning and data fusion. Because raw sensory data is complex and exhibits multimodal characteristics, preprocessing receives particular emphasis. As many innovative models have already been proposed for data cleaning, this study focuses on data fusion. In particular, we propose a generic data fusion engine that classifies sensory data by format and fuses each format accordingly to improve accuracy; the framework covers text, image, and audio data. In the first stage of this research, an innovative hybrid model is proposed to fuse multispectral image and video data, along with simple and efficient models to extract salient image features. The proposed models are evaluated against existing popular models using standard metrics, and the proposed image fusion model performs better than the others.
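As a rough illustration of the engine's routing idea, the sketch below classifies incoming sensory batches by modality and dispatches each to a matching fusion routine. This is a minimal sketch under loose assumptions: the names (FusionEngine, fuse_images, and so on) are hypothetical, and the averaging-based routines are simple stand-ins, not the hybrid fusion models proposed in this study.

```python
# Minimal, illustrative sketch of a modality-routing data fusion engine.
# All names and fusion routines here are placeholders, not the paper's
# actual architecture.
from typing import Callable, Dict, List

import numpy as np


def fuse_images(frames: List[np.ndarray]) -> np.ndarray:
    """Pixel-level fusion by simple averaging (a stand-in for a
    hybrid multispectral image/video fusion model)."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0)


def fuse_text(snippets: List[str]) -> str:
    """Naive text fusion: concatenate de-duplicated snippets."""
    return " ".join(dict.fromkeys(snippets))


def fuse_audio(clips: List[np.ndarray]) -> np.ndarray:
    """Naive audio fusion: sample-wise average of equal-length clips."""
    return np.mean(np.stack(clips, axis=0), axis=0)


class FusionEngine:
    """Classifies sensory inputs by modality and dispatches each batch
    to the matching fusion routine."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable] = {
            "image": fuse_images,
            "text": fuse_text,
            "audio": fuse_audio,
        }

    def fuse(self, modality: str, batch: list):
        if modality not in self._routes:
            raise ValueError(f"unsupported modality: {modality!r}")
        return self._routes[modality](batch)


if __name__ == "__main__":
    engine = FusionEngine()
    # Two simulated 4x4 grayscale frames from different spectral bands.
    frames = [np.random.rand(4, 4) for _ in range(2)]
    fused = engine.fuse("image", frames)
    print("fused image shape:", fused.shape)
```

In a complete pipeline, the averaging stub in fuse_images would be replaced by the proposed hybrid multispectral/video fusion model, and the dispatcher would sit downstream of the data-cleaning stage.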