During the past decade, social media platforms have been extensively used during a disaster for information dissemination by the affected community and humanitarian agencies. Although many studies have been done recently to classify the informative and non-informative messages from social media posts, most are unimodal, i.e., have independently used textual or visual data to build the deep learning models. In the present study, we integrate the complementary information provided by the text and image messages about the same event posted by the affected community on the social media platform Twitter and build a multimodal deep learning model based on the concept of attention mechanism. The attention mechanism is a recent breakthrough that has revolutionized the field of deep learning. Just as humans pay more attention to a specific part of the text or image, ignoring the rest, neural networks can also be trained to concentrate on more relevant features through the attention mechanism. We propose a novel Cross-Attention Multi-Modal (CAMM) deep neural network for classifying multimodal disaster data, which uses the attention mask of the textual modality to highlight the features of the visual modality. We compare CAMM with unimodal models and the most popular bilinear multimodal models, MUTAN and BLOCK, generally used for visual question answering. CAMM achieves an average F1-score of 84.08%, better than the MUTAN and BLOCK methods by 6.31% and 5.91%, respectively. The proposed crossattention-based multimodal deep learning method outperforms the current state-of-the-art fusion methods on the benchmark multimodal disaster dataset by highlighting the more relevant cross-domain features of text and image tweets. This study affirms that social media platforms become a rich source of multimodal data during a disaster. This data can be utilized to build automated tools for quick filtration of informative messages to assess the post-disaster needs of the affected community and provide timely help.