Social media plays an increasingly prominent role in crisis response and recovery, driven by rapid advances in information and communication technologies. This study presents an approach for extracting valuable information from the large volume of user-generated content on social media, focusing on tweets that can aid emergency response and recovery efforts. Identifying informative tweets gives emergency personnel a more comprehensive understanding of a crisis situation and thereby supports the deployment of more effective recovery strategies. Previous studies have largely focused on either the textual content of tweets or their accompanying visual elements. However, evidence suggests that text and visuals are complementary, which offers an opportunity to combine insights from the two. In response, a novel deep learning framework is proposed that jointly analyses the textual and visual components of user-generated tweets. The architecture integrates established methods: RoBERTa for text analysis, Vision Transformer for image understanding, Bi-LSTM for sequence processing, and an attention mechanism for context awareness. The novelty of the approach lies in its emphasis on multimodal fusion, introducing rank fusion techniques to combine the strengths of the textual and visual inputs. The proposed method is evaluated on seven diverse datasets covering natural disasters such as wildfires, hurricanes, earthquakes, and floods. The experimental results show that the proposed system outperforms several existing methods, with accuracy ranging from 94% to 98%. These findings underscore the effectiveness of the proposed deep learning classifier in exploiting interactions across modalities.
In summary, this study contributes to disaster management by promoting a comprehensive approach that exploits the potential of multimodal data, thereby enhancing decision-making processes in emergency scenarios.
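As a purely illustrative aside, the idea of fusing two modalities at the rank level can be sketched in a few lines. The abstract does not specify which rank fusion technique is used, so the sketch below assumes reciprocal rank fusion, a generic method that may differ from the one in this study. The function name, the constant `k`, the tweet identifiers, and all scores are hypothetical and are not taken from the study's data or results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(score_lists, k=60):
    """Fuse several {item_id: score} mappings into a single ranking.

    Assumption: reciprocal rank fusion stands in for the unspecified
    "rank fusion" step; k=60 is a conventional default, not a value
    reported by the study.
    """
    fused = defaultdict(float)
    for scores in score_lists:
        # Rank items by descending score; the best item gets rank 1.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, item in enumerate(ranked, start=1):
            fused[item] += 1.0 / (k + rank)
    # Items that rank highly in several modalities accumulate the largest sums.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical per-modality "informativeness" scores for four tweets.
text_scores = {"tweet_a": 0.91, "tweet_b": 0.40, "tweet_c": 0.75, "tweet_d": 0.10}
image_scores = {"tweet_a": 0.60, "tweet_b": 0.20, "tweet_c": 0.85, "tweet_d": 0.70}

fused_ranking = reciprocal_rank_fusion([text_scores, image_scores])
print(fused_ranking)
```

In this toy example a tweet that ranks near the top in both modalities ends up ahead of one that ranks first in only a single modality, which is the complementary behaviour that multimodal fusion aims for.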