Real-time detection of global events, particularly catastrophic ones, has benefited significantly from the ubiquitous adoption of social media platforms and advancements in image classification and natural language processing. Social media is a rich repository of multimedia content during disasters, encompassing reports on casualties, infrastructure damage, and information about missing individuals. While previous research has predominantly concentrated on textual or image analysis, the proposed study presents a multimodal middle fusion paradigm that combines cross-modal attention and self-attention to improve learning from both image and text modalities. Through rigorous experimentation, we validate the effectiveness of the proposed middle fusion paradigm in leveraging complementary information from textual and visual sources. The proposed intermediate design outperforms current late and early fusion architectures, achieving accuracies of 91.53% and 91.07% in the informativeness and disaster type recognition categories, respectively. This study is among the few that examine all three tasks in the CrisisMMD dataset by combining textual and image analysis, demonstrating an improvement of approximately 2% in prediction accuracy over similar studies on the same dataset. Additionally, ablation studies indicate that the multimodal model outperforms the best-selected unimodal classifiers, with a 3-5% increase in prediction accuracy across tasks. Thus, the method aims to bolster emergency response capabilities by offering more precise insights into evolving events.
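To make the middle fusion idea concrete, the following is a minimal sketch of a fusion block that applies cross-modal attention between pre-extracted text and image feature sequences and then self-attention over the fused sequence. The module names, dimensions, and pooling choices here are illustrative assumptions, not the authors' exact implementation.

# Illustrative sketch only: a generic middle-fusion block with cross-modal
# attention and self-attention, assuming pre-extracted text and image feature
# sequences of a shared dimension. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class MiddleFusionBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_classes=2):
        super().__init__()
        # Cross-modal attention: text attends to image features and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over the concatenated, cross-attended sequence.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_len, dim); image_feats: (batch, img_len, dim)
        t_attended, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i_attended, _ = self.image_to_text(image_feats, text_feats, text_feats)
        fused = torch.cat([t_attended, i_attended], dim=1)
        fused, _ = self.self_attn(fused, fused, fused)
        # Mean-pool the fused sequence and classify (e.g., informative vs. not).
        return self.classifier(fused.mean(dim=1))

# Example: batch of 4 samples with 32 text tokens and 49 image patches.
block = MiddleFusionBlock()
logits = block(torch.randn(4, 32, 512), torch.randn(4, 49, 512))

The key design point of middle (intermediate) fusion is that the modalities interact at the feature level, rather than being merged only at the input (early fusion) or at the prediction stage (late fusion).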