Social media users internalise information in a multimodal context. Social media is a primary information source for disaster situational awareness: its texts, photographs, videos, and other multimodal content are widely used in emergency management. Applying ensemble learning to social media sentiment analysis has attracted considerable scholarly attention, but research on rescue and its sub-domains, which are highly complex, remains limited. This study proposes a multimodal information categorisation model based on hierarchical feature extraction. Information from multiple modalities is first mapped to a unified text vector space, so that semantic content can be modelled at both the sentence level and the multimodal information level. Multiple deep learning (DL) models are then applied to model the semantic content at these two levels. Specifically, the study presents a BiLSTM-Attention-CNN-XGBOOST ensemble neural network model for capturing a broad range of multimodal information characteristics. In the empirical evaluation, the method extracted multimodal information features with an accuracy exceeding 85% on the Chinese-language dataset and 95% on the English-language dataset.
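
The two-level feature extraction described above (token vectors pooled into sentence vectors, then sentence vectors pooled into one representation of the whole multimodal post) can be illustrated with a minimal Python sketch. It is not the BiLSTM-Attention-CNN-XGBOOST model itself: plain dot-product attention pooling stands in for the learned encoders, and the sketch assumes that every modality has already been converted into token vectors in the shared text vector space. The names `attention_pool`, `encode_post`, and `query` are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: List[Vector], query: Vector) -> Vector:
    """Weighted average of `vectors`; weights are the softmax of each
    vector's dot product with `query` (a stand-in for learned attention)."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def encode_post(sentences: List[List[Vector]], query: Vector) -> Vector:
    """Hierarchical pooling: tokens -> sentence vectors -> one post vector.

    Assumption: each inner list holds the token vectors of one 'sentence',
    which may come from the post text or from text derived from another
    modality (for example an image description).
    """
    sentence_vectors = [attention_pool(tokens, query) for tokens in sentences]
    return attention_pool(sentence_vectors, query)


# Toy post: one two-token sentence and one single-token sentence.
post = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
# A zero query gives uniform attention weights at both levels.
print(encode_post(post, [0.0, 0.0]))  # prints [0.75, 0.75]
```

In a full pipeline, the resulting post vector would be passed to a downstream classifier; in the model named above that role is played by the XGBOOST stage, which this sketch does not reproduce.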