Event detection plays an important role in modern society and is a widely used computational process for detecting events automatically. Big data is particularly useful for event detection because of the large volume of data available. Multimodal event detection identifies events from heterogeneous types of data. This work aims to classify diverse events using an optimized ensemble learning approach. Multimodal event data, comprising text, image, and audio, are sent from the cloud or server to the user devices, where three models are generated to process the text, image, and audio. First, the text, image, and audio data are processed separately. Text model generation involves pre-processing by imputation of missing values and data normalization, textual feature extraction using an integrated N-gram approach, and generation of the text model using a convolutional two-directional LSTM (2DCon_LSTM). Image model generation involves pre-processing using Min-Max Gaussian filtering (MMGF), image feature extraction using the VGG-16 network model, and generation of the image model using a tweaked autoencoder (TAE). Audio model generation involves pre-processing using the discrete wavelet transform (DWT), audio feature extraction using the Hilbert-Huang transform (HHT), and generation of the audio model using an attention-based convolutional capsule network (Attn_CCNet). The features obtained from the text, image, and audio models are fused using a feature ensemble approach. From the fused feature vector, the optimal features are trained through the improved battle royal optimization (IBRO) algorithm. A deep learning model called the convolutional duo gated recurrent unit with autoencoder (C-Duo GRU_AE) is used as the classifier. Finally, the different types of events are classified, and the global model is then sent to the user devices with high security, supporting a better decision-making process.
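As a rough illustration of the text pre-processing, N-gram extraction, and feature fusion steps described above, the following minimal Python sketch may help. It is not the authors' implementation: mean imputation, the combination of unigrams and bigrams as the "integrated" N-gram features, and fusion by simple concatenation are all assumptions, and every function name is hypothetical.

```python
from collections import Counter


def impute_missing(values):
    """Replace None entries with the mean of the observed values.

    Mean imputation is an assumption; the abstract does not name the strategy.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ngram_counts(text, n_values=(1, 2)):
    """Count word N-grams for each N in n_values (unigrams and bigrams by default).

    Combining several N values is an assumed reading of "integrated N-gram".
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def fuse_features(text_vec, image_vec, audio_vec):
    """Fuse per-modality feature vectors by concatenation (an assumed fusion rule)."""
    return list(text_vec) + list(image_vec) + list(audio_vec)
```

In such a sketch, the fused vector returned by `fuse_features` would be the input to feature optimization and classification; those stages (IBRO and C-Duo GRU_AE) are not reproduced here.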
The proposed methodology achieves an accuracy of 99.93%, an F1-score of 99.91%, a precision of 99.93%, a recall of 99.93%, a processing time of 17 seconds, and a training time of 0.05 seconds. It outperforms several comparable methodologies in precision, recall, accuracy, F1-score, training time, and processing time. This indicates that the proposed methodology performs better than the compared schemes. In addition, the proposed scheme detects multimodal events accurately.
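For reference, the four classification metrics reported above are conventionally derived from confusion-matrix counts. The sketch below shows the standard binary definitions. The function name is hypothetical, and how the scores are averaged across multiple event classes is not stated in the abstract, so that step is left out.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary confusion counts.

    tp, fp, fn, tn are the true-positive, false-positive, false-negative,
    and true-negative counts. Undefined ratios are returned as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a multi-class event classifier, these per-class values would typically be combined by macro or weighted averaging, which is an assumption about the evaluation protocol rather than something the text specifies.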