Multimodal sentiment analysis has recently become a heavily researched topic, owing to the vast amount of multimodal content now available. It typically combines textual, audio, and visual representations for effective sentiment recognition. Detecting sentiment in natural language is difficult even for humans, so automating it is more challenging still. In this article, the input multimodal data are collected from the Surrey Audio-Visual Expressed Emotion (SAVEE) and YouTube datasets, and hybrid feature extraction is then performed to obtain feature vectors from the textual, audio, and visual modalities. Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) are applied to extract features from the textual modality, while AlexNet, spectral centroid, spectral flux, and short-term energy features are employed for the visual and audio modalities. The extracted feature values are fed to a Long Short-Term Memory (LSTM) network for sentiment classification, with its hyper-parameters selected by a Modified Ant Lion Optimizer (MALO). The MALO algorithm introduces two new processes for choosing optimal LSTM hyper-parameters, which improves the training and testing mechanism and reduces computational complexity. The MALO-LSTM model achieved 98.62% and 98.81% accuracy on the YouTube and SAVEE datasets, respectively, outperforming several existing methods.
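As a minimal illustrative sketch (not the authors' implementation), the snippet below shows how the audio-branch features named above, spectral centroid, spectral flux, and short-term energy, could be framed per time step and passed to a small Keras LSTM classifier. The frame sizes, the `units` and `dropout` values, and the function names are hypothetical placeholders; in the proposed approach such hyper-parameters would instead be selected by MALO, which is not shown here.

```python
# Illustrative sketch only: assumed frame sizes and fixed LSTM hyper-parameters
# stand in for the MALO-selected values described in the paper.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def audio_features(path, frame_length=2048, hop_length=512):
    """Per-frame spectral centroid, spectral flux, and short-term energy."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y, n_fft=frame_length, hop_length=hop_length))
    centroid = librosa.feature.spectral_centroid(S=stft, sr=sr)[0]
    # Spectral flux: frame-to-frame change in the magnitude spectrum.
    flux = np.sqrt(np.sum(np.diff(stft, axis=1) ** 2, axis=0))
    flux = np.concatenate([[0.0], flux])
    # Short-term energy: sum of squared samples within each frame.
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    energy = np.sum(frames ** 2, axis=0)
    n = min(len(centroid), len(flux), len(energy))
    return np.stack([centroid[:n], flux[:n], energy[:n]], axis=1)  # (timesteps, 3)

def build_lstm(timesteps, n_features, n_classes, units=64, dropout=0.3):
    """Small LSTM classifier over the per-frame feature sequence."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.LSTM(units, dropout=dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In practice the textual (LDA/LSA) and visual (AlexNet) feature vectors would be fused with these audio features before classification; the sketch covers only the audio modality to keep the example self-contained.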