As a consequence of technical advances, people increasingly express their views and opinions on social media through visual images with text captions rather than plain text alone. With the advent of visual media such as images, videos, and animated GIFs (Graphics Interchange Format), sentiment analysis research has expanded to study social interaction and opinion prediction from visual content. Individual studies have produced important advances in both text sentiment analysis and image sentiment analysis, but the combination of image sentiment analysis with text captions remains underexplored and warrants further research. This study proposes a deep learning-based intermodal (DLBI) analysis technique that models the link between words and images in a variety of scenarios. Opinion information is extracted in numerical vector form using the VGG network and is then transformed through a mapping procedure. Future opinions are predicted from the resulting information vectors using active deep learning. A series of simulation tests evaluates the proposed method. The results show that the model outperforms the alternative models, delivering higher accuracy and precision with lower latency and a lower error rate.
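The pipeline described above (image features from a VGG-style encoder, caption features, a mapping into a joint vector space, and a sentiment score on the fused vectors) could be prototyped roughly as follows. This is a minimal illustrative sketch under stated assumptions, not the paper's implementation: the feature extractors are stubbed with random projections standing in for VGG and caption embeddings, and all names (`map_to_joint_space`, `sentiment_score`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the feature extractors. In the described
# pipeline, image vectors would come from a pretrained VGG network and
# text vectors from a caption embedding; fixed random projections act
# as placeholders here.
IMG_DIM, TXT_DIM, JOINT_DIM = 512, 128, 64
W_img = rng.normal(size=(IMG_DIM, JOINT_DIM)) / np.sqrt(IMG_DIM)
W_txt = rng.normal(size=(TXT_DIM, JOINT_DIM)) / np.sqrt(TXT_DIM)

def map_to_joint_space(img_vec, txt_vec):
    """Mapping step: project both modalities into one joint vector and
    fuse them by averaging (an assumption; the fusion rule is not
    specified in the abstract)."""
    return 0.5 * (img_vec @ W_img + txt_vec @ W_txt)

def sentiment_score(joint_vec, w, b=0.0):
    """Logistic score in [0, 1] on the fused joint vector."""
    z = joint_vec @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Toy example: one image feature vector and one caption vector.
img_vec = rng.normal(size=IMG_DIM)   # placeholder for VGG features
txt_vec = rng.normal(size=TXT_DIM)   # placeholder for caption embedding
w = rng.normal(size=JOINT_DIM)       # classifier weights (untrained)

joint = map_to_joint_space(img_vec, txt_vec)
score = sentiment_score(joint, w)
print(joint.shape, float(score))
```

In a full system, the random projections would be replaced by learned encoders, and the classifier weights would be trained (in the paper's framing, via active deep learning over the collected opinion vectors).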