Existing methods for monitoring internet public opinion rely primarily on periodic crawling of the textual content of web pages; they cannot quickly and accurately acquire and recognize the text embedded in images and videos, nor discriminate its sentiment. These limitations make multimodal information detection a challenging research problem in internet public opinion scenarios. This paper addresses the dynamic monitoring of the opinion information, chiefly images and videos, posted on different websites. Building on recent advances in text recognition, we propose a new method of visual multimodal text recognition and sentiment analysis (MTR-SAM) for internet public opinion analysis scenarios. First, in the detection module, an LK-PAN network with large receptive fields is proposed to enhance the CML distillation strategy, and an RSE-FPN with a residual attention mechanism is used to improve feature map representation. Second, the original CTC decoder is replaced with a GTC method to resolve earlier failures in recognizing text at arbitrary rotation angles, and a sinusoidal loss function for rotation recognition further improves scene text detection at arbitrary rotation angles. Finally, an improved sentiment analysis model predicts the sentiment polarity of the text recognition results. The experimental results show that the proposed method improves recognition speed by 31.77% and recognition accuracy by 10.78% on the video dataset, and raises the F1 score of the multimodal sentiment analysis model by 4.42% on the self-built internet public opinion dataset (lab dataset). The proposed method provides significant technical support for internet public opinion analysis in multimodal domains.
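The abstract compresses the architectural details of the detection module. To illustrate the residual attention idea behind an RSE-FPN, the following is a minimal sketch of a residual squeeze-and-excitation layer of the kind typically inserted into FPN branches; the class name, kernel size, and reduction ratio are illustrative assumptions (the paper's framework and exact design are not specified in the abstract), written in PyTorch for concreteness.

```python
import torch
import torch.nn as nn


class RSELayer(nn.Module):
    """Hypothetical residual squeeze-and-excitation block for an FPN branch.

    A conv is followed by channel attention, with a residual shortcut so the
    attention reweights features without being able to zero them out entirely.
    Kernel size and reduction ratio are assumptions, not the paper's values.
    """

    def __init__(self, in_channels: int, out_channels: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, padding=1, bias=False)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                              # squeeze: one value per channel
            nn.Conv2d(out_channels, out_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, 1),
            nn.Sigmoid(),                                         # excitation: weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        # Residual attention: keep the original response and add the reweighted one.
        return y + y * self.se(y)
```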
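For the recognition module, GTC ("guided training of CTC") generally keeps a cheap CTC branch for inference while an auxiliary attention-style decoder shares the visual features during training and guides the encoder through its loss. The sketch below shows this combined training objective under stated assumptions; the branch weights and loss shapes are hypothetical, as the abstract does not give them.

```python
import torch
import torch.nn as nn


class GTCLoss(nn.Module):
    """Hypothetical GTC training objective: weighted sum of a CTC loss and a
    per-step cross-entropy loss from a guiding attention decoder. At inference
    only the CTC branch would run. Weights are illustrative assumptions."""

    def __init__(self, ctc_weight: float = 1.0, guide_weight: float = 1.0, blank: int = 0):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)
        self.ctc_weight = ctc_weight
        self.guide_weight = guide_weight

    def forward(self, ctc_logits, ctc_targets, input_lens, target_lens,
                attn_logits, attn_targets):
        # CTC branch: ctc_logits is (T, N, C) over characters plus the blank symbol.
        l_ctc = self.ctc(ctc_logits.log_softmax(dim=-1),
                         ctc_targets, input_lens, target_lens)
        # Guiding branch: attn_logits is (N, T, C), attn_targets is (N, T).
        l_guide = self.ce(attn_logits.flatten(0, 1), attn_targets.flatten())
        return self.ctc_weight * l_ctc + self.guide_weight * l_guide
```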
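The abstract does not spell out the sinusoidal rotation loss; one common sine-based formulation for angle regression, shown here purely as a hypothetical sketch, penalizes the sine of the angular error so the loss is smooth, periodic, and insensitive to the 180° ambiguity of an undirected text line.

```python
import torch


def rotation_sin_loss(theta_pred: torch.Tensor, theta_gt: torch.Tensor) -> torch.Tensor:
    """Hypothetical sine-based angle loss for rotated text (angles in radians).

    |sin(dtheta)| vanishes when the prediction matches the target or differs
    by pi, which suits text lines whose orientation is equivalent under a
    180-degree flip, and it avoids the wrap-around discontinuity of |dtheta|.
    """
    return torch.abs(torch.sin(theta_pred - theta_gt)).mean()
```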