Semantic analysis, a crucial aspect of natural language processing, faces numerous practical challenges due to the limitations of current technology. This paper therefore enhances traditional semantic analysis by developing a frame recognition model that integrates syntactic and semantic roles, a text semantic feature extraction model, and an audio/video information extraction model equipped with an inter-modal cross-attention mechanism. These components are then combined into an improved semantic analysis model based on deep neural networks. The paper evaluates the model's improvements on semantic role labeling, text classification, and information extraction. The model achieves F1 scores of 90.4% on the Wall Street Journal test set and 81.4% on the Brown test set, the highest semantic role labeling accuracy among the compared models. On the three text classification datasets, its HL, P, R, and F1 values are likewise the best among all models, giving it the strongest text classification performance. The model detects theme subtitles with 95.3% accuracy, and its recognition accuracy against simple and complex backgrounds is 95.7% and 94.1%, respectively. After error correction was applied to the model's information extraction output, ASR recognition accuracy increased by 18.55%.
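As a rough illustration of the inter-modal cross-attention idea mentioned above, the following NumPy sketch computes scaled dot-product attention in which queries come from one modality (e.g. text) and keys/values from another (e.g. audio). All array names, dimensions, and the use of single-head attention are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: queries are projected from one
    modality's features, keys/values from the other's."""
    Q = q_feats @ Wq          # (Tq, d) query projections (e.g. text steps)
    K = kv_feats @ Wk         # (Tk, d) key projections (e.g. audio steps)
    V = kv_feats @ Wv         # (Tk, d) value projections
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (Tq, Tk) cross-modal affinities
    weights = softmax(scores, axis=-1)   # each query attends over the other modality
    return weights @ V                   # (Tq, d) fused, audio-informed features

# toy example: 4 text steps (dim 16), 6 audio steps (dim 12), model dim 8
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 16))
audio = rng.standard_normal((6, 12))
Wq = rng.standard_normal((16, 8))
Wk = rng.standard_normal((12, 8))
Wv = rng.standard_normal((12, 8))
fused = cross_attention(text, audio, Wq, Wk, Wv)
print(fused.shape)  # → (4, 8)
```

In a full model, such fused features would typically pass through residual connections and feed-forward layers before being merged with the text-only representation; the sketch only shows the attention step itself.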