Speech recognition has made breakthrough progress and is now widely used, and its development continues to raise new requirements. First, acoustic parameters are tied to the natural attributes of individual speakers; second, computing acoustic parameters depends on large-scale corpus resources; and third, language recognition, speaker recognition, speech visualization, and automatic speech annotation all need further research. English contains 48 phonemes, and recognizing them correctly is an important basis for analyzing the acoustic characteristics of continuous intonation. In this paper, a convolutional neural network is first used to extract visual features at different scales, and these multi-scale image features are fused so that the fused feature vector retains more detailed image information, which alleviates the problem of image information loss. An intonation acoustic feature recognition model based on the attention mechanism is then constructed; it combines early and late fusion of features and improves the effectiveness of information fusion. Experimental results show that the training error of the model decreases gradually as the number of iterations increases and stabilizes after 1000 iterations, indicating that the model has essentially converged and is reliable and feasible. In the phoneme recognition experiment, for both sentences with more phonemes and sentences with fewer phonemes, the recognition rate of the model exceeds 60%, the loss rate is below 5%, and about 60 phonemes can be recognized per minute. The proposed model therefore improves English intonation acoustic feature recognition to a certain extent.
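
As an illustration of the attention-based fusion of multi-scale features described above, the following is a minimal sketch and not the model used in this paper. It assumes that each scale has already been reduced to a feature vector of the same length, and it weights those vectors with a softmax over dot-product scores against a query vector. The function names `softmax` and `attention_fuse`, the query vector, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse equal-length feature vectors (one per scale) by attention.

    Each vector is scored by its dot product with `query`; the fused
    vector is the softmax-weighted sum of the input vectors.
    """
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(query)
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


# Hypothetical feature vectors from three scales (coarse, medium, fine).
coarse = [0.2, 0.1, 0.0]
medium = [0.5, 0.4, 0.1]
fine = [0.9, 0.3, 0.7]
query = [1.0, 0.0, 1.0]  # assumed query; in a trained model this would be learned

fused, weights = attention_fuse([coarse, medium, fine], query)
print("attention weights:", weights)
print("fused feature vector:", fused)
```

In this sketch the scale whose features align best with the query receives the largest weight, so the fused vector keeps more of that scale's detail while still drawing on the others.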