Speech recognition is one of the most important research directions in human-computer interaction. It is key to connecting humans and machines and an expression of intelligence and automation in the information society. Taking English as the research object, this paper applies speech recognition techniques built on a deep-learning-based hidden Markov model and a clustering analysis algorithm, and evaluates them on a cross-language English phoneme recognition system that uses the sparse autoencoder (SA) method. By studying speech recognition algorithms for English translation, it confirms that the recognition environment affects recognition accuracy, which provides a direction for studying speech recognition at a deeper level. Language models based on the Transformer and on Seq2Seq are configured with different vocabularies; the data are collected in the laboratory and outdoors, respectively, and a posttest template library is formed after collection. In the task of restoring phonetic symbols to English characters, the error rate is lowest when phonemes are the modeling units: the error rate on the test set reached 9.54%, which was 6.97 percentage points lower than that of the syllable modeling unit.
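The abstract reports an error rate for restoring phonetic symbols to English characters but does not state how that rate is computed. As a minimal sketch, assuming the common edit-distance definition (substitutions, insertions, and deletions divided by the total reference length), it could be calculated as follows. The function names `edit_distance` and `error_rate` and the sample strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edits divided by total reference length (assumed definition)."""
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("references must contain at least one symbol")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_len


# Illustrative character-level example: one substitution over 11 characters.
refs = ["speech", "model"]
hyps = ["speach", "model"]
print(f"error rate: {error_rate(refs, hyps):.2%}")
```

Under this definition, comparing two modeling units in percentage points means taking the absolute difference between their two rates, not a relative change.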