The information used for human natural language comprehension is usually perceptual information, such as text, sounds, and images. In recent years, language models that learn semantics from a single source of perceptual information (text) have gradually developed into multimodal language models that learn semantics from multiple sources. Sound is a form of perceptual information beyond text whose effectiveness has been demonstrated by many related works. However, how to incorporate such perceptual information still requires further research. This paper therefore proposes a language model that trains on dual perceptual information synchronously to enhance word representation. The representation is trained in a synchronized manner that adopts an attention model to exploit both textual and phonetic perceptual information in unsupervised learning tasks. On this basis, the two kinds of perceptual information are processed simultaneously, which resembles the cognitive process of human language understanding. Experimental results show that our approach achieves superior results on text classification and word similarity tasks across data sets in four languages.

INDEX TERMS Information representation, multi-layer neural network, natural language processing, unsupervised learning.
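The abstract describes attention-based fusion of textual and phonetic information but gives no implementation details. The following is a minimal, hypothetical sketch of one such cross-attention fusion in PyTorch; all names (DualPerceptualFusion, d_model, n_heads) and shapes are assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of attention-based fusion of
# textual and phonetic embeddings, assuming both are pre-computed and
# projected to a shared dimension d_model.
import torch
import torch.nn as nn

class DualPerceptualFusion(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Cross-attention: text tokens query the phonetic features.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_emb: torch.Tensor, phon_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, seq_len, d_model)   token embeddings
        # phon_emb: (batch, phon_len, d_model)  phoneme embeddings
        fused, _ = self.attn(query=text_emb, key=phon_emb, value=phon_emb)
        # Residual connection preserves the original textual signal.
        return self.norm(text_emb + fused)

# Toy usage: one sentence of 6 tokens aligned against 9 phoneme frames.
model = DualPerceptualFusion()
text = torch.randn(1, 6, 128)
phon = torch.randn(1, 9, 128)
print(model(text, phon).shape)  # torch.Size([1, 6, 128])
```

In this sketch the two modalities are processed in a single forward pass, loosely mirroring the "synchronized" training the abstract describes; the fused representations could then feed an unsupervised objective such as masked-token prediction.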