Optical music recognition plays an important role in the development of digital music. In recent years, the convolutional recurrent neural network framework with connectionist temporal classification has been applied to music recognition. However, its loss function is computed serially, which makes training inefficient and convergence difficult. In addition, because gradients vanish over excessively long music sequences, existing music recognition models struggle to learn the relationships between musical symbols, which results in a high sequence error rate. To address these problems, we propose a sequence-to-sequence framework based on the transformer with a masked language model. The self-attention module in the transformer captures richer contextual representations between musical symbols, which reduces the sequence error rate. Furthermore, drawing on the masked language model, we design a mask matrix that allows every musical symbol to be predicted in parallel, which speeds up training. Experiments on the printed images of music staves dataset show that the proposed method trains efficiently and substantially improves the sequence accuracy rate.
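
To illustrate how a mask matrix can let a model predict every symbol of a sequence in one parallel pass, the following is a minimal sketch in plain Python. It assumes a standard lower-triangular (causal) mask added to self-attention scores; the abstract does not specify the exact mask design used in the proposed method, so the function names `causal_mask` and `masked_attention_weights` and the toy scores are hypothetical and serve only as an illustration.

```python
import math

def causal_mask(n):
    """Build an n x n additive mask (an assumed, standard causal design).

    Entry (i, j) is 0.0 when position i may attend to position j (j <= i)
    and -inf otherwise, so future symbols are hidden from each position.
    """
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, mask):
    """Add the mask to the raw attention scores and normalize each row.

    All rows are processed from the same score matrix, so every output
    position is handled in one pass rather than step by step.
    """
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

# Toy example: four symbols with uniform raw scores (hypothetical values).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores, causal_mask(4))
```

In this sketch each row of `weights` places zero weight on later positions, so the prediction for symbol i depends only on symbols up to i, while all rows are still computed together from a single score matrix.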