Continuous sign language recognition (CSLR) is a challenging task for intelligent systems, since it requires producing real-time responses while performing computationally intensive video analytics and language modeling. Previous studies mainly adopt hidden Markov models or recurrent neural networks, whose limited capability to model a specific sign language causes accuracy to drop significantly when recognizing signs performed by different signers with non-standard gestures or at non-uniform speeds. In this work, we develop a deep learning framework named SignBERT, which integrates bidirectional encoder representations from transformers (BERT) with a residual neural network (ResNet) to model the underlying sign language and extract spatial features for CSLR. We further propose a multimodal version of SignBERT, which takes hand images as an additional input and performs an intelligent feature alignment that minimizes the distance between the probability distributions of the recognition results generated by the BERT model and by the hand images. Experimental results indicate that, compared to alternative approaches for CSLR, our method achieves higher accuracy, with a significantly lower word error rate, on three challenging continuous sign language datasets.

INDEX TERMS Bidirectional encoder representations from transformers, continuous sign language recognition, deep learning, video analytics

I. INTRODUCTION