Sign Language Production (SLP) aims to automatically translate a spoken language description into its corresponding sign language video. The core procedure of SLP is to transform sign gloss intermediaries into sign pose sequences (G2P). Most existing methods for G2P are based on sequential autoregression or sequence-to-sequence encoder-decoder learning. However, by generating target pose frames conditioned on the previously generated ones, these models are prone to issues such as error accumulation and high inference latency. In this paper, we argue that such issues are mainly caused by adopting an autoregressive manner. Hence, we propose a novel Non-AuToregressive (NAT) model with a parallel decoding scheme, as well as an External Aligner for sequence alignment learning. Specifically, we extract alignments from the external aligner by monotonic alignment search for gloss duration prediction; the predicted durations are then used by a length regulator to expand the source gloss sequence to match the length of the target sign pose sequence for parallel sign pose generation. Furthermore, we devise a spatial-temporal graph convolutional pose generator in the NAT model to generate smoother and more natural sign pose sequences. Extensive experiments conducted on the PHOENIX14T dataset show that our proposed model outperforms state-of-the-art autoregressive models in terms of both speed and quality.
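The length-regulation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and the use of NumPy are assumptions, and in the actual model the durations come from a learned duration predictor trained on alignments extracted by monotonic alignment search.

```python
# Hypothetical sketch of a length regulator (names and shapes are illustrative).
import numpy as np

def length_regulator(gloss_feats: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand gloss-level features [G, D] to frame-level features
    [sum(durations), D] by repeating each gloss feature for its predicted
    duration, so the pose decoder can generate all frames in parallel."""
    return np.repeat(gloss_feats, durations, axis=0)

# Example: 3 glosses with predicted durations 2, 4, and 1 frames.
feats = np.arange(6, dtype=float).reshape(3, 2)   # [G=3, D=2]
durations = np.array([2, 4, 1])
expanded = length_regulator(feats, durations)
print(expanded.shape)  # (7, 2): one feature vector per target pose frame
```

Because the expanded sequence already has the target length, the pose generator can decode every frame at once instead of conditioning each frame on the previous one.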
CCS CONCEPTS
• Computing methodologies → Motion capture; Activity recognition and understanding.