Towards Fast and High-Quality Sign Language Production

Huang, Wen‐Can; Pan, Wenwen; Zhao, Zhou

doi:10.1145/3474085.3475463

Cited by 16 publications

(19 citation statements)

References 41 publications

“…It is worth noting that, as stated by (Huang et al, 2021), the proposed decoding mechanism provides weak supervisions with the initial ground-truth frame and guided counter sequences during the inference time.…”

Section: Progressive Transformer Baseline (mentioning)

confidence: 99%

Modeling Intensification for Sign Language Generation: A Computational Approach

İnan, Zhong, Hassan, et al. 2022

Preprint

End-to-end sign language generation models do not accurately represent the prosody in sign language. A lack of temporal and spatial variations leads to poor-quality generated presentations that confuse human interpreters. In this paper, we aim to improve the prosody in generated sign languages by modeling intensification in a data-driven manner. We present different strategies grounded in linguistics of sign language that inform how intensity modifiers can be represented in gloss annotations. To employ our strategies, we first annotate a subset of the benchmark PHOENIX-14T, a German Sign Language dataset, with different levels of intensification. We then use a supervised intensity tagger to extend the annotated dataset and obtain labels for the remaining portion of it. This enhanced dataset is then used to train state-of-the-art transformer models for sign language generation. We find that our efforts in intensification modeling yield better results when evaluated with automatic metrics. Human evaluation also indicates a higher preference of the videos generated using our model.

“…Further, Saunders et al [6] proposed a spatial-temporal skeletal graph attention layer that embeds a hierarchical body inductive bias into the self-attention mechanism. Huang et al [4] developed spatial-temporal graph convolution layers into the pose generator which is able to capture both intraframe and inter-frame information of sign language videos. However, all these methods disregard each joint has different contributions to gestures expression.…”

Section: B Sign Language Production (mentioning)

confidence: 99%

“…Recently, Transformer-based methods [1], [2], [3], [4], [5] became the most widespread methods to produce skeletons for SLP. However, there is still a problem in these works: such architecture always ignores the structural relationships of the human skeletons, by which poor performance would be obtained.…”

mentioning

confidence: 99%

“…However, there is still a problem in these works: such architecture always ignores the structural relationships of the human skeletons, by which poor performance would be obtained. Thereupon, the existing SLP method [4] devises a spatial-temporal graph convolution (GCN) as pose generator which implemented from a standard 2D convolution. Skeletal graph self-attention [6] encodes the spatio-temporal connectivity into the node features while calculating attention matrices.…”

mentioning

confidence: 99%

Spatial–Temporal Graph Transformer With Sign Mesh Regression for Skinned-Based Sign Language Production

Cui, Chen, et al. 2022

IEEE Access

Sign language production aims to automatically generate coordinated sign language videos from spoken language. As a typical sequence-to-sequence task, existing methods mostly treat the skeletons as a whole sequence and do not take the rich graph information among joints and edges into consideration. In this paper, we propose a novel method named Spatial-Temporal Graph Transformer (STGT) to deal with this problem. Specifically, according to kinesiology, we first design a novel graph representation to obtain graph features from skeletons. Then the spatial-temporal graph self-attention utilizes graph topology to capture the intra-frame and inter-frame correlations, respectively. Our key innovation is that the attention maps are calculated on the spatial and temporal dimensions in turn, while graph convolution is used to strengthen the short-term features of the skeletal structure. Finally, because the generated skeletons have so far taken the form of skeleton points and lines, we design a sign mesh regression module to render the skeletons into skinned animations including body and hand posture, in order to visualize the generated sign language videos. Compared with state-of-the-art baselines on RWTH-PHOENIX Weather-2014T in the Experiment section, STGT obtains the highest BLEU and ROUGE values, which indicates that our method produces the most accurate and intuitive sign language videos.

“…Recently, there have been many deep learning approaches to SLP proposed [23,42,48,50,52,56,63,71], with Saunders et al achieving state-of-the-art results with gloss supervision [52]. These works predominantly represent sign languages as sequences of skeletal frames, with each frame encoded as a vector of joint coordinates [51] that disregards any spatio-temporal structure available within a skeletal representation.…”

Section: Related Work (mentioning)

confidence: 99%

Skeletal Graph Self-Attention: Embedding a Skeleton Inductive Bias into Sign Language Production

Saunders, Camgöz, Bowden, 2021

Preprint

Recent approaches to Sign Language Production (SLP) have adopted spoken language Neural Machine Translation (NMT) architectures, applied without sign-specific modifications. In addition, these works represent sign language as a sequence of skeleton pose vectors, projected to an abstract representation with no inherent skeletal structure. In this paper, we represent sign language sequences as a skeletal graph structure, with joints as nodes and both spatial and temporal connections as edges. To operate on this graphical structure, we propose Skeletal Graph Self-Attention (SGSA), a novel graphical attention layer that embeds a skeleton inductive bias into the SLP model. Retaining the skeletal feature representation throughout, we directly apply a spatio-temporal adjacency matrix into the self-attention formulation. This provides structure and context to each skeletal joint that is not possible when using a non-graphical abstract representation, enabling fluid and expressive sign language production. We evaluate our Skeletal Graph Self-Attention architecture on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, achieving state-of-the-art back translation performance with an 8% and 7% improvement over competing methods for the dev and test sets.
