2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01658

Read and Attend: Temporal Localisation in Sign Language Videos

Abstract: The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the a…
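As a reading aid only, the sketch below illustrates the general mechanism the abstract describes: a Transformer encodes a stream of video features, written tokens cross-attend over the encoded frames, and the attention weights indicate where each token is grounded in time. This is not the authors' architecture; the layer counts, dimensions, vocabulary size, and the single cross-attention layer are assumptions for illustration (PyTorch, recent enough to support average_attn_weights).

```python
# Minimal sketch (not the authors' model): a Transformer reads video features,
# emits written tokens, and its cross-attention weights over input frames give
# a temporal localisation for each emitted token. All sizes are illustrative.
import torch
import torch.nn as nn

class SignLocaliser(nn.Module):
    def __init__(self, feat_dim=1024, d_model=256, vocab_size=1000, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # video features -> model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # written-token queries
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, token_ids):
        # video_feats: (B, T, feat_dim); token_ids: (B, L)
        memory = self.encoder(self.proj(video_feats))
        queries = self.tok_emb(token_ids)
        ctx, attn = self.cross_attn(queries, memory, memory,
                                    need_weights=True, average_attn_weights=True)
        logits = self.out(ctx)                              # token predictions
        locations = attn.argmax(dim=-1)                     # attn: (B, L, T)
        return logits, locations                            # one frame index per token

model = SignLocaliser()
feats = torch.randn(1, 64, 1024)        # 64 time steps of pre-extracted features
tokens = torch.randint(0, 1000, (1, 5)) # 5 subtitle tokens
logits, where = model(feats, tokens)
print(where.shape)                      # torch.Size([1, 5])
```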

Cited by 23 publications (30 citation statements)
References 43 publications

“…In this section, we investigate the application of our method for spotting mouthed words in sign language videos. This is an important application of visual KWS, as it has enabled an entire line of work on sign language recognition [5,44,62]. Data description & evaluation protocol.…”
Section: Mouthing Spotting in Sign Language Videos (mentioning)
confidence: 99%
“…Several studies have sought to employ subtitles as weak supervision for learning to localise and classify signs, using a priori mining [17] and multiple-instance learning [6,7,46]. More recent work has leveraged cues such as mouthings [2] and visual dictionaries [42], and has made use of deep neural network features with sliding-window classifiers [37] and attention learned via a proxy translation task [56]. In contrast to these works, our objective is to localise complete subtitle units rather than individual signs.…”
Section: Related Work (mentioning)
confidence: 99%
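To make the sliding-window approach mentioned in the excerpt above concrete, the toy sketch below slides a window classifier over per-frame features and keeps per-class score peaks as candidate sign locations. It illustrates the general technique only, not the code of [37]; the window length, stride, threshold, and the random stand-in classifier are all assumptions.

```python
# Toy sliding-window sign spotting: score each window with a classifier and
# keep confident per-class peaks as (class, frame index, score) detections.
import numpy as np

def sliding_window_spotting(frame_feats, classify, win=16, stride=4, thresh=0.3):
    """frame_feats: (F, D) per-frame features; classify: window -> (C,) class probs."""
    scores, centres = [], []
    for start in range(0, len(frame_feats) - win + 1, stride):
        scores.append(classify(frame_feats[start:start + win]))
        centres.append(start + win // 2)
    scores = np.stack(scores)                     # (num_windows, C)
    detections = []
    for c in range(scores.shape[1]):
        best = scores[:, c].argmax()
        if scores[best, c] > thresh:              # keep only confident peaks
            detections.append((c, centres[best], float(scores[best, c])))
    return detections

# Toy usage: a random "classifier" stands in for a trained window model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 1024))
fake_classifier = lambda w: rng.dirichlet(np.ones(10))
print(sliding_window_spotting(feats, fake_classifier)[:3])
```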
“…Video features. The visual features are 1024-dimensional embeddings extracted from the I3D [13] sign classification model made publicly available by the authors of [56]. The features are pre-extracted over sign language video segments.…”
Section: Subtitle Aligner Transformer (mentioning)
confidence: 99%
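As a rough illustration of how such pre-extracted segment features could be produced, the sketch below cuts a frame tensor into short overlapping clips and maps each clip to a 1024-dimensional embedding. The `extract_fn` here is only a placeholder for the publicly released I3D model referred to in the excerpt, and the clip length and stride are assumed values, not the ones used by the authors.

```python
# Sketch: turn a video into a (T, 1024) matrix of per-segment features by
# running a feature extractor over short overlapping clips.
import torch

def video_to_segment_features(frames, extract_fn, clip_len=16, stride=4):
    """frames: (F, 3, H, W) tensor -> (T, 1024) feature matrix."""
    feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, stride):
        clip = frames[start:start + clip_len]       # (clip_len, 3, H, W)
        with torch.no_grad():
            feats.append(extract_fn(clip))          # (1024,) embedding
    return torch.stack(feats)

# Toy usage: a random embedding stands in for the real I3D backbone.
frames = torch.rand(80, 3, 224, 224)
fake_i3d = lambda clip: torch.randn(1024)
print(video_to_segment_features(frames, fake_i3d).shape)  # torch.Size([17, 1024])
```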