Read and Attend: Temporal Localisation in Sign Language Videos

Varol, Gül; Momeni, Liliane; Albanie, Samuel; Afouras, Triantafyllos; Zisserman, Andrew

doi:10.1109/cvpr46437.2021.01658

Cited by 23 publications

(30 citation statements)

References 43 publications

“…In this section, we investigate the application of our method for spotting mouthed words in sign language videos. This is an important application of visual KWS, as it has enabled an entire line of work on sign language recognition [5,44,62]. Data description & evaluation protocol.…”

Section: Mouthing Spotting In Sign Language Videos (mentioning)

confidence: 99%

Visual Keyword Spotting with Attention

Prajwal, Momeni, Afouras, et al. 2021. Preprint. Self-citation.

In this paper, we consider the task of spotting spoken keywords in silent video sequences, also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.

“…Several studies have sought to employ subtitles as weak supervision for learning to localise and classify signs, using apriori mining [17] and multiple-instance learning [6,7,46]. More recent work has leveraged cues such as mouthings [2] and visual dictionaries [42] and by making use of deep neural network features with sliding window classifiers [37] and attention learned via a proxy translation task [56]. In deviation from these works, our objective is to localise complete subtitle units, rather than individual signs.…”

Section: Related Work (mentioning)

confidence: 99%

“…Video features. The visual features are 1024-dimensional embeddings extracted from the I3D [13] sign classification model made publicly available by the authors of [56]. The features are pre-extracted over sign language video segments.…”

Section: Subtitle Aligner Transformer (mentioning)

confidence: 99%

“…Backbone pretraining. The I3D model is pretrained to perform 1064-way classification across the sign spotting instances with mouthings [2] and dictionary exemplars [42] (further details can be found in [56]). The model is then frozen and used to densely pre-extract visual features with stride 1 over the clips of the datasets.…”

Section: Implementation Details (mentioning)

confidence: 99%
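
The dense pre-extraction described in the statement above (a frozen backbone applied with stride 1 over each clip) can be illustrated with a minimal sketch. This is not the citing authors' code: the 16-frame window length, the function names, and `dummy_embed` (a placeholder standing in for the frozen I3D model) are all assumptions made for illustration. Only the stride of 1 and the 1024-dimensional output come from the quoted text.

```python
import numpy as np

FEATURE_DIM = 1024  # embedding size stated in the quoted text


def sliding_windows(num_frames, window=16, stride=1):
    """Return (start, end) frame indices of every full window.

    window=16 is an assumed value; the quoted text only gives the stride.
    """
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def dummy_embed(clip):
    """Hypothetical stand-in for a frozen backbone such as I3D.

    Averages the clip over time and tiles the result to FEATURE_DIM values.
    """
    return np.resize(clip.mean(axis=0).ravel(), FEATURE_DIM)


def extract_dense_features(frames, embed_fn, window=16, stride=1):
    """Apply embed_fn to every window and stack the resulting embeddings."""
    spans = sliding_windows(len(frames), window, stride)
    if not spans:
        raise ValueError("video is shorter than one window")
    return np.stack([embed_fn(frames[s:e]) for s, e in spans])


if __name__ == "__main__":
    # 20 synthetic frames of 8x8 RGB; real input would be decoded video.
    video = np.zeros((20, 8, 8, 3), dtype=np.float32)
    features = extract_dense_features(video, dummy_embed)
    print(features.shape)
```

With stride 1, a video of T frames and a window of W frames yields T - W + 1 feature vectors, so consecutive features overlap in all but one frame.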

“…Previous work exploiting such weakly-aligned data has mainly focused on finding sparse correspondences between keywords in the subtitle and individual signs [2,42,56], as opposed to localising the start and end times of a complete subtitle text in continuous signing. Though, as we show, localising isolated signs identified by keyword spotting nevertheless forms a useful pretraining task for full subtitle alignment.…”

Section: Introduction (mentioning)

confidence: 99%
