Aligning Subtitles in Sign Language Videos

Bull, Hannah; Afouras, Triantafyllos; Varol, Gül; Albanie, Samuel; Momeni, Liliane; Zisserman, Andrew

doi:10.1109/iccv48922.2021.01135

Cited by 15 publications

(11 citation statements)

References 42 publications

“…With the advancement of computer vision techniques, there is increasing attention on collecting real-life SLT datasets. Many such datasets (Camgoz et al, 2018, 2021; Albanie et al, 2021) are drawn from TV programs accompanied by sign language interpretation. Despite being highly realistic compared to studio datasets, they are generally limited to a specific domain.…”

Section: Datasets for SLT (mentioning)

confidence: 99%

“…For example, the popular Phoenix-2014T DGS-German benchmark contains signed German weather forecasts and includes only 11 hours of signing videos from 9 signers. The largest real-world sign language corpus we are aware of is BOBSL (Albanie et al, 2021), which consists of 1,467 hours of BBC broadcasts from 39 signers interpreted into British Sign Language (BSL). However, access to the videos is restricted, and the data cannot be used by independent researchers or commercial organizations.…”

Section: Datasets for SLT (mentioning)

confidence: 99%

“…Two key components of our proposed approach are searching for coarticulated signs from video-sentence pairs and fusing multiple local visual features. There has been significant amount of prior work (Buehler et al, 2009; Albanie et al, 2020; Varol et al, 2021; Momeni et al, 2020; Shi et al, 2022a) devoted to spotting signs in real-world sign language videos. In contrast to this prior work where sign search is the end goal, here we treat sign spotting as a pretext task in the context of SLT.…”

Section: Other Related Work (mentioning)

confidence: 99%

“…One feature of our data is the use of subtitles associated with the video as the English translation, thus saving effort on human annotation. Subtitled videos have also been employed in prior work (Camgoz et al, 2021; Albanie et al, 2021) for constructing sign language datasets. As prior work has mostly focused on interpreted signing videos where content originally in the spoken language is interpreted into sign language, the subtitles are naturally aligned to the audio instead of the signing stream.…”

Section: The OpenASL Dataset (mentioning)

confidence: 99%

Open-Domain Sign Language Translation Learned from Online Video

Shi, Brentari, Shakhnarovich et al. 2022

Preprint

Existing work on sign language translation, that is, translation from sign language videos into sentences in a written language, has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale ASL-English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in various domains (news, VLOGs, etc.) from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality over baseline models based on prior work.

“…Visual grounding. Our work is also related to tasks such as natural language grounding in videos [14,24,25,29,40,68,71,72] and subtitle alignment in sign language clips [12]. Transformers.…”

Section: Related Work (mentioning)

confidence: 99%

Visual Keyword Spotting with Attention

Prajwal, Momeni, Afouras et al. 2021

Preprint

Self Cite

In this paper, we consider the task of spotting spoken keywords in silent video sequences, also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
