2018
DOI: 10.1007/978-3-030-01225-0_32

Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild

Abstract: Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural …
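The abstract is cut off, but the components it names (a spatiotemporal visual front-end and a sequence-based keyword encoder whose output is scored against the video) can be illustrated with a minimal sketch. This is not the authors' code: the layer sizes, the character-level GRU standing in for the grapheme-to-phoneme branch, and the max-over-time cosine scoring are assumptions made only for illustration.

```python
# Minimal sketch of a zero-shot visual KWS pipeline, following the high-level
# description in the abstract. All shapes, layers and the scoring rule are assumptions.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """3D-conv front-end over a lip-region clip (B, 1, T, H, W) -> (B, T, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # keep time, squeeze space
        self.proj = nn.Linear(64, dim)

    def forward(self, clip):
        x = torch.relu(self.conv(clip))                  # (B, 64, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)         # (B, 64, T)
        return self.proj(x.transpose(1, 2))              # (B, T, dim)

class KeywordEncoder(nn.Module):
    """Character-level GRU encoder standing in for the G2P keyword branch."""
    def __init__(self, vocab=30, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, chars):                            # (B, L) character ids
        _, h = self.rnn(self.emb(chars))
        return h[-1]                                     # (B, dim)

def keyword_score(frontend, kw_enc, clip, chars):
    """Max-over-time cosine similarity between the keyword and the visual features."""
    feats = frontend(clip)                               # (B, T, dim)
    kw = kw_enc(chars).unsqueeze(1)                      # (B, 1, dim)
    sim = torch.cosine_similarity(feats, kw, dim=-1)     # (B, T)
    return sim.max(dim=1).values                         # (B,) detection score

# Toy usage: one 32-frame 64x64 grayscale clip and a 6-character query.
score = keyword_score(VisualFrontend(), KeywordEncoder(),
                      torch.randn(1, 1, 32, 64, 64),
                      torch.randint(0, 30, (1, 6)))
```

Because the keyword branch is conditioned only on the query's text, the same scoring function can in principle be applied to words never seen during training, which is the zero-shot setting the paper targets.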

Cited by 34 publications (40 citation statements)
References 46 publications
“…First, we use the mouthing-based sign spotting framework of [1] to identify sign locations corresponding to words that appear in the written How2Sign translations. This approach, which relies on the observation that signing sometimes makes use of mouthings in addition to head movements and manual gestures [56], employs the keyword spotting architecture of [53] with the improved P2G phoneme-to-grapheme keyword encoder proposed by Momeni et al [44]. We search for keywords from an initial candidate list of 12K words that result from applying text normalisation [22] to words that appear in How2Sign translations (to ensure that numbers and dates are converted to their written form, e.g.…”
Section: Iterative Enhancement of Video Embeddings
confidence: 99%
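To make the quoted annotation recipe concrete, the sketch below walks through that loop under stated assumptions: `normalise`, `build_candidates`, `annotate_video`, the confidence threshold, and the `spot_keyword` callable are hypothetical placeholders, not the cited works' APIs; the real pipeline uses a trained visual keyword spotter and a dedicated text-normalisation step for numbers and dates.

```python
# Hypothetical sketch of the annotation loop described in the quote above:
# build a candidate keyword list from written translations, then query a
# visual keyword spotter over each video and keep confident hits.
from typing import Callable, Dict, List, Tuple

def normalise(text: str) -> List[str]:
    """Placeholder normalisation: lowercase, alphabetic tokens only.
    (The cited work applies proper text normalisation, e.g. converting
    numbers and dates to their written form.)"""
    return [t for t in text.lower().split() if t.isalpha()]

def build_candidates(translations: List[str]) -> List[str]:
    """Collect the unique normalised words appearing in the translations."""
    return sorted({w for sentence in translations for w in normalise(sentence)})

def annotate_video(video_id: str,
                   candidates: List[str],
                   spot_keyword: Callable[[str, str], Tuple[float, float]],
                   threshold: float = 0.8) -> Dict[str, float]:
    """Return {keyword: time_in_seconds} for spottings above `threshold`.
    `spot_keyword(video_id, word)` stands in for the KWS model and is assumed
    to return a (confidence, time) pair."""
    hits = {}
    for word in candidates:
        confidence, time_sec = spot_keyword(video_id, word)
        if confidence >= threshold:
            hits[word] = time_sec
    return hits
```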
“…Large-scale continuous sign language datasets, on the other hand, are not exhaustively annotated [2,35]. The recent efforts of Albanie et al [2] scale up the automatic annotation of sign language data, and construct the BSL-1K dataset with the help of a visual keyword spotter [30,41] trained on lip reading to detect instances of mouthed words as a proxy for spotting signs. Sign spotting refers to a specialised form of sign language recognition in which the objective is to find whether and where a given sign has occurred within a sequence of signing.…”
Section: Related Work
confidence: 99%
“…Annotations from mouthings. Within the active signer segments produced by SDTRACK, we apply the sign spotting method proposed by [1] using the improved visual-only keyword spotting model of Stafylakis et al [36] from [28] (referred to in their paper as "P2G [36] baseline").…”
Section: The Seehear Dataset
confidence: 99%