2018
DOI: 10.1007/978-3-030-01225-0_32

Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild

Abstract: Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural …
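The abstract is cut off, but the components it names (a spatiotemporal visual front-end and a sequence-based keyword encoder whose output is scored against the video) can be illustrated with a minimal sketch. This is not the authors' code: the layer sizes, the character-level GRU standing in for the grapheme-to-phoneme branch, and the max-over-time cosine scoring are assumptions made only for illustration.

```python
# Minimal sketch of a zero-shot visual KWS pipeline, following the high-level
# description in the abstract. All shapes, layers and the scoring rule are assumptions.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """3D-conv front-end over a lip-region clip (B, 1, T, H, W) -> (B, T, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # keep time, squeeze space
        self.proj = nn.Linear(64, dim)

    def forward(self, clip):
        x = torch.relu(self.conv(clip))                  # (B, 64, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)         # (B, 64, T)
        return self.proj(x.transpose(1, 2))              # (B, T, dim)

class KeywordEncoder(nn.Module):
    """Character-level GRU encoder standing in for the G2P keyword branch."""
    def __init__(self, vocab=30, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, chars):                            # (B, L) character ids
        _, h = self.rnn(self.emb(chars))
        return h[-1]                                     # (B, dim)

def keyword_score(frontend, kw_enc, clip, chars):
    """Max-over-time cosine similarity between the keyword and the visual features."""
    feats = frontend(clip)                               # (B, T, dim)
    kw = kw_enc(chars).unsqueeze(1)                      # (B, 1, dim)
    sim = torch.cosine_similarity(feats, kw, dim=-1)     # (B, T)
    return sim.max(dim=1).values                         # (B,) detection score

# Toy usage: one 32-frame 64x64 grayscale clip and a 6-character query.
score = keyword_score(VisualFrontend(), KeywordEncoder(),
                      torch.randn(1, 1, 32, 64, 64),
                      torch.randint(0, 30, (1, 6)))
```

Because the keyword branch is conditioned only on the query's text, the same scoring function can in principle be applied to words never seen during training, which is the zero-shot setting the paper targets.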

Cited by 34 publications (40 citation statements)
References 46 publications
“…First, we use the mouthing-based sign spotting framework of [1] to identify sign locations corresponding to words that appear in the written How2Sign translations. This approach, which relies on the observation that signing sometimes makes use of mouthings in addition to head movements and manual gestures [56], employs the keyword spotting architecture of [53] with the improved P2G phoneme-to-grapheme keyword encoder proposed by Momeni et al [44]. We search for keywords from an initial candidate list of 12K words that result from applying text normalisation [22] to words that appear in How2Sign translations (to ensure that numbers and dates are converted to their written form, e.g.…”
Section: Iterative Enhancement of Video Embeddings
confidence: 99%
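To make the quoted annotation recipe concrete, the sketch below walks through that loop under stated assumptions: `normalise`, `build_candidates`, `annotate_video`, the confidence threshold, and the `spot_keyword` callable are hypothetical placeholders, not the cited works' APIs; the real pipeline uses a trained visual keyword spotter and a dedicated text-normalisation step for numbers and dates.

```python
# Hypothetical sketch of the annotation loop described in the quote above:
# build a candidate keyword list from written translations, then query a
# visual keyword spotter over each video and keep confident hits.
from typing import Callable, Dict, List, Tuple

def normalise(text: str) -> List[str]:
    """Placeholder normalisation: lowercase, alphabetic tokens only.
    (The cited work applies proper text normalisation, e.g. converting
    numbers and dates to their written form.)"""
    return [t for t in text.lower().split() if t.isalpha()]

def build_candidates(translations: List[str]) -> List[str]:
    """Collect the unique normalised words appearing in the translations."""
    return sorted({w for sentence in translations for w in normalise(sentence)})

def annotate_video(video_id: str,
                   candidates: List[str],
                   spot_keyword: Callable[[str, str], Tuple[float, float]],
                   threshold: float = 0.8) -> Dict[str, float]:
    """Return {keyword: time_in_seconds} for spottings above `threshold`.
    `spot_keyword(video_id, word)` stands in for the KWS model and is assumed
    to return a (confidence, time) pair."""
    hits = {}
    for word in candidates:
        confidence, time_sec = spot_keyword(video_id, word)
        if confidence >= threshold:
            hits[word] = time_sec
    return hits
```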
“…Large-scale continuous sign language datasets, on the other hand, are not exhaustively annotated [2,35]. The recent efforts of Albanie et al [2] scale up the automatic annotation of sign language data, and construct the BSL-1K dataset with the help of a visual keyword spotter [30,41] trained on lip reading to detect instances of mouthed words as a proxy for spotting signs. Sign spotting refers to a specialised form of sign language recognition in which the objective is to find whether and where a given sign has occurred within a sequence of signing.…”
Section: Related Work
confidence: 99%
“…Annotations from mouthings. Within the active signer segments produced by SDTRACK, we apply the sign spotting method proposed by [1] using the improved visual-only keyword spotting model of Stafylakis et al [36] from [28] (referred to in their paper as "P2G [36] baseline").…”
Section: The Seehear Dataset
confidence: 99%