2018 IEEE International Symposium on Multimedia (ISM)
DOI: 10.1109/ism.2018.00-19
MyLipper: A Personalized System for Speech Reconstruction using Multi-view Visual Feeds

Cited by 21 publications (28 citation statements)
References 26 publications
“…This method is extended in [9] by adding optical flow information as input to the network and by adding a postprocessing step, where generated sound features are replaced by their closest match from the training set. A similar method that uses multi-view visual feeds has been proposed in [10]. Finally, Akbari et.…”
Section: Introduction
confidence: 99%
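The postprocessing step described in this excerpt (replacing each generated sound-feature frame with its closest match from the training set) can be sketched roughly as below. This is a minimal illustrative sketch, not the code of [9]: the feature dimensions, the Euclidean distance metric, and the frame-wise matching are assumptions.

```python
import numpy as np

def nearest_neighbor_postprocess(generated, training_features):
    """Replace each generated sound-feature frame with its closest
    training-set frame (Euclidean distance) -- a nearest-neighbor
    smoothing of the network output.

    generated:          (T, D) array of predicted sound features
    training_features:  (N, D) array of sound features from the training data
    """
    # Pairwise squared Euclidean distances between generated and training frames
    dists = ((generated[:, None, :] - training_features[None, :, :]) ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)        # index of the closest training frame per step
    return training_features[nearest]     # substitute the matched frames

# Toy usage with random features (hypothetical sizes and dimensionality)
rng = np.random.default_rng(0)
gen = rng.normal(size=(100, 25))      # e.g. 100 frames of 25-D spectral features
train = rng.normal(size=(1000, 25))   # pooled training-set sound features
smoothed = nearest_neighbor_postprocess(gen, train)
print(smoothed.shape)                 # (100, 25)
```

The substitution keeps the temporal ordering of the generated sequence but constrains every frame to lie on the training-data manifold, which is the stated purpose of the postprocessing step.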
“…Despite much research in the lipreading domain, it is still treated as a classification task in which, given silent videos, a model has to classify them into a limited, fixed-size lexicon (Lucey and Potamianos 2006; Ngiam et al. 2011; Lee, Lee, and Kim 2016; Zimmermann et al. 2016; Assael et al. 2016; Chung et al. 2016; Petridis et al. 2017; Chung and Zisserman 2017; Shah and Zimmermann 2017). There have also been a few works on speech reconstruction (Cornu and Milner 2015; Kumar et al. 2018a; 2018b). However, the problem of view and pose variation has been dealt with by only a few lipreading systems (Zhou et al. 2014).…”
Section: Related Work
confidence: 99%
“…One such dataset is Oulu-VS2 (Anina et al. 2015), which provides five different views of speakers shot concurrently. On this dataset, combining multiple poses was tried for speechreading by (Petridis et al. 2017) and for speech reconstruction by (Kumar et al. 2018a; 2018b). Given visual feeds from multiple cameras, the authors showed that combining multiple views results in better accuracy for speechreading (Lucey and Potamianos 2006; Zimmermann et al. 2016; Lee, Lee, and Kim 2016; Petridis et al. 2017) and better speech quality for speech reconstruction (Kumar et al. 2018a; 2018b).…”
Section: Related Work
confidence: 99%
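The multi-view combination mentioned in this excerpt can be illustrated with a minimal feature-level fusion sketch, assuming per-view visual features for the same time step are simply concatenated before being fed to the speech-reconstruction network. The array shapes and the concatenation strategy are assumptions for illustration, not the exact pipeline of (Kumar et al. 2018a; 2018b).

```python
import numpy as np

def fuse_views(view_features):
    """Feature-level fusion of multiple camera views: the visual feature
    vectors extracted from each view at the same time step are
    concatenated into a single vector per frame.

    view_features: list of (T, D_v) arrays, one per camera view
    returns:       (T, sum of D_v) fused feature array
    """
    return np.concatenate(view_features, axis=1)

# Toy usage: three hypothetical views, each with 64-D visual features per frame
rng = np.random.default_rng(1)
views = [rng.normal(size=(75, 64)) for _ in range(3)]
fused = fuse_views(views)
print(fused.shape)   # (75, 192) -> input to the speech-reconstruction model
```

Concatenation is the simplest fusion choice; the cited works report that adding views in this spirit improves speechreading accuracy and reconstructed speech quality compared to a single frontal view.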