Proceedings of the 20th ACM International Conference on Multimodal Interaction 2018
DOI: 10.1145/3242969.3243014
Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

Abstract: Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, desig…
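
The fusion strategy described in the abstract learns an explicit alignment between the acoustic and visual streams instead of concatenating frame-level features directly. The PyTorch sketch below illustrates one way such cross-modal attention can be realized, with audio frames querying the video sequence; the class name, feature dimensions, and the choice of scaled dot-product attention are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionFusion(nn.Module):
    """Align visual features to each audio frame via scaled dot-product attention,
    then fuse by concatenating the audio frame with its visual summary.
    Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, audio_dim=240, video_dim=512, attn_dim=256):
        super().__init__()
        self.query = nn.Linear(audio_dim, attn_dim)   # audio frames act as queries
        self.key = nn.Linear(video_dim, attn_dim)     # video frames act as keys
        self.value = nn.Linear(video_dim, attn_dim)   # ...and values

    def forward(self, audio, video):
        # audio: (B, Ta, audio_dim), video: (B, Tv, video_dim); Ta != Tv in general
        q = self.query(audio)                                          # (B, Ta, attn_dim)
        k = self.key(video)                                            # (B, Tv, attn_dim)
        v = self.value(video)                                          # (B, Tv, attn_dim)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5   # (B, Ta, Tv)
        align = F.softmax(scores, dim=-1)          # soft audio-to-video alignment
        visual_context = torch.bmm(align, v)       # (B, Ta, attn_dim)
        return torch.cat([audio, visual_context], dim=-1)  # fused representation


# usage sketch: 100 audio frames aligned against 30 video frames
fusion = CrossModalAttentionFusion()
fused = fusion(torch.randn(2, 100, 240), torch.randn(2, 30, 512))
print(fused.shape)  # torch.Size([2, 100, 496])
```

Because the attention weights form a soft audio-to-video alignment, the two streams do not need to share a frame rate, which is what distinguishes this kind of fusion from plain feature concatenation.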

Cited by 59 publications (55 citation statements, published 2019–2024). References 18 publications.

“…Neural networks, such as deep neural networks (DNN), recurrent neural networks (RNN) and long short-term memory (LSTM) networks, have been introduced to the field of speech recognition [7][8][9] and AV-ASR [1,[10][11][12][13][14][15][16]. End-to-end neural networks are challenging the dominance of the HMM as a core technology.…”
Section: Related Work
Confidence: 99%
“…Experiments are reported on the TCD-TIMIT corpus [12], a very popular dataset in the field [18,[24][25][26][27][28][29][30][31]. The database contains audio-visual recordings of continuous speech by 62 speakers uttering 6913 phonetically rich TIMIT sentences (6k-word vocabulary) in studio-like conditions, concurrently recorded by two cameras providing frontal (0°) and near-frontal (30°) views at a 1920 × 1080-pixel resolution and a 30 Hz frame rate.…”
Section: Dataset
Confidence: 99%
“…We provide additional system implementation details in Section 3, and we evaluate our developed networks in Section 4. Specifically, we study computational efficiency and VSR accuracy on the publicly available TCD-TIMIT corpus [12], a popular database for lipreading [18,[24][25][26][27][28] and other audio-visual speech processing tasks [29][30][31]. Our experiments show that our best model, a "MobiLipNetV2" with 3D pointwise convolutions, exhibits dramatically improved computational efficiency compared to both a baseline 3D-CNN and a state-of-the-art ResNet, with no or minimal accuracy degradation.…”
Section: Introduction
Confidence: 99%
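
The "MobiLipNetV2" excerpt above attributes its efficiency gains to 3D pointwise convolutions, i.e. depthwise-separable 3D convolutions in the MobileNet style applied to video. A minimal PyTorch sketch of such a block is given below, assuming a 3×3×3 depthwise stage followed by a 1×1×1 pointwise stage; the channel counts, normalization, and input size are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable3D(nn.Module):
    """Depthwise-separable 3D convolution: a per-channel 3x3x3 depthwise
    convolution followed by a 1x1x1 pointwise convolution that mixes channels.
    Channel counts and kernel sizes are illustrative assumptions."""

    def __init__(self, in_ch=64, out_ch=128):
        super().__init__()
        # depthwise: one 3x3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        # pointwise: 1x1x1 convolution mixing channels (the "3D pointwise" step)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (B, in_ch, T, H, W) video feature map
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


# usage sketch: a 16-frame, 44x44-pixel mouth-region clip with 64 feature channels
block = DepthwiseSeparable3D()
out = block(torch.randn(1, 64, 16, 44, 44))
print(out.shape)  # torch.Size([1, 128, 16, 44, 44])
```

Splitting the full 3D convolution into a depthwise and a pointwise stage is what reduces the multiply-accumulate count, which matches the computational-efficiency claim in the excerpt.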
“…Noda [24] makes use of a CNN to extract visual features and combines them with audio features via a multi-stream HMM. Feature fusion followed by a single-stream model is the dominant approach [25,20,26] in AVSR. [26] correlates every frame of acoustic features with a visual context feature acquired by cross-modality attention.…”
Section: Introduction
Confidence: 99%
“…Feature fusion followed by a single-stream model is the dominant approach [25,20,26] in AVSR. [26] correlates every frame of acoustic features with a visual context feature acquired by cross-modality attention. The correlated features are then decoded by an attention-based decoder.…”
Section: Introduction
Confidence: 99%
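
The excerpt above describes acoustic frames being correlated with a visual context via cross-modality attention and then decoded by an attention-based decoder. The sketch below shows one step of such a decoder, assuming a GRU cell with scaled dot-product attention over the fused encoder outputs; the dimensions, cell type, and output vocabulary are illustrative assumptions and not the cited system's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step: attend over the fused audio-visual encoder outputs,
    then update a GRU state and predict the next output token.
    Dot-product attention and all dimensions are illustrative assumptions."""

    def __init__(self, enc_dim=496, hid_dim=256, vocab_size=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.rnn = nn.GRUCell(hid_dim + enc_dim, hid_dim)
        self.proj = nn.Linear(hid_dim, enc_dim)              # project state into query space
        self.out = nn.Linear(hid_dim + enc_dim, vocab_size)  # token classifier

    def forward(self, prev_token, state, enc_out):
        # prev_token: (B,), state: (B, hid_dim), enc_out: (B, T, enc_dim)
        query = self.proj(state).unsqueeze(1)                        # (B, 1, enc_dim)
        scores = torch.bmm(query, enc_out.transpose(1, 2))           # (B, 1, T)
        align = F.softmax(scores / enc_out.size(-1) ** 0.5, dim=-1)  # attention weights
        context = torch.bmm(align, enc_out).squeeze(1)               # (B, enc_dim)
        state = self.rnn(torch.cat([self.embed(prev_token), context], dim=-1), state)
        logits = self.out(torch.cat([state, context], dim=-1))       # (B, vocab_size)
        return logits, state


# usage sketch: predict one output token given 100 fused encoder frames
step = AttentionDecoderStep()
logits, h = step(torch.zeros(2, dtype=torch.long), torch.zeros(2, 256), torch.randn(2, 100, 496))
```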