Proceedings of the First International Workshop on Natural Language Processing Beyond Text 2020
DOI: 10.18653/v1/2020.nlpbt-1.2

Multimodal Speech Recognition with Unstructured Audio Masking

Abstract: Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus…
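The RandWordMask setting described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes word-level time alignments are available for each utterance, and rand_word_mask, word_spans, and mask_prob are hypothetical names; the paper's exact corruption (silence vs. noise, masking rate) may differ.

```python
import random

def rand_word_mask(audio, word_spans, mask_prob=0.3):
    """Illustrative RandWordMask-style corruption: zero out randomly
    chosen word segments in a raw audio signal.

    audio      -- list of samples
    word_spans -- (start, end) sample indices, one pair per word
    mask_prob  -- probability of masking each word segment
    """
    masked = list(audio)
    for start, end in word_spans:
        if random.random() < mask_prob:
            # Replace this word's samples with silence (zeros).
            for i in range(start, end):
                masked[i] = 0.0
    return masked

# Hypothetical usage: three word segments in a one-second 16 kHz clip
clip = [0.1] * 16000
spans = [(0, 4000), (4000, 9000), (9000, 16000)]
corrupted = rand_word_mask(clip, spans, mask_prob=0.5)
```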

Cited by 9 publications (4 citation statements) · References 24 publications

Citation statements:
“…According to Skinner (1957), children learn language through reinforcement principles, by associating words with their meanings. Based on this research, the authors show that language acquisition is grounded in interpreting the surrounding environment and in building knowledge through experience.…”
Section: Related Work (mentioning)
confidence: 99%
“…We explore a knowledge adapter to address this challenge, which can be used as a plug-in without changing the original SLU structure. Inspired by Srinivasan et al. (2020), we adopt the hierarchical attention fusion mechanism (Luong, Pham, and Manning 2015; Libovický and Helcl 2017) as the knowledge adapter, which has the advantage of dynamically attending to the supporting information that is relevant for each word. Specifically, given the query vector $\mathbf{q}$ and the corresponding supporting information $\mathbf{H}_{\text{Info}} = [\mathbf{h}_{\text{KG}}; \mathbf{h}_{\text{UP}}; \mathbf{h}_{\text{CA}}] \in \mathbb{R}^{3 \times d_i}$, we obtain the updated representation $\tilde{\mathbf{q}} = \text{Knowledge-Adapter}(\mathbf{q}, \mathbf{H}_{\text{Info}})$ as a weighted sum over the supporting representations:…”
Section: Multi-level Knowledge Adapter (mentioning)
confidence: 99%
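To make the quoted fusion step concrete, here is a minimal NumPy sketch of an attention-weighted sum over the three supporting vectors. It is an illustration under stated assumptions, not the cited paper's implementation: dot-product scoring stands in for whichever attention variant the authors use, learned projections are omitted, and the names knowledge_adapter, q, and h_info are hypothetical.

```python
import numpy as np

def knowledge_adapter(q, h_info):
    """Attention-weighted fusion of supporting information.

    q      -- query vector, shape (d,)
    h_info -- stacked supporting vectors [h_KG; h_UP; h_CA], shape (3, d)
    Returns the updated representation, shape (d,)
    """
    # Dot-product scores between the query and each supporting source
    # (a simplification of Luong-style attention; no learned parameters).
    scores = h_info @ q                      # shape (3,)
    # Softmax over the three sources gives the attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of the supporting vectors is the updated representation.
    return weights @ h_info                  # shape (d,)

# Hypothetical usage with d = 4
rng = np.random.default_rng(0)
q = rng.normal(size=4)
h_info = rng.normal(size=(3, 4))            # rows: h_KG, h_UP, h_CA
print(knowledge_adapter(q, h_info))
```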