2014 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2014.6907840

Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction

Cited by 13 publications (8 citation statements) · References 13 publications

“…Filler models are based on context-independent phonemes or visemes, each modeled with a 3-state HMM. Since the performance of the conventional cascade strategy, which performs visual re-scoring on the acoustic hypothesis, drops significantly under acoustic noise, a parallel strategy is proposed to exploit the two modalities complementarily, as illustrated in Fig. 5 [22]. Acoustic and visual keyword searches are first conducted in parallel, generating acoustic and visual keyword candidates with corresponding log-likelihoods.…”
Section: Adaptive Decision Fusion
confidence: 99%
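As a rough illustration of this parallel strategy, the Python sketch below fuses per-modality keyword candidates by a weighted sum of their log-likelihoods. The dict-based candidate format and the fixed weight `alpha` are assumptions for illustration; the cited paper adapts the weight to the noise conditions rather than fixing it.

```python
# Minimal sketch: fuse candidates from two parallel keyword searches
# by a weighted sum of log-likelihoods. Candidate format and fixed
# alpha are assumptions, not the paper's exact interface.

def fuse_keyword_candidates(acoustic, visual, alpha=0.5):
    """acoustic, visual: {keyword: log_likelihood} produced by the
    independent acoustic and visual keyword searches.
    alpha: acoustic weight in [0, 1] (adapted to noise in the paper)."""
    fused = {}
    for kw in set(acoustic) | set(visual):
        la = acoustic.get(kw, float("-inf"))  # floor for missing candidates
        lv = visual.get(kw, float("-inf"))
        fused[kw] = alpha * la + (1.0 - alpha) * lv
    return max(fused, key=fused.get) if fused else None

# Example: the acoustic search prefers "stop", the visual search "go";
# down-weighting noisy audio (alpha=0.3) lets the visual evidence win.
best = fuse_keyword_candidates(
    {"stop": -12.3, "go": -15.0},
    {"stop": -20.1, "go": -14.2},
    alpha=0.3,
)
print(best)  # -> "go"
```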
“…This is also the case for works on multimodal intention recognition in contexts other than elderly assistance [12]. Some specialized approaches to audio-visual speech recognition [20], [21] considered uncertainty by applying uncertainty-based weighting to the fusion of multiple classifiers' outputs. In these works, the two categorical probability distributions returned by the individual audio and visual classifiers were combined by a weighted sum.…”
Section: Related Work
confidence: 99%
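A minimal sketch of that weighted-sum combination, assuming each classifier returns a categorical distribution over the same classes and that the reliability-based weight is supplied externally (how it is estimated differs between the cited works):

```python
def weighted_sum_fusion(p_audio, p_visual, w_audio):
    """Combine two categorical distributions by a weighted sum.

    p_audio, p_visual: class probabilities from the audio and visual
    classifiers (same class order, each summing to 1).
    w_audio: reliability weight for audio in [0, 1]; assumed given here.
    """
    w_visual = 1.0 - w_audio
    return [w_audio * pa + w_visual * pv
            for pa, pv in zip(p_audio, p_visual)]

# Example: with unreliable (noisy) audio, w_audio = 0.3 shifts the
# decision toward the visual classifier's preferred class.
fused = weighted_sum_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], 0.3)
predicted = max(range(len(fused)), key=fused.__getitem__)  # -> 1
```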
“…The input speech and lip-movement video are recognized as class k*. For decision-level fusion, the input speech and the lip-movement features are used to train separate HMM classifiers. The input speech and lip movement are then classified together as class c* using [4]:…”
Section: Our Proposed Audio-visual
confidence: 99%
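The excerpt omits the equations it refers to. For orientation, a commonly used form of these HMM decision rules is given below; the notation is an assumption, not quoted from the cited paper.

```latex
% Assumed notation: O_A, O_V are the audio/visual observation sequences,
% \lambda_c^A, \lambda_c^V the per-class HMMs, w a reliability weight.

% Single-modality decision (audio shown; visual is analogous):
k^{*} = \arg\max_{k} \log P(O_A \mid \lambda_k^{A})

% Decision-level fusion of the separately trained HMM classifiers:
c^{*} = \arg\max_{c} \left[ w \log P(O_A \mid \lambda_c^{A})
        + (1 - w) \log P(O_V \mid \lambda_c^{V}) \right]
```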
“…Liu et al. proposed a mechanism for decision-level fusion of the audio and visual features, and introduced a weighting scheme between the modalities by means of reliability measures [4]. All of this research focuses on non-Thai commands.…”
Section: Introduction
confidence: 99%