2022
DOI: 10.3390/s22207738

Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Abstract: Speech is a commonly used interaction-recognition technique in edutainment-based systems and a key technology for smooth educational learning and user–system interaction. However, its application in real environments is limited by various noise disruptions. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recogniti…

Cited by 5 publications (7 citation statements)
References 49 publications
“…In addition, when comparing A + V(L + S) against A and V, it improved by 2.228% and 2.305%, respectively, and we demonstrated that adding the log-Mel spectrogram of speech to visual information helped improve performance in noisy environments. Unlike the previous OCSR API [31, 35], which exhibited excellent performance only in certain environments, it exhibited similar performance in all nine noise environments and an evenly stable performance overall.…”
Section: Results (mentioning)
confidence: 81%
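The statement above attributes the noise robustness to appending a log-Mel spectrogram of the speech signal to the visual stream. As a rough illustration of that kind of front end (not the cited paper's exact pipeline; the 16 kHz sample rate, 80 Mel bins, and the librosa calls below are assumptions), a log-Mel feature could be computed as:

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000, n_mels=80, n_fft=400, hop=160):
    """Illustrative log-Mel front end; parameter values are assumptions,
    not taken from the cited paper."""
    # Load audio at a fixed sample rate (16 kHz is typical for ASR).
    y, _ = librosa.load(wav_path, sr=sr)
    # Mel-scaled power spectrogram: n_fft=400 / hop=160 gives 25 ms windows
    # with a 10 ms stride at 16 kHz.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels
    )
    # Log compression stabilizes dynamic range before fusion with visual features.
    return np.log(mel + 1e-6)  # shape: (n_mels, n_frames)
```

The resulting (n_mels, n_frames) array can be treated as an image-like feature map, which is what allows it to be stacked alongside lip frames as "visual" information.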
“…Depending on the composition of the auditory and visual information, seven models were evaluated: A, V(L) for visual information using only lip information, V(L + S) for visual information using lip and log-Mel spectrogram information, and A + V(L + S) for auditory, lip, and log-Mel spectrogram information. The single-modal approach using only auditory data (black dotted lines) exhibited approximately 3–4% WER in nine noisy environments, and the OCSR API, which performed well only in certain environments, such as indoors or vehicles, improved on the quantitative performance reported in two previous studies [31, 35]. Therefore, in the clean environment listed in Table S4, the OCSR API exhibited an accuracy of 3.939 ± 0.313% (Table S5).…”
Section: Performance in Noisy Environments (mentioning)
confidence: 87%
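For readers parsing the model names: A + V(L + S) denotes fusing an audio stream with two visual-side streams, lip frames and the spectrogram image. A minimal concatenation-based fusion head, written as a generic sketch rather than the cited architecture (all layer sizes and the PyTorch framing are assumptions), could look like:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Toy fusion of audio, lip, and spectrogram embeddings (an illustrative
    A + V(L + S) configuration, not the cited paper's architecture)."""

    def __init__(self, d_audio=256, d_lip=256, d_spec=256, n_classes=40):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_audio + d_lip + d_spec, 512),  # placeholder sizes
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, audio_emb, lip_emb, spec_emb):
        # Late fusion: concatenate per-utterance embeddings from each stream.
        fused = torch.cat([audio_emb, lip_emb, spec_emb], dim=-1)
        return self.classifier(fused)
```

Dropping spec_emb from the concatenation would give the A + V(L) variant, which is one way to read the ablation the statement describes.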
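Both statements report quality as word error rate (WER) across nine noise environments. As a reference for the metric itself, here is a minimal standard implementation (plain word-level Levenshtein alignment; not code from the cited paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word in a four-word reference -> WER = 0.25 (25%).
print(wer("feed the big fish", "feed the bag fish"))
```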