2020
DOI: 10.1109/lsp.2020.3000968

Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement

Abstract: Integrating modalities, such as video signals, with speech has been shown to improve the quality and intelligibility of speech enhancement (SE). However, video clips usually contain large amounts of data and incur a high computational cost, which may complicate the respective SE. By contrast, a bone-conducted speech signal has a moderate data size while manifesting speech-phoneme structures, and thus complements its air-conducted counterpart, benefiting the enhancement. In this st…
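
The truncated abstract stops before describing the model, but the title indicates a time-domain network that fuses air-conducted and bone-conducted waveforms. As a rough illustration only, not the authors' architecture, the following PyTorch sketch shows one common way to realize such time-domain fusion; every layer size, the concatenation-based fusion strategy, and the class name are assumptions:

```python
# Minimal sketch (not the paper's actual model): a time-domain
# encoder-fusion-decoder that enhances an air-conducted waveform using a
# synchronized bone-conducted waveform. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class MultiModalTimeDomainSE(nn.Module):
    def __init__(self, channels: int = 64, kernel: int = 16, stride: int = 8):
        super().__init__()
        # Separate 1-D conv encoders map each raw waveform to latent frames.
        self.air_enc = nn.Conv1d(1, channels, kernel, stride=stride)
        self.bone_enc = nn.Conv1d(1, channels, kernel, stride=stride)
        # Fusion: concatenate the two latent streams along the channel axis,
        # then mix them with a pointwise convolution.
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)
        self.act = nn.ReLU()
        # Transposed conv decoder maps fused features back to a waveform.
        self.dec = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, air: torch.Tensor, bone: torch.Tensor) -> torch.Tensor:
        # air, bone: (batch, 1, samples), assumed time-aligned.
        z = torch.cat([self.air_enc(air), self.bone_enc(bone)], dim=1)
        z = self.act(self.fuse(z))
        return self.dec(z)

# Usage: one second of 16 kHz audio from both microphones.
model = MultiModalTimeDomainSE()
air = torch.randn(2, 1, 16000)   # noisy air-conducted speech
bone = torch.randn(2, 1, 16000)  # low-noise, band-limited bone-conducted speech
enhanced = model(air, bone)
print(enhanced.shape)  # torch.Size([2, 1, 16000])
```

Roughly speaking, the bone-conducted channel is robust to ambient noise but lacks high-frequency content, so a fusion of this kind lets the network borrow noise-robust low-frequency structure from the bone sensor while recovering spectral detail from the air-conducted signal.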

Cited by 39 publications (19 citation statements). References 47 publications.

“…Multimodal learning [25] aims to learn related information from multiple modalities and to fill in the missing modality given the observed ones. Numerous studies have investigated the effectiveness of incorporating different features into speech-related systems, including text [27]-[30], videos [31]-[33], bone-conducted microphone signals [34], electropalatography [35], and articulatory movements [36]-[38].…”
Section: B. Multimodal Learning
confidence: 99%

“…Another well-known advantage of DL models is that they can flexibly fuse data from different domains [64], [65]. Recently, researchers have tried to incorporate text [66], bone-conducted signals [67], and visual cues [68], [69], [70], [71], [72], [73] into speech applications as auxiliary and complementary information to achieve better performance. Among them, visual cues are the most common and intuitive because most devices can capture audio and visual data simultaneously.…”
Section: Introduction
confidence: 99%