2022
DOI: 10.48550/arxiv.2202.01405
Preprint

Joint Speech Recognition and Audio Captioning

Abstract: Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language de…

Cited by 2 publications (2 citation statements)
References 20 publications

“…The system relies on multi-task learning (MTL) and has been effective in accent classification and speech recognition tasks. Based on MTL, the approach showed significant improvements in WER, ranging from 17.25 to 59.90, over single-task baseline models 15,16,17.…”
Section: Related Work (mentioning)
confidence: 98%
“…They proposed an adversarial training framework based on generative adversarial network (GAN) [63] to encourage the diversity of audio captioning systems. In addition, Narisetty et al. [64] proposed approaches for end-to-end joint modeling of speech recognition and audio captioning tasks.…”
Section: Other Approaches (mentioning)
confidence: 99%