Audio Captioning Transformer
2021 · Preprint
DOI: 10.48550/arxiv.2107.09817

Cited by 8 publications (10 citation statements)
References 15 publications
“…1(b). As both ASR and AAC systems output word sequences, state-of-the-art AAC models also follow a Transformer-based encoder-decoder framework, optimized using attention loss [15,14]. Note that the CTC loss or RNN-Transducers are not particularly applicable to the AAC task, because the token sequence in a caption need not be temporally aligned with the input spectrogram frames.…”
Section: Automatic Speech Recognition
confidence: 99%
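The excerpt above notes that caption tokens need not be temporally aligned with spectrogram frames, which is why attention-based decoding fits AAC while CTC does not. A minimal numpy sketch of cross-attention (not code from the cited papers; names and dimensions are illustrative) shows why: each decoder token query attends over all encoder frames, so the output length tracks the number of caption tokens, not the number of frames.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every decoder token query attends
    over every encoder frame, so no frame-level alignment is needed."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_tokens, n_frames)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (n_tokens, d)

rng = np.random.default_rng(0)
n_frames, n_tokens, d = 500, 12, 64          # 500 spectrogram frames, 12 caption tokens
frames = rng.standard_normal((n_frames, d))  # encoder frame embeddings
tokens = rng.standard_normal((n_tokens, d))  # decoder token states
out = cross_attention(tokens, frames, frames)
print(out.shape)  # (12, 64): one context vector per caption token, not per frame
```

A frame-synchronous objective like CTC would instead require one output per frame, which has no natural interpretation for free-form captions.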
See 1 more Smart Citation
“…1(b). As both ASR and AAC systems output word sequences, state-of-theart AAC models also follow a Transformer based encoder-decoder framework, optimized using attention loss [15,14]. Note that the CTC loss or RNN-Transducers are not particularly applicable to the AAC task, because the token sequence in a caption need not be temporally aligned with the input spectrogram frames.…”
Section: Automatic Speech Recognitionmentioning
confidence: 99%
“…Automated audio captioning (AAC) aims to convey the constituent audio sources and events in a structured and easily comprehensible manner, i.e., as a natural language description of a given audio waveform [12,13]. Recently, Transformer-based encoder-decoder frameworks have been employed to model the temporal structure of audio events [14,15]. AAC is an emerging research area with several applications, such as enriching the raw textual information provided by ASR during television broadcasting and video streaming. Such integration of ASR and AAC tasks can potentially improve the viewing experience of the hearing impaired.…”
Section: Introduction
confidence: 99%
“…We reproduce the baselines without the unavailable audio clips for a fair comparison with the proposed methods. The Audio Captioning Transformer (ACT) [12] adopts an encoder-decoder structure based on the Transformer [13]. The encoder block of ACT is initialized with DeiT [14], which was trained for image classification and then pre-trained on an audio tagging task using AudioSet.…”
Section: Dataset
confidence: 99%
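The excerpt above describes initializing ACT's audio encoder from DeiT, an image model. This works because a mel-spectrogram can be cut into fixed-size patches and flattened into a token sequence, the same input format a ViT/DeiT patch embedding expects. A hedged numpy sketch (sizes and the `patchify` helper are illustrative, not the ACT implementation):

```python
import numpy as np

def patchify(spectrogram, patch=16):
    """Split a 2-D mel-spectrogram into non-overlapping patch x patch tiles
    and flatten each tile, mirroring the ViT/DeiT patch-embedding input."""
    f, t = spectrogram.shape
    f, t = f - f % patch, t - t % patch      # crop to a multiple of the patch size
    tiles = (spectrogram[:f, :t]
             .reshape(f // patch, patch, t // patch, patch)
             .transpose(0, 2, 1, 3)          # group the two tile axes together
             .reshape(-1, patch * patch))    # (n_patches, patch*patch)
    return tiles

mel = np.random.default_rng(0).standard_normal((128, 1000))  # 128 mel bins x 1000 frames
tokens = patchify(mel)
print(tokens.shape)  # (496, 256): 8 x 62 tiles, each flattened to 256 values
```

Each flattened tile would then be linearly projected to the encoder's embedding dimension, at which point image-pretrained Transformer weights can process the audio "image" directly.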
“…Transformers have also been applied to some audio modeling tasks. For example, Transformer-based models have been used for audio classification (Gong et al., 2021; Verma & Berger, 2021), captioning (Mei et al., 2021), compression (Dieleman et al., 2021), speech recognition (Gulati et al., 2020), speaker separation (Subakan et al., 2021), and enhancement (Koizumi et al., 2021). Transformers have also been used for generative audio models (Verma & Chafe, 2021), which in turn have enabled further tasks in music understanding (Castellon et al., 2021).…”
Section: Transformers for Sequence Modeling
confidence: 99%