Interspeech 2022
DOI: 10.21437/interspeech.2022-10961

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Cited by 49 publications
(25 citation statements)
“…In the context of the ZRC series, a fair comparison equating architecture and dataset size would be necessary before claiming a definitive win. In addition, new models combining the two ideas like Masked AutoEncoders are emerging and need to be tested [86], [87].…”
Section: A Task 1: Acoustic Unit Discovery
Citation type: mentioning
confidence: 99%
“…Concretely, for Type I and Type II attacks, as the adversary does not involve in the pre-training phase, we utilize the public MAE 3 and CAE 4 as our target model. This aligns with the threat model that attackers can only get access to the released models.…”
Section: Experimental Settings
Citation type: mentioning
confidence: 99%
“…Compared with contrastive learning which aims to align different augmented views of the same image, MIM learns from predicting properties of masked patches from unmasked parts. It plays as a milestone that bridges the gap between visual and linguistic self-supervised pre-training methods, and has quickly emerged variants in applications such as images [3,5], video [25,29], audio [4], and graph [23]. However, as an iconic method settling in another branch of SSL, the associated security risks caused by the mask-and-predict mechanism and novel architectures of MIM are still unexplored.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…It has been widely used in devices such as smart speakers and mobile phones for home safety or accessibility support [1]. Previous works have shown great progress on AEC using Convolutional Neural Network (CNN) [2,3,4,5] and Audio Spectrogram Transformer (AST) [6,7,8,9,10]. However, such models are usually computationally expensive and not suitable for edge devices (e.g., 86M model parameters for AST [6]).…”
Section: Introduction
Citation type: mentioning
confidence: 99%