Interspeech 2020
DOI: 10.21437/interspeech.2020-1191
A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences

Abstract: Assessment of many audio processing tasks relies on subjective evaluation, which is time-consuming and expensive. Efforts have been made to create objective metrics, but existing ones correlate poorly with human judgment. In this work, we construct a differentiable metric by fitting a deep neural network on a newly collected dataset of just-noticeable differences (JND), in which humans annotate whether a pair of audio clips are identical or not. By varying the type of differences, including noise, reverb, and co…

Cited by 41 publications (47 citation statements) · References 24 publications
“…To assess the imperceptibility of the adversarial attacks, we use two perceptual audio metrics: Perceptual Evaluation of Speech Quality (PESQ) [12] and Just Noticeable Difference (JND) [28]. PESQ scores cover a scale from 1 (bad) to 5 (excellent) [12].…”
Section: Results
confidence: 99%
“…PESQ scores cover a scale from 1 (bad) to 5 (excellent) [12]. JND is defined as the l1 norm of the difference between the representation of the original and adversarial audio files, computed by a neural network trained using pairs of audio files whose similarity was judged by humans [28]. Our preliminary experiments showed that JND scores are bounded between 0.0 (same audio file) and ~5.0 (signal vs. random noise).…”
Section: Results
confidence: 99%
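The JND distance described in the statement above, an l1 norm over deep network representations, can be sketched as follows. The per-layer feature lists stand in for the output of a trained feature extractor, which is assumed rather than shown; names and shapes here are illustrative, not the trained network from [28]:

```python
import numpy as np

def deep_feature_l1_distance(feats_ref, feats_test):
    """l1 distance between the deep representations of two audio clips.

    feats_ref / feats_test: lists of per-layer feature arrays produced by
    some trained audio network (the extractor itself is assumed here).
    Identical inputs yield 0.0; larger values suggest a larger
    perceptual difference.
    """
    return float(sum(np.abs(r - t).sum()
                     for r, t in zip(feats_ref, feats_test)))
```

In practice the per-layer terms are often normalized or weighted before summing; the plain sum above is simply the most direct reading of "l1 norm of the difference between the representations".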
“…The VGG distance between two audio samples is defined as the distance between their embeddings created by the VGGish network pre-trained on audio classification [28]. A recent work on speech processing shows that the distance between deep embeddings correlates better to human evaluation, compared to hand-crafted metrics such as Perceptual Evaluation of Speech Quality (PESQ) [60] and the Virtual Speech Quality Objective Listener (ViSQOL) [61], across various audio enhancement tasks including bandwidth extension [62]. The VGGish embeddings are also used in measuring the Fréchet Audio Distance (FAD), a state-of-the-art evaluation method to assess the perceptual quality of a collection of output samples [63].…”
Section: Dataset
confidence: 99%
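FAD, mentioned in the statement above, compares two collections of embeddings by fitting a Gaussian to each and taking the Fréchet (2-Wasserstein) distance between the Gaussians. A minimal numpy-only sketch, assuming the VGGish embeddings have already been computed (the embedding step itself is not shown):

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_a, emb_b: (n_samples, dim) arrays, e.g. VGGish embeddings of a
    reference collection and a generated collection.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) computed via the symmetric form
    # Tr((S_a^{1/2} S_b S_a^{1/2})^{1/2}) so eigh stays applicable.
    sa = _sqrtm_psd(cov_a)
    tr_sqrt = np.trace(_sqrtm_psd(sa @ cov_b @ sa))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * tr_sqrt)
```

Identical collections give a distance near zero; shifting every embedding by a constant vector shifts only the mean term, which makes the metric easy to sanity-check.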
“…Alternatively, various differentiable metrics are proposed like Perceptual Metric for Speech Quality Evaluation (PMSQE) [4], Semi-supervised Speech Quality Assessment (SESQA) [5], and a metric based on Just Noticeable Differences (JNDs) [6]. In [5], authors proposed SESQA which is an eight-loss Multi-Task Learning (MTL) based semi-supervised framework for automated speech quality assessment.…”
Section: Introduction
confidence: 99%
“…In addition to predicting MOS, they define various auxiliary tasks, including predicting JND, pairwise comparisons, and degradation strength. The authors of [6] took a different approach, constructing a large corpus based on (binary) human preferences over audio pairs. By capturing JNDs, they are able to train a differentiable metric that correlates well with human MOS ratings.…”
Section: Introduction
confidence: 99%
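The binary-preference training idea in the last statement can be illustrated with a logistic head on top of the learned distance. The scale and bias parameters and the exact loss form below are assumptions for illustration, not the paper's precise recipe:

```python
import numpy as np

def jnd_bce_loss(dist, label_different, w=1.0, b=0.0):
    """Binary cross-entropy on a human JND judgment (illustrative sketch).

    dist: differentiable distance between a pair of audio clips.
    label_different: 1.0 if listeners heard a difference, else 0.0.
    w, b: assumed logistic-head parameters mapping distance to
    P(different). Because every step is differentiable, the metric can
    be trained end-to-end against these binary human labels.
    """
    p = 1.0 / (1.0 + np.exp(-(w * dist + b)))
    eps = 1e-12  # numerical guard for log
    return float(-(label_different * np.log(p + eps)
                   + (1.0 - label_different) * np.log(1.0 - p + eps)))
```

The loss rewards large distances on pairs judged "different" and small distances on pairs judged "identical", which is how binary JND annotations can supervise a continuous metric.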