2020
DOI: 10.48550/arxiv.2001.00378
Preprint

Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends

Siddique Latif,
Rajib Rana,
Sara Khalifa
et al.

Abstract: Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions. There are two main drawbacks to this approach: firstly, the feature engineering being manual is cumbersome and requires human knowledge; and secondly, the designed features might not be best for the objective at hand. This has motiv…

Cited by 25 publications (39 citation statements)
References 250 publications (323 reference statements)
“…With the advent of deep learning, there has been a shift from traditional human-crafted emotional features such as those extracted by low level descriptors (LLDs) [44] or openSMILE [45], to the features automatically learned by deep neural networks (DNN) [46]. Many studies [46,47] have shown that the deep features learned by DNN are more effective and thus more suitable for SER. Meanwhile, recent speech synthesis studies [48,49] also propose to leverage those deep emotional features to characterize different emotional styles over a continuum [50].…”
Section: Speaker-dependent Emotional Style
confidence: 99%
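The excerpt above contrasts two routes to utterance-level features for speech emotion recognition: pooling handcrafted low-level descriptors versus embedding frames with a trained network. A minimal numpy sketch of both routes, under illustrative assumptions (random toy features, and a single untrained dense layer standing in for a DNN encoder that would in practice be learned on an emotion objective):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one utterance: 100 frames of 40-dim log-mel features.
frames = rng.standard_normal((100, 40))

# Handcrafted route (LLD-style): pool simple per-dimension frame statistics
# into one fixed-length utterance vector.
lld_features = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])  # 80-dim

# Learned route (sketch): one dense layer standing in for a trained DNN encoder;
# frame-level hidden activations are mean-pooled to an utterance embedding.
W = rng.standard_normal((40, 64)) * 0.1
hidden = np.tanh(frames @ W)         # frame-level deep features, shape (100, 64)
deep_features = hidden.mean(axis=0)  # utterance-level embedding, 64-dim

print(lld_features.shape, deep_features.shape)
```

The practical difference the cited studies point to is that the weights `W` are optimized for the task, so the pooled embedding can capture emotion-relevant structure that fixed statistics miss.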
“…for downstream tasks. In speech representation learning (Latif et al, 2020), unsupervised techniques such as autoregressive modeling (Chung, Hsu, Tang and Glass, 2019; Chung and Glass, 2020a,b) and self-supervised modeling (Milde and Biemann, 2018; Tagliasacchi, Gfeller, Quitry and Roblek, 2019; Pascual, Ravanelli, Serrà, Bonafonte and Bengio, 2019) employ temporal context information for extracting speech representation. In our prior behavior modeling work, an unsupervised representative learning framework was proposed (Li, Baucom and Georgiou, 2017), which showed the promise of learning behavior representations based on the behavior stationarity hypothesis that nearby segments of speech share the same behavioral context.…”
Section: Related Work and Motivation
confidence: 99%
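The autoregressive modeling the excerpt cites (in the style of Chung et al.'s APC) trains an encoder to predict a future frame from the frames seen so far, and keeps the encoder's hidden state as the representation. A numpy sketch of one forward pass of that objective, with all dimensions illustrative and untrained linear maps standing in for the RNN/transformer encoder and the predictor head:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, H, shift = 50, 40, 32, 3   # frames, feature dim, hidden dim, prediction shift

x = rng.standard_normal((T, D))  # toy log-mel sequence

# Untrained linear "encoder" and "predictor" standing in for learned networks.
W_enc = rng.standard_normal((D, H)) * 0.1
W_pred = rng.standard_normal((H, D)) * 0.1

h = np.tanh(x @ W_enc)           # h[t]: representation built from frame t
x_hat = h @ W_pred               # prediction of the frame `shift` steps ahead

# Autoregressive objective: L1 error between x_hat[t] and the true x[t + shift].
loss = np.abs(x_hat[:-shift] - x[shift:]).mean()
print(float(loss))
```

Training would minimize `loss` by gradient descent on `W_enc` and `W_pred`; afterwards `h` is taken as the learned speech representation, with no labels ever needed.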
“…Recently, unsupervised and self-supervised learning (Latif, Rana, Khalifa, Jurdak, Qadir and Schuller, 2020; Chen, Kornblith, Norouzi and Hinton, 2020) have shown the benefits of using large amounts of unlabelled data to extract informative representations. Given the low availability of annotated behavioral data sets, representation learning through unsupervised ways can provide a promising avenue for behavioral modeling.…”
Section: Introduction
confidence: 99%
“…In the first case, the output representation is usually obtained via a mathematical projection rule, like principal component analysis [5] and linear discriminant analysis [6], or a specific extraction scheme making use of a set of pre-defined handcrafted rules [7], [8], [9], [10]. In contrast, the second case employs a machine learning algorithm, in order to discover salient features from raw data [11], [12], [13].…”
Section: Introduction
confidence: 99%
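The "mathematical projection rule" the excerpt mentions for PCA can be written in a few lines: center the data and project onto the top-k right singular vectors of the centered matrix. A self-contained sketch with toy data (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))   # 200 samples, 10-dim raw features

# PCA as a fixed projection rule: center, then keep the top-k principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
k = 3
Z = Xc @ Vt[:k].T                    # projected 3-dim representation

print(Z.shape)
```

Unlike the learned-feature route discussed in the other excerpts, nothing here is fit to a downstream objective: the projection depends only on the data's covariance structure, which is exactly the limitation that motivates learning features from raw data instead.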
“…Due to the complex nature of real-world sensory inputs in present problems, the manual definition of descriptive features becomes increasingly unreliable, leaving no option but to engage machine learning schemes. Typical algorithms are Support Vector Machines (SVMs) [11] and Deep Neural Networks (DNNs) [13], [14], with DNNs forming the leading choice in feature extraction for cascade [15], [16] and fusion tasks [17], [18] given their proven efficacy over the past years. Yet, in order for a learning algorithm to ensure robust representation capacity, a set of techniques needs to be applied during its training phase.…”
Section: Introduction
confidence: 99%
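The excerpt closes by noting that robust representation capacity requires techniques applied during the training phase; a standard example is dropout. A minimal numpy sketch of inverted dropout (names and shapes illustrative; real frameworks provide this as a built-in layer):

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(h, p, training, rng):
    """Inverted dropout: zero units with probability p at train time,
    rescale survivors by 1/(1-p); act as the identity at test time."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones((4, 8))                                  # toy hidden activations
h_train = dropout(h, p=0.5, training=True, rng=rng)  # entries become 0.0 or 2.0
h_eval = dropout(h, p=0.5, training=False, rng=rng)  # unchanged at test time
```

The rescaling keeps the expected activation the same between training and evaluation, which is why the test-time path can simply be the identity.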