2022
DOI: 10.1109/taslp.2022.3171965
ISNet: Individual Standardization Network for Speech Emotion Recognition

Cited by 32 publications (18 citation statements)
References 48 publications
“…This facilitates the capturing of long-distance dependencies within audio signals, thereby improving the extraction of both local and high-level features, consequently enhancing the feature representation capability. This, in turn, enables the model to better understand the emotional implications within speech signals.…”

Table 1 (quoted excerpt): The variation in computational efficiency and performance of TFC-SpeechFormer and Transformer using HCE across different datasets. The symbols "(+)" indicate improvement, while "(-)" indicates a decrease.

IEMOCAP                    WA      UA      WF1
STC [30]                   0.613   0.604   0.617
LSTM-GIN [31]              0.647   0.655   -
CA-MSER [32]               0.698   0.711   -
SpeechFormer [24]          0.629   0.645   -
ISNet [33]                 0.704   0.650   -
DST [22]                   0.718   0.736   -
ShiftSER [34]              0.721   0.727   -
TFC-SpeechFormer (Ours)    0.746   0.751   0.743

DAIC-WOZ                   WA      UA      WF1
FVTC-CNN [35]              0.735   0.656   0.640
Saidi [36]                 0.680   0.680   0.680
EmoAudioNet [37]           0.732   0.649   0.653
Solieman [38]              0.660   0.615   0.610
SIMSIAM-S [39]             0.703   -       -
TOAT [40]                  0.717   0.429   0.480
SpeechFormer [24]          0.686   0.650   0.694
TFC-SpeechFormer (Ours)    0.762   0.701   0.714

Section: Experimental Results and Analysis
Mentioning confidence: 99%
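The comparisons quoted above (and cross-checked in the next excerpt) are reported as weighted accuracy (WA), unweighted accuracy (UA) and weighted F1 (WF1). The sketch below shows how these three metrics are conventionally computed; the helper name `ser_metrics` and the toy labels are illustrative and not taken from the cited papers.

```python
# Minimal sketch of the WA / UA / WF1 metrics used in the quoted comparisons.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

def ser_metrics(y_true, y_pred):
    """Return weighted accuracy (WA), unweighted accuracy (UA), weighted F1 (WF1)."""
    wa = accuracy_score(y_true, y_pred)             # overall (class-frequency-weighted) accuracy
    ua = balanced_accuracy_score(y_true, y_pred)    # mean of per-class recalls
    wf1 = f1_score(y_true, y_pred, average="weighted")
    return wa, ua, wf1

# Toy example with four emotion classes (0..3)
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 0]
print(ser_metrics(y_true, y_pred))
```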
“…Our SpeechFormer++ with hand-crafted features outperforms STC [14] (0.645 vs. 0.613 in WA, 0.658 vs. 0.604 in UA and 0.649 vs. 0.617 in WF1) and achieves comparable results to LSTM-GIN [17] (0.645 vs. 0.647 in WA and 0.658 vs. 0.655 in UA) under the same experimental setup. SpeechFormer++ obtains inferior results compared to ISNet [15]. We suspect this is because ISNet is equipped with a carefully designed individual benchmark to alleviate the problem of interindividual emotion confusion.…”
Section: A. Speech Emotion Recognition on IEMOCAP, 1) Comparison to Tra…
Mentioning confidence: 99%
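The "individual benchmark" referenced in this excerpt motivates a simple illustration: standardizing each speaker's utterance-level features against a speaker-specific reference so that inter-individual differences are reduced. The sketch below is not ISNet's actual architecture; it assumes utterance-level feature vectors, known speaker IDs, and a flag marking (assumed) neutral utterances used as the benchmark.

```python
# Illustrative per-speaker (individual) standardization, NOT the ISNet model itself:
# each speaker's features are referenced against that speaker's neutral "benchmark".
import numpy as np

def individual_standardize(features, speaker_ids, neutral_mask):
    """features: (N, D) utterance-level features; speaker_ids: (N,) speaker labels;
    neutral_mask: (N,) bool, True where an utterance is (assumed) neutral."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    neutral_mask = np.asarray(neutral_mask, dtype=bool)
    out = features.copy()
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        ref = idx & neutral_mask
        # Fall back to the speaker's overall mean if no neutral utterance is available.
        benchmark = features[ref].mean(axis=0) if ref.any() else features[idx].mean(axis=0)
        out[idx] = features[idx] - benchmark
    return out
```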
“…The majority (31/51, 60.8%) of the study items adopted a uni-modal approach with audio as the sole modality. A smaller number of studies used multi-modal approaches that included: (a) only audio and video (2/51, 3.9%) [45][46][47][48][49][50][51], (b) only audio and text (5/51, 9.8%), and (c) audio, video and text (12/51, 23.5%). Only one study (1/51, 1.9%) utilized physiological signals for SERC [52].…”

Section: Characteristics of the Included Studies
Mentioning confidence: 99%
“…ResNet18 [32]) or pre-trained transfer-learned feature extractors (e.g., Wav2vec [54]), here accounting for 25.5% (13/51) of total studies; (c) image transformations, summing to 19.6% (10/51), as yielded by advanced signal-processing methods applied to raw waveforms, such as spectrograms [48,55] or Mel-Frequency Cepstral Coefficients (MFCCs) [47,56]; (d) hybrid approaches, as combinations of two or three of the aforementioned options, here appearing in 25.5% (13/51) of study items. A trend towards deep learning-based approaches in SERC can be observed after 2019, when standalone or hybrid deep learning emerged.…”

Section: Characteristics of the Included Studies
Mentioning confidence: 99%
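As a concrete illustration of the "image transformation" front-ends mentioned in this excerpt, the sketch below extracts a log-mel spectrogram and MFCCs from a raw waveform with librosa. The file path and the frame/filter-bank parameters are placeholder values, not settings taken from the cited studies.

```python
# Minimal sketch of spectrogram/MFCC front-ends for SER (placeholder parameters).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # mono waveform resampled to 16 kHz

# Log-mel spectrogram: 25 ms windows (400 samples), 10 ms hop (160 samples), 64 mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)               # shape: (n_mels, frames), in dB

# 13-dimensional MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, frames)

print(log_mel.shape, mfcc.shape)
```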