Deep learning-based late fusion of multimodal information for emotion classification of music video

Pandeya, Yagya Raj; Lee, Joonwhoan

doi:10.1007/s11042-020-08836-3

Cited by 128 publications

(64 citation statements)

References 50 publications


“…A number of strategies have been proposed to combine the learning from multiple representations [24, 45, 55, 56]. Broadly, the methods can be categorized as early-fusion, mid-fusion, and late-fusion [57, 58, 59, 60]. These refer to the classification stage at which the information is combined, such as combining the inputs to the CNN in early-fusion, combining the weights of the middle layers of the CNN in mid-fusion and combining the CNN outputs in late-fusion.…”

Section: Literature Review (mentioning)

confidence: 99%

Benchmarking Audio Signal Representation Techniques for Classification with Convolutional Neural Networks

Sharan

Xiong

Berkovsky

2021

Sensors


Audio signal classification finds various applications in detecting and monitoring health conditions in healthcare. Convolutional neural networks (CNN) have produced state-of-the-art results in image classification and are being increasingly used in other tasks, including signal classification. However, audio signal classification using CNN presents various challenges. In image classification tasks, raw images of equal dimensions can be used as a direct input to CNN. Raw time-domain signals, on the other hand, can be of varying dimensions. In addition, the temporal signal often has to be transformed to frequency-domain to reveal unique spectral characteristics, therefore requiring signal transformation. In this work, we overview and benchmark various audio signal representation techniques for classification using CNN, including approaches that deal with signals of different lengths and combine multiple representations to improve the classification accuracy. Hence, this work surfaces important empirical evidence that may guide future works deploying CNN for audio signal classification purposes.
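As an illustration of the late-fusion category named in the citation statement above, the following minimal sketch combines the outputs of two separately trained classifiers by averaging their class probabilities. Weighted averaging is only one common late-fusion rule and is an assumption here, not the method of the cited or citing papers. The function names, the logit values, and the audio/video labels are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(per_model_logits, weights=None):
    """Fuse model outputs by a weighted average of their class probabilities."""
    n_models = len(per_model_logits)
    if weights is None:
        # Assumption: equal trust in every model unless weights are given.
        weights = [1.0 / n_models] * n_models
    probs = [softmax(logits) for logits in per_model_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]

# Hypothetical logits from two networks (e.g. audio and video) over three classes.
audio_logits = [2.0, 0.5, 0.1]
video_logits = [0.2, 1.5, 0.3]

fused = late_fusion([audio_logits, video_logits])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Here the networks never share inputs or intermediate features. Only their final outputs are combined, which is what distinguishes late fusion from the early and mid variants described in the statement.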



“…At this time, the music label is essential to the quality of music retrieval. In addition to music retrieval, many recommendation and subscription scenarios also require music category information to provide users with more accurate content [4,5].…”

Section: Introduction (mentioning)

confidence: 99%

Music Feature Extraction and Classification Algorithm Based on Deep Learning

Zhang

2021

Scientific Programming


With the rapid development of information technology and communication, digital music has grown and exploded. Regarding how to quickly and accurately retrieve the music that users want from huge bulk of music repository, music feature extraction and classification are considered as an important part of music information retrieval and have become a research hotspot in recent years. Traditional music classification approaches use a large number of artificially designed acoustic features. The design of features requires knowledge and in-depth understanding in the domain of music. The features of different classification tasks are often not universal and comprehensive. The existing approach has two shortcomings as follows: ensuring the validity and accuracy of features by manually extracting features and the traditional machine learning classification approaches not performing well on multiclassification problems and not having the ability to be trained on large-scale data. Therefore, this paper converts the audio signal of music into a sound spectrum as a unified representation, avoiding the problem of manual feature selection. According to the characteristics of the sound spectrum, the research has combined 1D convolution, gating mechanism, residual connection, and attention mechanism and proposed a music feature extraction and classification model based on convolutional neural network, which can extract more relevant sound spectrum characteristics of the music category. Finally, this paper designs comparison and ablation experiments. The experimental results show that this approach is better than traditional manual models and machine learning-based approaches.


“…This article seeks to enhance and improve a supervised music video dataset [16]. The dataset includes diversified music video samples in six emotional categories and is used in various unimodal and multimodal architectures to analyze music, video, and facial expressions.…”

Section: Introduction (mentioning)

confidence: 99%

“…We conducted an ablation study on unimodal and multimodal architectures from scratch by using a variety of convolution filters. The major contributions of this study are listed below: We extended and improved an existing music video dataset [16] and provided emotional annotation by using multiple annotators of diversified cultures. A detailed description of the dataset and statistical information is provided in Section 3.…”

Section: Introduction (mentioning)

confidence: 99%


Deep-Learning-Based Multimodal Emotion Classification for Music Videos

Pandeya

Bhattarai

Lee

2021

Sensors

Self Cite


Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926.
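To give a rough sense of why factorizing a convolution can reduce cost, as the abstract above reports, the sketch below compares weight counts for a standard 2D convolution and a depthwise separable one. The depthwise separable form is one common factorization and is assumed here for illustration. The abstract does not specify the exact layer design, and the kernel size and channel counts are hypothetical. Bias terms are ignored.

```python
def standard_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a standard 2D convolution (no bias)."""
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable 2D convolution (no bias)."""
    # Depthwise step: one k x k spatial filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv2d_params(3, 64, 128)    # 73728 weights
separable = separable_conv2d_params(3, 64, 128)  # 8768 weights
reduction = standard / separable                 # about 8.4x fewer weights
```

For this hypothetical layer the factorized form needs roughly one eighth of the weights. The saving grows with the kernel size and the number of output channels.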
