Given an image (or video clip, or audio song), how do we automatically assign keywords to it? The general problem is to find correlations across the media in a collection of multimedia objects such as video clips, with colors, motion, audio, and/or text scripts. We propose a novel graph-based approach, "MMG", to discover such cross-modal correlations. Our "MMG" method requires no tuning, no clustering, and no user-determined constants; it can be applied to any multimedia collection, as long as we have a similarity function for each medium; and it scales linearly with the database size. We report auto-captioning experiments on the "standard" Corel image database (680 MB), where it outperforms domain-specific, fine-tuned methods by up to 10 percentage points in captioning accuracy (a 50% relative improvement).
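The abstract describes MMG only at a high level. As a hedged illustration of how a graph-based captioning scheme of this kind might look, the sketch below builds a mixed-media graph (image nodes linked to their caption words and to their most similar images) and ranks candidate keywords for an uncaptioned image by a simple random-walk-with-restart score. The graph construction, the `similarity` function, the `restart` probability, and the use of a personalized PageRank as the proximity measure are all assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch of graph-based auto-captioning (not the paper's exact MMG algorithm).
import networkx as nx

def build_graph(captioned_images, similarity, k=3):
    """Link each image to its caption words and to its k most similar images.

    captioned_images: dict {image_id: (feature_vector, [caption words])}
    similarity: user-supplied similarity function for the image medium (assumption).
    """
    G = nx.Graph()
    ids = list(captioned_images)
    for img, (_, words) in captioned_images.items():
        for w in words:
            G.add_edge(("img", img), ("word", w))           # image-word edges
    for img in ids:                                         # image-image similarity edges
        feats = captioned_images[img][0]
        nearest = sorted((j for j in ids if j != img),
                         key=lambda j: -similarity(feats, captioned_images[j][0]))[:k]
        for j in nearest:
            G.add_edge(("img", img), ("img", j))
    return G

def caption(G, query_node, n_words=5, restart=0.65):
    """Rank word nodes by random-walk-with-restart proximity to the query image node.

    Assumes the uncaptioned query image was already added to G with similarity edges.
    """
    scores = nx.pagerank(G, alpha=1 - restart, personalization={query_node: 1.0})
    words = [(n[1], s) for n, s in scores.items() if n[0] == "word"]
    return [w for w, _ in sorted(words, key=lambda x: -x[1])[:n_words]]
```

The only medium-specific ingredient in this sketch is the similarity function used to wire image-image edges, which mirrors the abstract's claim that a per-medium similarity function is the sole requirement.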
This study was carried out to recover valuable metals from the printed circuit boards (PCBs) of waste computers. PCB samples were crushed to smaller than 1 mm by a shredder and initially separated into 30% conducting and 70% nonconducting materials by an electrostatic separator. The conducting materials, which contained the valuable metals, were then used as the feed material for magnetic separation, where it was found that 42% of the conducting materials were magnetic and 58% were nonmagnetic. Leaching of the nonmagnetic component using 2 M H₂SO₄ and 0.2 M H₂O₂ at 85°C for 12 hr resulted in greater than 95% extraction of Cu, Fe, Zn, Ni, and Al. Au and Ag were extracted at 40°C with a leaching solution of 0.2 M (NH₄)₂S₂O₃, 0.02 M CuSO₄, and 0.4 M NH₄OH, which resulted in recovery of more than 95% of the Au within 48 hr and 100% of the Ag within 24 hr. The residues were next reacted with a 2 M NaCl solution to leach out Pb, which took place within 2 hr at room temperature.
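As a quick illustration of the mass balance implied by the separation steps above, the short calculation below propagates the reported percentages to estimate what fraction of the crushed PCB feed reaches the acid-leaching stage. The percentages are taken from the abstract; the assumption that each figure is a mass fraction of the immediately preceding stream is mine.

```python
# Mass-balance sketch for the reported separation steps (fractions assumed by mass).
feed = 100.0                              # arbitrary 100 mass units of crushed PCB (< 1 mm)

conducting    = feed * 0.30               # electrostatic separation: 30% conducting
nonconducting = feed * 0.70               # 70% nonconducting (not processed further here)

magnetic    = conducting * 0.42           # magnetic separation of the conducting fraction
nonmagnetic = conducting * 0.58           # feed to the H2SO4/H2O2 leach of Cu, Fe, Zn, Ni, Al

print(f"conducting stream:        {conducting:.1f}")   # 30.0
print(f"nonmagnetic leach feed:   {nonmagnetic:.1f}")  # 17.4 -> roughly 17% of the original feed
```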
Facial Expression Recognition (FER) is a challenging task that improves natural human-computer interaction. This paper focuses on automatic FER from a single in-the-wild (ITW) image. ITW images suffer from real-world problems of pose variation, facial orientation, and low input resolution. In this study, we propose a pyramid with super-resolution (PSR) network architecture to solve the ITW FER task. We also introduce a prior distribution label smoothing (PDLS) loss function that incorporates prior knowledge of the confusion among expressions in the FER task. Experiments on the three most popular ITW FER datasets showed that our approach outperforms all the state-of-the-art methods.
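The PDLS loss is only described informally here. The snippet below is a minimal sketch of one way a prior-distribution label-smoothing loss could be written: instead of mixing the one-hot target with a uniform distribution (standard label smoothing), it mixes in a per-class prior confusion distribution. The mixing weight `epsilon`, the `prior_confusion` matrix, and the exact formulation are assumptions for illustration, not the authors' definition.

```python
# Hypothetical prior-distribution label-smoothing (PDLS-style) loss, PyTorch sketch.
import torch
import torch.nn.functional as F

def pdls_loss(logits, targets, prior_confusion, epsilon=0.1):
    """Cross-entropy against targets smoothed by a per-class prior confusion distribution.

    logits:          (batch, num_classes) raw network outputs
    targets:         (batch,) integer expression labels
    prior_confusion: (num_classes, num_classes) tensor whose rows sum to 1; row i is the
                     assumed prior distribution of expressions confused with class i
    """
    num_classes = logits.size(1)
    one_hot = F.one_hot(targets, num_classes).float()
    smoothed = (1.0 - epsilon) * one_hot + epsilon * prior_confusion[targets]
    log_probs = F.log_softmax(logits, dim=1)
    return -(smoothed * log_probs).sum(dim=1).mean()
```

Setting `prior_confusion` to a uniform matrix recovers ordinary label smoothing, which makes the role of the prior explicit in this sketch.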
Speech emotion recognition is a challenging but important task in human-computer interaction (HCI). As technology and the understanding of emotion progress, it is necessary to design robust and reliable emotion recognition systems that are suitable for real-world applications, both to enhance analytical abilities supporting human decision making and to design human-machine interfaces (HMI) that assist efficient communication. This paper presents a multimodal approach for speech emotion recognition based on a Multi-Level Multi-Head Fusion Attention mechanism and recurrent neural networks (RNNs). The proposed structure takes inputs from two modalities: audio and text. For audio features, we extract mel-frequency cepstral coefficients (MFCCs) from the raw signals using the openSMILE toolkit. Further, we use a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to embed the text information. These features are fed in parallel into self-attention-based RNNs to exploit the context at each timestamp; we then fuse all representations using a multi-head attention technique to predict emotional states. Our experimental results on three databases, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, the Multimodal EmotionLines Dataset (MELD), and the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, reveal that the combination of the two modalities achieves better performance than either single-modality model. Quantitative and qualitative evaluations on all introduced datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
INDEX TERMS: Speech emotion recognition, multi-level multi-head fusion attention, RNN, audio features, textual features.
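The fusion pipeline described above can be made concrete with a simplified sketch: per-modality recurrent encoders with self-attention over timestamps, followed by multi-head attention that fuses the audio and text representations before classification. The layer sizes, the single fusion level, the feature dimensions (39-dimensional MFCCs, 768-dimensional BERT embeddings), and the use of `nn.MultiheadAttention` are simplifying assumptions; the paper's multi-level multi-head scheme is more elaborate.

```python
# Simplified two-modality fusion sketch (not the paper's exact multi-level architecture).
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    def __init__(self, mfcc_dim=39, text_dim=768, hidden=128, heads=4, n_emotions=4):
        super().__init__()
        self.audio_rnn  = nn.GRU(mfcc_dim, hidden, batch_first=True, bidirectional=True)
        self.text_rnn   = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)  # self-attention
        self.text_attn  = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fuse_attn  = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)  # cross-modal fusion
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, mfcc, bert_emb):
        a, _ = self.audio_rnn(mfcc)         # (B, Ta, 2*hidden) contextualised audio frames
        t, _ = self.text_rnn(bert_emb)      # (B, Tt, 2*hidden) contextualised token embeddings
        a, _ = self.audio_attn(a, a, a)     # per-modality self-attention over timestamps
        t, _ = self.text_attn(t, t, t)
        fused, _ = self.fuse_attn(a, t, t)  # audio queries attend over text keys/values
        return self.classifier(fused.mean(dim=1))  # pooled utterance-level emotion logits
```

In this sketch the cross-modal step uses audio as the query and text as key/value; the reverse direction, or a multi-level stack of such fusions as the abstract suggests, would follow the same pattern.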