Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, where the aim is to learn joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is used as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures for learning the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further suggest an end-to-end architecture that simultaneously trains convolutional layers and fully-connected layers to better learn temporal structures of music audio. In particular, our end-to-end deep architecture has two properties: it simultaneously implements feature learning and cross-modal correlation learning, and it learns a joint representation by considering temporal structures. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
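To make the canonical-correlation objective mentioned in this abstract concrete, the following is a minimal sketch of plain linear CCA on two fixed feature matrices. It is not the authors' deep architecture: in the described system the two inputs would be the outputs of the audio and lyrics branches, whereas here the function name `canonical_correlations`, the helper `inv_sqrt`, and the regularization value `reg` are all illustrative assumptions.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * (1.0 / np.sqrt(eigvals))) @ eigvecs.T


def canonical_correlations(x, y, reg=1e-6):
    """Canonical correlations between two views with paired rows.

    x: (n, d1) features of one modality (e.g. audio), hypothetical input.
    y: (n, d2) features of the other modality (e.g. lyrics), hypothetical input.
    reg: small ridge term (an assumption) that keeps the covariances invertible.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)
    # Singular values of the whitened cross-covariance are the correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False)
```

A CCA-style training objective would typically maximize the sum of these correlations; how the paper combines them with temporal structure is not specified in the abstract.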
In this work, travel destinations and business locations are taken as venues. Discovering a venue from a photograph is very important for visual context-aware applications. Unfortunately, few efforts have paid attention to complicated real images such as venue photographs generated by users. Our goal is fine-grained venue discovery from heterogeneous social multimodal data. To this end, we propose a novel deep learning model, category-based deep canonical correlation analysis. Given a photograph as input, this model performs: 1) exact venue search (find the venue where the photograph was taken) and 2) group venue search (find relevant venues that have the same category as the photograph), using the cross-modal correlation between the input photograph and the textual descriptions of venues. In this model, data in different modalities are projected to the same space via deep networks. Pairwise correlation (between different modality data from the same venue) for exact venue search and category-based correlation (between different modality data from different venues with the same category) for group venue search are jointly optimized. Because a single photograph cannot fully reflect the rich text description of a venue, the number of photographs per venue in the training phase is increased to capture more aspects of a venue. We build a new venue-aware multimodal data set by integrating Wikipedia featured articles and Foursquare venue photographs. Experimental results on this data set confirm the feasibility of the proposed method. Moreover, an evaluation on another publicly available data set confirms that the proposed method outperforms state-of-the-art methods for cross-modal retrieval between image and text.
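Once photographs and venue descriptions live in the same space, retrieval reduces to ranking. The sketch below illustrates only that final step, assuming embeddings have already been produced by some trained projection; the function name `rank_venues` and the use of cosine similarity are assumptions, since the abstract does not state the similarity measure.

```python
import numpy as np


def rank_venues(photo_vec, venue_vecs):
    """Rank venues by cosine similarity to a photograph embedding.

    photo_vec: (d,) embedding of the query photograph (hypothetical input).
    venue_vecs: (n, d) embeddings of venue text descriptions (hypothetical input).
    Returns venue indices ordered from most to least similar.
    """
    p = photo_vec / np.linalg.norm(photo_vec)
    v = venue_vecs / np.linalg.norm(venue_vecs, axis=1, keepdims=True)
    return np.argsort(-(v @ p))
```

Under this reading, exact venue search would take the first index of the ranking, while group venue search would inspect the top few results and check whether their categories match.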
A fast technique for automatic 3-D shape measurement is proposed and verified by experiments. The technique, based on the principle of phase measurement of the deformed grating pattern which carries the 3-D information of the measured object, can automatically and accurately obtain the phase map or the height information of a measured object at every pixel point without assigning fringe orders and interpreting data in the regions between the fringe orders. Only one image pattern is sufficient for obtaining the phase map. In contrast to the fast Fourier transform based technique, the technique processes a fringe pattern in the real-signal domain instead of the frequency domain by using demodulation and convolution techniques, can process an arbitrary number of pixel points, and is much faster. Theoretical analysis, simulation results, and experimental results are presented.
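The idea of recovering phase in the real-signal domain by demodulation and convolution can be sketched in one dimension. The code below is a generic synchronous-demodulation example, not the authors' exact procedure: it assumes a fringe of the form a + b·cos(2πx/P + φ(x)) with a known integer carrier period P in pixels, and it uses a one-period moving average as the low-pass filter. The function name `demodulate_phase` and these modelling choices are assumptions.

```python
import numpy as np


def demodulate_phase(fringe, period):
    """Estimate the phase of a 1-D fringe signal without an FFT.

    fringe: 1-D array modelled as a + b*cos(2*pi*x/period + phi(x)).
    period: carrier period in pixels (integer, assumed known).
    Returns the wrapped phase estimate phi(x) in (-pi, pi].
    """
    x = np.arange(fringe.size)
    carrier = 2.0 * np.pi * x / period
    # A one-period moving average suppresses the carrier and its
    # second harmonic, leaving the slowly varying phase terms.
    kernel = np.ones(period) / period
    c = np.convolve(fringe * np.cos(carrier), kernel, mode="same")
    s = np.convolve(fringe * np.sin(carrier), kernel, mode="same")
    # After filtering: c ~ (b/2)*cos(phi), s ~ -(b/2)*sin(phi).
    return np.arctan2(-s, c)
```

The estimate is unreliable within roughly one period of each border, where the moving average runs off the signal. Converting phase to height requires the geometry of the measurement setup, which the abstract does not give.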
Face hallucination is a technique that reconstructs high-resolution (HR) faces from low-resolution (LR) faces by using prior knowledge learned from HR/LR face pairs. Most state-of-the-art methods leverage position-patch prior knowledge of the human face to estimate the optimal representation coefficients for each image patch. However, they focus only on the position information and usually ignore the context information of the image patch. In addition, when they are confronted with misalignment or the Small Sample Size (SSS) problem, the hallucination performance is very poor. To this end, this study incorporates the contextual information of the image patch and proposes a powerful and efficient context-patch based face hallucination approach, namely Thresholding Locality-constrained Representation and Reproducing learning (TLcR-RL). Under the context-patch based framework, we advance a thresholding based representation method to enhance the reconstruction accuracy and reduce the computational complexity. To further improve the performance of the proposed algorithm, we propose a promotion strategy called reproducing learning: the estimated HR face is added to the training set, which simulates the case in which the HR version of the input LR face is present in the training set, and the final hallucination result is thus iteratively enhanced. Experiments demonstrate that the proposed TLcR-RL method achieves a substantial improvement in the hallucinated results, both subjectively and objectively. Additionally, the proposed framework is more robust to face misalignment and the SSS problem, and its hallucinated HR face is still very good when the LR test face comes from the real world. The MATLAB source code is available at https://github.com/junjun-jiang/TLcR-RL.
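The abstract does not spell out the thresholding based representation, so the following is only one plausible reading, written as a sketch: keep the k training patches nearest to the input LR patch, solve a locality-constrained least-squares problem with sum-to-one weights over them, and apply the same weights to the paired HR patches. The function name `tlcr_patch` and the parameters `k`, `tau`, and `eps` are hypothetical and are not taken from the released MATLAB code.

```python
import numpy as np


def tlcr_patch(lr_patch, lr_train, hr_train, k=5, tau=0.04, eps=1e-8):
    """Reconstruct one HR patch from its k nearest LR training patches.

    lr_patch: (d,) vectorized LR input patch.
    lr_train: (n, d) LR training patches; hr_train: (n, D) paired HR patches.
    k, tau, eps: neighbour count, locality weight, ridge term (all assumed).
    """
    dist = np.linalg.norm(lr_train - lr_patch, axis=1)
    idx = np.argsort(dist)[:k]  # thresholding: keep only the k nearest atoms
    diff = lr_train[idx] - lr_patch
    gram = diff @ diff.T
    # Minimize ||x - sum_i w_i y_i||^2 + tau * sum_i (d_i * w_i)^2
    # subject to sum_i w_i = 1; the solution is proportional to M^{-1} 1.
    system = gram + tau * np.diag(dist[idx] ** 2) + eps * np.eye(len(idx))
    weights = np.linalg.solve(system, np.ones(len(idx)))
    weights /= weights.sum()
    return weights @ hr_train[idx]
```

Restricting the solve to k atoms is what keeps the linear system small. Reproducing learning, as described above, would add the resulting HR estimate (with its LR counterpart) to the training set and repeat the reconstruction.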