In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single pair of projections, by which the original features of images and text are projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their respective performances rather than to the best performance on each. Different from previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two pairs of projections are learned for the two retrieval tasks instead of a single pair. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two pairs of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art performance on the Wikipedia dataset.
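To make the task-specific objective concrete, the following is a minimal NumPy sketch of the kind of joint optimization described above, for the I2T direction only. All symbols and hyperparameters here (X, T, Y, U, V, lam, reg, eta) are illustrative assumptions, not the authors' exact formulation: X holds image features (e.g., 4,096-d CNN), T holds text features (e.g., 100-d LDA), and Y holds one-hot semantic labels. The loss couples a correlation term ||XU - TV||^2 with a regression term from the image space to the semantic space, ||XU - Y||^2.

```python
import numpy as np

def learn_i2t_projections(X, T, Y, lam=0.5, reg=1e-3, eta=1e-4, iters=500):
    """Learn projections U (images) and V (text) for the I2T task by plain
    gradient descent on a hypothetical MDCR-style objective:
        lam * ||X U - T V||_F^2 + (1 - lam) * ||X U - Y||_F^2
        + reg * (||U||_F^2 + ||V||_F^2)
    X: n x dx image features, T: n x dt text features, Y: n x c labels."""
    dx, dt, c = X.shape[1], T.shape[1], Y.shape[1]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.01, size=(dx, c))
    V = rng.normal(scale=0.01, size=(dt, c))
    for _ in range(iters):
        diff_corr = X @ U - T @ V      # image-text correlation residual
        diff_reg = X @ U - Y           # image-to-semantic regression residual
        grad_U = 2 * lam * X.T @ diff_corr + 2 * (1 - lam) * X.T @ diff_reg + 2 * reg * U
        grad_V = -2 * lam * T.T @ diff_corr + 2 * reg * V
        U -= eta * grad_U
        V -= eta * grad_V
    # At retrieval time, rank text items by similarity between X @ U and T @ V.
    return U, V
```

A symmetric objective with the regression term placed on the text side would give the second pair of projections used for T2I, which is what makes the model modality-dependent.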
Over the past few years, the dominant imaging devices have shifted from digital cameras to smartphone cameras. With the popularity of mobile Internet applications, massive numbers of digital images and videos are captured by smartphones, which nearly every person now carries. Consequently, the capture source of an image or video provides valuable identity information for criminal investigations and serves as critical forensic evidence, so it is important to address source identification for smartphone images and videos. In this paper, we build the Daxing smartphone identification dataset, which collects images and videos from a wide range of smartphones of different brands, models, and devices. Specifically, the dataset includes 43,400 images and 1,400 videos captured by 90 smartphones spanning 22 models and 5 brands. For example, the dataset contains 23 individual devices of the iPhone 6S (Plus) model. To the best of our knowledge, the Daxing dataset uses the largest number of smartphones for image/video source identification among related datasets, as well as the highest number of devices per model and of captured images/videos. The dataset has been released free and open source for scientific researchers and criminal investigators.