Scene graph generation has received growing attention alongside advances in image understanding tasks such as object detection and attribute and relationship prediction. However, existing datasets are biased in their object and relationship labels, and often come with noisy or missing annotations, which makes developing a reliable scene graph prediction model challenging. In this paper, we propose a novel scene graph generation algorithm that uses external knowledge and an image reconstruction loss to overcome these dataset issues. In particular, we extract commonsense knowledge from an external knowledge base to refine object and phrase features, improving generalizability in scene graph generation. To address the bias introduced by noisy object annotations, we add an auxiliary image reconstruction path that regularizes the scene graph generation network. Extensive experiments show that our framework generates better scene graphs, achieving state-of-the-art performance on two benchmark datasets: Visual Relationship Detection and Visual Genome.
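To make the joint objective concrete, below is a minimal sketch, not the authors' code, of how a scene graph prediction loss can be paired with an auxiliary image reconstruction loss; all module names, dimensions, and the weighting factor `alpha` are hypothetical placeholders.

```python
# Sketch only: joint scene graph + reconstruction objective (hypothetical names).
import torch
import torch.nn as nn

class SceneGraphWithReconstruction(nn.Module):
    def __init__(self, feat_dim=512, num_classes=151, num_predicates=51):
        super().__init__()
        self.obj_classifier = nn.Linear(feat_dim, num_classes)
        self.rel_classifier = nn.Linear(2 * feat_dim, num_predicates)
        # Auxiliary decoder that regenerates a coarse image from object
        # features, regularizing against noisy/missing object annotations.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 32 * 32),  # coarse 32x32 RGB reconstruction
        )

    def forward(self, obj_feats, pair_feats):
        obj_logits = self.obj_classifier(obj_feats)
        rel_logits = self.rel_classifier(pair_feats)
        recon = self.decoder(obj_feats.mean(dim=0, keepdim=True))
        return obj_logits, rel_logits, recon.view(-1, 3, 32, 32)

def joint_loss(obj_logits, rel_logits, recon, obj_labels, rel_labels,
               image, alpha=0.1):
    ce = nn.CrossEntropyLoss()
    sg_loss = ce(obj_logits, obj_labels) + ce(rel_logits, rel_labels)
    recon_loss = nn.functional.mse_loss(recon, image)
    # Reconstruction term weighted by alpha acts as the regularizer.
    return sg_loss + alpha * recon_loss
```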
Most current image captioning models rely heavily on paired image-caption datasets, but collecting large-scale paired data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality alone. To align scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps scene graph features from the image modality to the sentence modality. Experimental results show that the proposed model generates promising captions without using any image-caption training pairs, outperforming existing methods by a wide margin.
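As a rough illustration of unsupervised feature alignment, here is a sketch, with hypothetical names and dimensions, of an adversarial mapper that pulls image scene-graph features toward the sentence scene-graph feature distribution; the alignment method actually used in the paper may differ.

```python
# Sketch only: adversarial alignment of image scene-graph features to the
# sentence scene-graph feature space (all names hypothetical).
import torch
import torch.nn as nn

feat_dim = 256
mapper = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                       nn.Linear(feat_dim, feat_dim))
discriminator = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def alignment_step(img_sg_feats, sent_sg_feats, opt_m, opt_d):
    # 1) Train the discriminator to separate real sentence features
    #    from mapped image features.
    mapped = mapper(img_sg_feats).detach()
    d_loss = (bce(discriminator(sent_sg_feats),
                  torch.ones(len(sent_sg_feats), 1)) +
              bce(discriminator(mapped),
                  torch.zeros(len(mapped), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Train the mapper to fool the discriminator, moving image features
    #    toward the sentence scene-graph feature distribution.
    mapped = mapper(img_sg_feats)
    m_loss = bce(discriminator(mapped), torch.ones(len(mapped), 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()
    return d_loss.item(), m_loss.item()
```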
Identifying different types of outliers with abnormal behaviors in the multi-view setting is challenging because data distributions can vary widely across views. Conventional approaches learn a new latent feature representation under pairwise constraints between views. In this paper, we argue that such methods are expensive to generalize from two views to three or more, both in the number of introduced variables and in detection performance. To address this, we propose a novel multi-view outlier detection method with consensus regularization on the latent representations. Specifically, we explicitly characterize each kind of outlier through intrinsic cluster assignment labels and sample-specific errors. Moreover, we provide a thorough discussion of the proposed consensus regularization in relation to pairwise regularization. Correspondingly, we propose and derive in detail an optimization solution based on the augmented Lagrangian multiplier method. In the experiments, we evaluate our method on five well-known machine learning datasets under different outlier settings. Further, to show its effectiveness in real-world computer vision scenarios, we tailor the proposed model to saliency detection and face reconstruction applications. Extensive results on both the standard multi-view outlier detection task and the extended computer vision tasks demonstrate the effectiveness of the proposed method.
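A toy numerical sketch of the consensus idea follows, under the simplifying (and hypothetical) assumptions that all views share the same feature dimension and that the objective is a smooth least-squares relaxation: each view is split into a latent part tied to a single shared consensus representation, replacing pairwise constraints between every view pair, plus a sample-specific error whose magnitude scores outliers.

```python
# Sketch only: consensus-regularized decomposition X_v ≈ Z_v + E_v with a
# shared consensus Z* (hypothetical simplified formulation).
import numpy as np

def consensus_outlier_scores(views, n_iter=200, lam=1.0, lr=0.01, rng=None):
    # `views` is a list of (n_samples, d) arrays; for simplicity we assume
    # every view shares the same feature dimension d.
    rng = rng or np.random.default_rng(0)
    Z = [rng.standard_normal(X.shape) * 0.01 for X in views]
    Z_star = np.zeros_like(views[0])  # shared consensus representation
    for _ in range(n_iter):
        # Gradient step on 0.5*||X_v - Z_v||^2 + 0.5*lam*||Z_v - Z*||^2.
        for v, X in enumerate(views):
            grad = (Z[v] - X) + lam * (Z[v] - Z_star)
            Z[v] -= lr * grad
        Z_star = np.mean(Z, axis=0)  # closed-form consensus update
    # Sample-specific errors; rows with large norms across views are
    # candidate outliers.
    E = [X - Zv for X, Zv in zip(views, Z)]
    return np.sum([np.linalg.norm(e, axis=1) for e in E], axis=0)
```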