In this paper, we study the problem of recognizing attribute-object pairs that do not appear in the training dataset, known as unseen attribute-object pair recognition. Existing methods mainly learn a discriminative classifier or compose multiple classifiers to tackle this problem, and they perform poorly on unseen pairs. The key reasons for this failure are that 1) they do not learn an intrinsic attribute-object representation, and 2) the attribute and object are processed either separately or equally, so the inner relation between them is left unexplored. To explore this inner relation as well as an intrinsic attribute-object representation, we propose a generative model with an encoder-decoder mechanism that bridges visual and linguistic information in a unified end-to-end network. The encoder-decoder mechanism shows strong potential for finding an intrinsic attribute-object feature representation, and combining visual and linguistic features in a unified model makes it possible to mine the relation between attribute and object. We conducted extensive experiments comparing our method with several state-of-the-art methods on two challenging datasets. The results show that our method outperforms all of them.
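As an illustration of the encoder-decoder idea described above, the following is a minimal sketch in PyTorch. The module names, layer sizes, reconstruction branch, and pair-scoring scheme are assumptions chosen for clarity, not the authors' implementation.

```python
# Minimal sketch: encode a visual feature into an intrinsic attribute-object
# representation, decode it into the linguistic (word-embedding) space, and
# score all attribute-object pairs, including unseen ones. Dimensions and
# module names are illustrative assumptions.
import torch
import torch.nn as nn

class AttrObjEncoderDecoder(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Encoder: compress the visual feature into a shared latent code.
        self.encoder = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder heads: project the latent code into the word-embedding
        # space, one head for the attribute and one for the object, so the
        # two are learned jointly from the same representation.
        self.attr_head = nn.Linear(hidden_dim, embed_dim)
        self.obj_head = nn.Linear(hidden_dim, embed_dim)
        # Reconstruction branch back to the visual space keeps the latent
        # code faithful to the image (the generative part of the sketch).
        self.recon_head = nn.Linear(hidden_dim, visual_dim)

    def forward(self, visual_feat):
        z = self.encoder(visual_feat)
        return self.attr_head(z), self.obj_head(z), self.recon_head(z)

def pair_scores(attr_pred, obj_pred, attr_embeds, obj_embeds):
    """Score every attribute-object pair by similarity between the decoded
    representations and dictionary word embeddings."""
    a = attr_pred @ attr_embeds.t()          # (B, num_attrs)
    o = obj_pred @ obj_embeds.t()            # (B, num_objs)
    return a.unsqueeze(2) + o.unsqueeze(1)   # (B, num_attrs, num_objs)
```

Because unseen pairs are scored through word embeddings rather than dedicated classifiers, a pair never observed at training time can still receive a high score if its attribute and object embeddings are both well matched.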
Lane boundary detection technology has progressed rapidly over the past few decades. However, many challenges that often make lane detection unavailable remain to be solved. In this paper, we propose a spatial-temporal knowledge filtering model to detect lane boundaries in videos. To address the challenges of structure variation, heavy noise, and complex illumination, the model combines prior spatial-temporal knowledge with lane appearance features to jointly identify lane boundaries. The model first extracts line segments in video frames. Two novel filters, the Crossing Point Filter (CPF) and the Structure Triangle Filter (STF), are proposed to filter out noisy line segments. These filters introduce spatial structure constraints and temporal location constraints into lane detection, which represent the spatial-temporal knowledge about lanes. A straight-line or curve model selected by a state machine is then fitted to the remaining segments to output the final lane boundaries. We collected a challenging realistic traffic scene dataset. Experimental results on this dataset and on another standard dataset demonstrate the strength of our method, which has been successfully applied to our experimental autonomous vehicle.
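To make the filtering idea concrete, the sketch below shows segment extraction and a crossing-point style spatial constraint in Python with OpenCV and NumPy. The detector settings, the distance threshold, and the assumption that a vanishing point is tracked across frames are illustrative; they stand in for, but do not reproduce, the paper's CPF/STF formulation.

```python
# Illustrative sketch: extract line segments and keep only those consistent
# with a crossing (vanishing) point tracked from previous frames. Thresholds
# and helper names are assumptions, not the paper's exact design.
import numpy as np
import cv2

def extract_segments(gray):
    """Detect candidate line segments (Canny edges + probabilistic Hough)."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                            minLineLength=30, maxLineGap=5)
    return lines.reshape(-1, 4) if lines is not None else np.empty((0, 4))

def crossing_point_filter(segments, vanish_pt, max_dist=20.0):
    """Spatial constraint: keep segments whose supporting line passes near
    the crossing point estimated from previous frames."""
    kept = []
    vx, vy = vanish_pt
    for x1, y1, x2, y2 in segments:
        # Point-to-line distance from the crossing point to the infinite
        # line through the segment endpoints.
        num = abs((y2 - y1) * vx - (x2 - x1) * vy + x2 * y1 - y2 * x1)
        den = np.hypot(y2 - y1, x2 - x1) + 1e-6
        if num / den < max_dist:
            kept.append((x1, y1, x2, y2))
    return np.array(kept)
```

In this simplified view, the surviving segments would then be handed to a line or curve fitting stage; the temporal location constraint (the STF role) would further reject segments that jump away from the lane position found in earlier frames.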
Encouraging progress has been made on Visual Question Answering (VQA) in recent years, but it remains challenging to enable VQA models to generalize adaptively to out-of-distribution (OOD) samples. Intuitively, recomposing existing visual concepts (i.e., attributes and objects) can produce compositions unseen in the training set, which should help VQA models generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle it implicitly. X-GGM uses graph generative modeling to iteratively generate a relation matrix and node representations for a predefined graph whose nodes are attribute-object pairs. Furthermore, to alleviate unstable training in graph generative modeling, we propose a gradient distribution consistency loss that constrains the data distribution under adversarial perturbations and the generated distribution to remain consistent. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of the X-GGM components.
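The sketch below illustrates one graph generation step over attribute-object nodes and a simplified consistency term in PyTorch. The single-step update, the layer sizes, and the symmetric-KL stand-in for the gradient distribution consistency loss are assumptions for illustration only; X-GGM itself is iterative and uses a gradient-based formulation.

```python
# Minimal sketch: generate a relation matrix from attribute-object node
# features, propagate one message-passing step, and compute a simplified
# consistency term between two predicted answer distributions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphGenStep(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.rel_proj = nn.Linear(dim, dim)    # scores pairwise relations
        self.node_update = nn.Linear(dim, dim)

    def forward(self, nodes):
        # nodes: (N, dim) representations of attribute-object pair nodes.
        # 1) Generate a relation (adjacency) matrix from node affinities.
        relation = torch.softmax(nodes @ self.rel_proj(nodes).t(), dim=-1)
        # 2) Propagate information along the generated relations to obtain
        #    updated node representations.
        nodes = F.relu(self.node_update(relation @ nodes))
        return relation, nodes

def distribution_consistency(logits_perturbed, logits_generated):
    """Simplified consistency term: symmetric KL between the answer
    distribution under adversarial perturbation and the distribution from
    the generated graph (a stand-in for the gradient-based loss)."""
    p = F.log_softmax(logits_perturbed, dim=-1)
    q = F.log_softmax(logits_generated, dim=-1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))
```

In training, such a step would be repeated so that the relation matrix and node representations refine each other, while the consistency term keeps the perturbed and generated distributions from drifting apart.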