Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions

Baier, Stephan; Ma, Yunpu; Tresp, Volker

doi:10.1007/978-3-319-68288-4_4

Cited by 41 publications

(49 citation statements)

References 35 publications


“…These scores are combined with a language prior score (based on word embeddings) that models the semantics of the visual relationships. The method in [2] also combines visual and semantic information. However, link prediction methods (RESCAL, MultiwayNN, ComplEx, DistMult) are used for modelling the visual relationship semantics in place of word embeddings.…”

Section: Methods (mentioning)

confidence: 99%
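
As a rough illustration of what a link prediction model scores, the sketch below implements the standard DistMult trilinear score for one (subject, relation, object) triple. It is not the cited paper's implementation: the function name `distmult_score` and the toy embedding values are assumptions made for this example.

```python
from typing import Sequence


def distmult_score(
    subj: Sequence[float], rel: Sequence[float], obj: Sequence[float]
) -> float:
    """DistMult score of a triple: sum over k of subj[k] * rel[k] * obj[k].

    A higher score means the model considers the triple more plausible.
    """
    if not (len(subj) == len(rel) == len(obj)):
        raise ValueError("all three embeddings must have the same dimension")
    return sum(s * r * o for s, r, o in zip(subj, rel, obj))


# Toy 2-dimensional embeddings (hypothetical values, not learned).
person = [1.0, 2.0]
rides = [3.0, 4.0]
horse = [5.0, 6.0]
score = distmult_score(person, rides, horse)
```

Because the score is an elementwise product, swapping subject and object gives the same value, so DistMult cannot by itself distinguish the direction of a relation.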

“…The visual knowledge consists in the features of the union of the subject and object bounding boxes. In [2] the background knowledge is statistical information (learnt with statistical link prediction methods [21]) about the training set triples. Contextual information between objects is used also in [23], [35] with different learning methods.…”

Section: Related Work (mentioning)

confidence: 99%

Compensating Supervision Incompleteness with Prior Knowledge in Semantic Image Interpretation

Donadello

Serafini

2019

2019 International Joint Conference on Neural Networks (IJCNN)

Semantic Image Interpretation is the task of extracting a structured semantic description from images. This requires the detection of visual relationships: triples ⟨subject, relation, object⟩ describing a semantic relation between a subject and an object. A pure supervised approach to visual relationship detection requires a complete and balanced training set for all the possible combinations of ⟨subject, relation, object⟩. However, such training sets are not available and would require a prohibitive human effort. This implies the ability of predicting triples which do not appear in the training set. This problem is called zero-shot learning. State-of-the-art approaches to zero-shot learning exploit similarities among relationships in the training set or external linguistic knowledge. In this paper, we perform zero-shot learning by using Logic Tensor Networks, a novel Statistical Relational Learning framework that exploits both the similarities with other seen relationships and background knowledge, expressed with logical constraints between subjects, relations and objects. The experiments on the Visual Relationship Dataset show that the use of logical constraints outperforms the current methods. This implies that background knowledge can be used to alleviate the incompleteness of training sets.

“…An example of this are the Logic Tensor Networks in [46], where the authors show that encoding prior knowledge in symbolic form allows for better learning results on fewer training data, as well as more robustness against noise. A similar example is given in [47], where knowledge graphs are successfully used as priors in a scene description task, and in [48] where logical rules are used as background knowledge for a gradient descent learning task in a high-dimensional real-valued vector space.…”

Section: Learning With Symbolic Information As a Prior (mentioning)

confidence: 99%

A Boxology of Design Patterns for Hybrid Learning and Reasoning Systems

Harmelen

Teije

2019

JWE

We propose a set of compositional design patterns to describe a large variety of systems that combine statistical techniques from machine learning with symbolic techniques from knowledge representation. As in other areas of computer science (knowledge engineering, software engineering, ontology engineering, process mining and others), such design patterns help to systematize the literature, clarify which combinations of techniques serve which purposes, and encourage re-use of software components. We have validated our set of compositional design patterns against a large body of recent literature.

“…Visual Appearance Features are extracted from the predicate box, i.e. the minimum rectangle that encompasses the subject box and the object box [1,12,2,13,14,3], the separate subject-object boxes [5,15,16,17], or both [18,19,4,20,21]. All the above train a single branch with visual features, while we jointly train two separate branches with different features, a predicate feature branch (P-branch) and an object-subject branch (OS-branch), and employ Deep Supervision to align their scores into a common space.…”

Section: Related Work (mentioning)

confidence: 99%
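
The predicate box described in the statement above, the minimum rectangle that encompasses the subject box and the object box, can be sketched as follows. This is a minimal illustration, assuming boxes are given as corner coordinates (x_min, y_min, x_max, y_max); the `Box` type and the `predicate_box` function are hypothetical names, not taken from the citing paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in corner format (assumed for this sketch)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def predicate_box(subject: Box, obj: Box) -> Box:
    """Return the smallest axis-aligned rectangle containing both boxes."""
    return Box(
        x_min=min(subject.x_min, obj.x_min),
        y_min=min(subject.y_min, obj.y_min),
        x_max=max(subject.x_max, obj.x_max),
        y_max=max(subject.y_max, obj.y_max),
    )


# Hypothetical subject and object boxes in pixel coordinates.
subject_box = Box(10, 20, 50, 60)
object_box = Box(40, 10, 90, 55)
union = predicate_box(subject_box, object_box)
```

Visual appearance features for the predicate would then be extracted from the image region covered by `union`.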

“…Linguistic and Semantic Features are employed in a feature-level integration with word embeddings [1,13,4,12,3], encoding of statistics [18,13,4,14], late-fusion with subject-object classemes (score vectors) [22,5,14,19] and loss-level fusion as regularization terms [1,3] or adaptive-margins [4,19,12]. Closest to us, [2] uses subject-object embeddings to train context-aware classifiers and [20] trains multimodal embeddings by projecting visual and linguistic features into a common space.…”

Section: Related Work (mentioning)

confidence: 99%

Deeply Supervised Multimodal Attentional Translation Embeddings for Visual Relationship Detection

Gkanatsios

Pitsikalis

Koutras

et al. 2019

2019 IEEE International Conference on Image Processing (ICIP)

Detecting visual relationships, i.e. triplets, is a challenging Scene Understanding task approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, as well as by reimplementing and fine-tuning several of them. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, while we also justify our claims both quantitatively and qualitatively.
