2020
DOI: 10.48550/arxiv.2008.09150
Preprint

VisualSem: A High-quality Knowledge Graph for Vision and Language

Abstract: We argue that the next frontier in natural language understanding (NLU) and generation (NLG) will include models that can efficiently access external structured knowledge repositories. In order to support the development of such models, we release the VisualSem knowledge graph (KG) which includes nodes with multilingual glosses and multiple illustrative images and visually relevant relations. We also release a neural multi-modal retrieval model that can use images or sentences as inputs and retrieves entities …

Cited by 4 publications (8 citation statements). References 16 publications.
“…Three knowledge graph datasets are adopted in the pre-training process. VisualSem [2] is a high-quality multi-modal knowledge graph dataset for vision and language concepts, including entities with multilingual glosses, multiple illustrative images, and visually relevant relations, covering a total of 90k nodes, 1.3M glosses, and 938k images. 13 semantic relations connect the entities in the graph, and the entities in VisualSem are linked to Wikipedia articles, WordNet [34], and high-quality images from ImageNet [10].…”
Section: Methods — Flickr30k (1k Test Set), MSCOCO (5k Test Set), Text Ret…
confidence: 99%
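The structure this citation describes — nodes carrying multilingual glosses and illustrative images, connected by a small set of typed relations — can be sketched as a minimal data model. The class and field names below are illustrative assumptions for exposition, not the released VisualSem API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One KG entity: glosses keyed by language code, plus associated image IDs.
    node_id: str
    glosses: dict                     # e.g. {"en": "a domestic cat", "de": "Hauskatze"}
    images: list = field(default_factory=list)

@dataclass
class Edge:
    # One of the typed semantic relations linking two nodes (VisualSem uses 13).
    head: str
    relation: str                     # e.g. a hypothetical "is-a" relation
    tail: str

# Toy two-node graph connected by a single relation.
cat = Node("n1", {"en": "a domestic cat"}, ["img_001.jpg"])
animal = Node("n2", {"en": "a living organism that feeds on organic matter"})
edge = Edge(head=cat.node_id, relation="is-a", tail=animal.node_id)

print(edge.relation)  # is-a
```

In a setup like this, multilinguality lives entirely in the per-node gloss dictionary, while the graph topology is carried by the small, fixed relation vocabulary on the edges.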
“…To better fit our pre-training framework, we convert the original image-text pairs into triplets, using the specifically designed relations 'image of' and 'caption of'. (2) We also use the original CLIP model as the teacher, with an auxiliary loss L_KD measuring the KL distance between the outputs of CLIP and our model.…”
Section: Training Targets
confidence: 99%
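The two ideas in this excerpt — recasting an image-caption pair as directed triplets, and a KL-based distillation loss L_KD against a frozen CLIP teacher — can be sketched roughly as follows. The function names and the pure-Python KL computation over toy distributions are illustrative, not the cited paper's implementation:

```python
import math

def pair_to_triplets(image_id, caption):
    # Recast one image-text pair as two directed triplets using the
    # specially designed relations named in the excerpt.
    return [
        (image_id, "image of", caption),
        (caption, "caption of", image_id),
    ]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions; stands in for the
    # auxiliary distillation loss L_KD between teacher and student outputs.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

triplets = pair_to_triplets("img_42", "a dog chasing a ball")
teacher = [0.7, 0.2, 0.1]   # e.g. the CLIP teacher's similarity distribution
student = [0.6, 0.3, 0.1]   # the student model's distribution

loss = kl_divergence(teacher, student)
print(triplets[0])  # ('img_42', 'image of', 'a dog chasing a ball')
```

The distillation term goes to zero exactly when the student's distribution matches the teacher's, so it regularizes the student toward CLIP's behavior while the main objective trains on the triplets.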
“…In Tab. 1, we list mainstream multimodal knowledge graph datasets [1,33,43,54,79,85,87,97], constructed from texts and images with detailed information. For instance, VisualGenome [33] is a multimodal knowledge graph containing 40,480 relations and 108,077 image nodes with objects.…”
Section: Multimodal Knowledge Graph
confidence: 99%