Improving Image Paragraph Captioning with Dual Relations

Liu, Yun; Shi, Yihui; Feng, Fan; Li, Ruifan; Ma, Zhanyu; Wang, Xiaojie

doi:10.1109/icme52920.2022.9859701

Cited by 7 publications

(1 citation statement)

References 20 publications

“…DAM model determines the order of images based on spatial locations so that getting rid of verbose on the same object. Wang et al [116] designed Convolutional Auto-Encoding (CAE) networks to model the topics on region-level features, and further feed these topic vectors into a two-layer LSTM network. Liu et al [117] proposed DuelRel model to capture both spatial and semantic relationships, where spatial relations are acquired from a geometry pattern and semantic relations are modeled in a weakly supervised manner.…”

Section: Other Generation Related Tasks (mentioning)

confidence: 99%

A Survey of Vision and Language Related Multi-Modal Task

Wang, Hu, Qiu et al. 2022

CAAI Artificial Intelligence Research


With significant breakthroughs in deep learning for single-modal tasks, more and more work has begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common multi-modal information includes vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in daily human life, and many typical multi-modal tasks focus on these two modalities, such as visual captioning and visual grounding. In this paper, we conduct in-depth research on typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and some classical methods, organized by their different algorithmic concerns, and then discuss frequently used datasets and metrics. Then, we briefly summarize other variant and cutting-edge tasks to build a more comprehensive framework of vision and language multi-modal tasks. Finally, we discuss the development of pre-training research and give an outlook on future research. We hope this survey helps researchers understand the latest progress, existing problems, and exploration directions of vision and language multi-modal tasks, and provides guidance for future research.
