DFEN: Dual Feature Enhancement Network for Remote Sensing Image Caption

Zhao, Wei; Yang, Wenzhong; Chen, Danny; Wei, Fuyuan

doi:10.3390/electronics12071547

Cited by 7 publications

(1 citation statement)

References 27 publications

“…Phrase comprehension (PC) is a fundamental task in the multi-modal learning community and serves as the basis for many downstream tasks, including image captioning [1,2], visual question answering [3,4], etc. The purpose of PC is to locate a specific entity in an image according to a given linguistic query.…”

Section: Introduction (mentioning)

confidence: 99%

Cascaded Searching Reinforcement Learning Agent for Proposal-Free Weakly-Supervised Phrase Comprehension

Wang; Yue. 2024. Electronics.

Phrase comprehension (PC) aims to locate a specific object in an image according to a given linguistic query. The existing PC methods work in either a fully supervised or proposal-based weakly supervised manner, both of which rely explicitly or implicitly on expensive region annotations. In order to completely remove the dependence on the supervised region information, this paper proposes to address PC in a proposal-free weakly supervised training paradigm. To this end, we developed a novel cascaded searching reinforcement learning agent (CSRLA). Concretely, we first leveraged a visual language pre-trained model to generate a visual–textual cross-modal attention heatmap. Accordingly, a coarse salient initial region of the referential target was located. Then, we formulated the visual object grounding as a Markov decision process (MDP) in a reinforcement learning framework, where an agent was trained to iteratively search for the target’s complete region from the salient local region. Additionally, we developed a novel confidence discrimination reward function (ConDis_R) to constrain the model to search for a complete and exclusive object region. The experimental results on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, demonstrated the effectiveness of our proposed method.
