Thangka image captioning aims to automatically generate accurate and complete sentences describing the main content of Thangka images. However, existing methods fail to capture both the features of the core deity regions and the surrounding background details, and they largely overlook local actions and interactions within the images. To address these issues, this paper proposes a Thangka image captioning model based on Salient Attention and a Local Interaction Aggregator (SALIA). The model is designed with a Dual-Branch Salient Attention Module (DBSA) to accurately capture the deity's expressions and adornments as well as descriptive background elements, and it introduces a Local Interaction Aggregator (LIA) for detailed analysis of the characters' actions, facial expressions, and their complex interactions with surrounding elements in Thangka images. Experimental results show that SALIA outperforms other state-of-the-art methods in both qualitative and quantitative evaluations of Thangka image captioning, achieving BLEU-4 of 94.0%, ROUGE-L of 95.0%, and CIDEr of 909.8% on the D-Thangka dataset, and BLEU-4 of 22.2% and ROUGE-L of 47.2% on the Flickr8k dataset.
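The dual-branch idea behind DBSA can be illustrated with a minimal sketch: one attention branch attends over salient (deity) region features and a second branch over background region features, and the two contexts are fused for the caption decoder. The function names, the 50/50 fusion, and the feature shapes below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention over a set of region features.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def dual_branch_salient_attention(query, salient_feats, background_feats):
    # Branch 1 attends to salient (deity) regions, branch 2 to background
    # regions; the equal-weight fusion here is an assumption for illustration.
    fg = attention(query, salient_feats, salient_feats)
    bg = attention(query, background_feats, background_feats)
    return 0.5 * (fg + bg)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))      # hypothetical caption-decoder query
sal = rng.normal(size=(10, 64))   # hypothetical salient-region features
bkg = rng.normal(size=(20, 64))   # hypothetical background-region features
ctx = dual_branch_salient_attention(q, sal, bkg)
print(ctx.shape)
```

In practice each branch would have its own learned projections and the fusion weights would be trained, but the sketch shows how two attention maps over disjoint region sets yield a single context vector.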