Image Caption Generation Using Neural Network Models and LSTM Hierarchical Structure

Waghmare, Prachi; Shinde, Swati

doi:10.1007/978-981-16-2543-5_10

Cited by 4 publications

(1 citation statement)

References 15 publications


“…It needs to contain as many scenes as possible, preferably constructed by computer vision experts. Finally, applying the fine‐grained scene graph generation model to other computer vision and multi‐media tasks [WS22, DJX*21], such as content‐based image search, image captioning, visual question answering, and multi‐modal knowledge graph construction. The fine‐grained scene graph generation model can provide a better representation for these scene understanding‐related tasks and can significantly improve the model performance of these tasks.…”

Section: Discussion (citation type: mentioning)

confidence: 99%

Fine‐Grained Scene Graph Generation with Overlap Region and Geometrical Center

Zhao, Jin, Zhao et al. 2022

Computer Graphics Forum


Scene graph generation refers to the task of identifying the objects and specifically the relationships between the objects from an image. Existing scene graph generation methods generally use the bounding boxes region features of objects to identify the relationships between objects. However, we feel that the overlap region features of two objects may play an important role in fine‐grained relationship identification. In fact, some fine‐grained relationships can only be obtained from the overlap region features of two objects. Therefore, we propose the Multi‐Branch Feature Combination (MFC) module and Overlap Region Transformer (ORT) module to comprehensively obtain the visual features contained in the overlap regions of two objects. Concretely, the MFC module uses deconvolution and multi‐branch dilation convolution to obtain high‐pixels and multi‐receptive field features in the overlap regions. The ORT module uses the vision transformer to obtain the self‐attention of the overlap regions. The joint use of these two modules achieves the mutual complementation of local connectivity properties of convolution and the global connectivity properties of attention. We also design a Geometrical Center Augmented (GCA) module to obtain the relative position information of the geometric centers between two objects, to prevent the problem that only relying on the scale of the overlap region cannot accurately capture the relationship between two objects. Experiments show that our model ORGC (Overlap Region and Geometrical Center), the combination of the MFC module, the ORT module, and the GCA module, can enhance the performance of fine‐grained relation identification. On the Visual Genome dataset, our model outperforms the current state‐of‐the‐art model by 4.4% on the R@50 evaluation metric, reaching a state‐of‐the‐art result of 33.88.
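The two geometric quantities this abstract relies on, the overlap region of two object bounding boxes and the relative position of their geometric centers, can be illustrated with a minimal sketch. This is not the authors' implementation: the `(x_min, y_min, x_max, y_max)` box format and the function names `overlap_region`, `center`, and `center_offset` are assumptions made here for illustration only.

```python
from typing import Optional, Tuple

# Assumed box format (not specified in the abstract): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def overlap_region(a: Box, b: Box) -> Optional[Box]:
    """Return the intersection rectangle of two boxes, or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x2 <= x1 or y2 <= y1:
        return None  # boxes are disjoint or only touch at an edge
    return (x1, y1, x2, y2)


def center(box: Box) -> Tuple[float, float]:
    """Geometric center of a box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def center_offset(subject: Box, obj: Box) -> Tuple[float, float]:
    """Offset (dx, dy) from the subject's center to the object's center."""
    sx, sy = center(subject)
    ox, oy = center(obj)
    return (ox - sx, oy - sy)


if __name__ == "__main__":
    subject_box = (0.0, 0.0, 4.0, 4.0)  # hypothetical subject box
    object_box = (2.0, 2.0, 6.0, 6.0)   # hypothetical object box
    print(overlap_region(subject_box, object_box))
    print(center_offset(subject_box, object_box))
```

Two pairs of boxes can share an overlap region of the same size while their centers sit in different relative positions, which is why the abstract treats the center offset as a complement to overlap features.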


A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

Thangavel, Palanisamy, Muthusamy et al. 2023

Soft Comput


Deep-learning-based image captioning: analysis and prospects

Zhao, Jin, Zhang et al. 2023

Journal of Image and Graphics


The task of image captioning is to have a computer automatically generate a complete, fluent caption that suits the scene of a given image, realizing the multimodal conversion from image to text. Describing the visual content of an image accurately and quickly is a fundamental goal in artificial intelligence, with a wide range of applications in research and production. Image captioning can be applied to many aspects of social development, such as text captions for images and videos, visual question answering, storytelling from images, network image analysis, and keyword-based image search. Image captions can also assist individuals born with visual impairments, making the computer another pair of eyes for them. The accuracy and inference speed of image captioning algorithms have improved greatly with the wide application of deep learning technology. On the basis of extensive literature research, …

CLC number: TP183. Document code: A.
