Several approaches to image captioning build on deep learning encoder-decoder models, in which a CNN extracts spatial and visual features and an RNN generates the caption as a word sequence [10,11]. A spectrum of encoder architectures has been explored to enhance image captioning systems, including Inception-v3, the Visual Geometry Group network (VGGNet), Inception-v3 paired with an LSTM decoder [12], the 152-layer Residual Network (ResNet-152) [13], and VGG-16 [14]. Notably, transfer learning with pre-trained encoders, commonly trained on ImageNet, has demonstrated superior outcomes [15].
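To make the encoder-decoder pattern concrete, the following is a minimal sketch in PyTorch, assuming torchvision is available; the use of ResNet-152 as the frozen ImageNet-pretrained encoder mirrors the cited works, while the layer sizes and class names are illustrative, not any specific system from [10-15].

```python
import torch
import torch.nn as nn
from torchvision import models


class EncoderCNN(nn.Module):
    """Pre-trained CNN encoder: maps an image to a fixed-size feature embedding."""
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the final classification layer; freeze the convolutional trunk
        # so only the new projection layer is trained (transfer learning).
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images).flatten(1)  # (batch, 2048)
        return self.fc(feats)                     # (batch, embed_size)


class DecoderRNN(nn.Module):
    """LSTM decoder: generates caption tokens conditioned on the image embedding."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first step of the input sequence,
        # then predict a vocabulary distribution at each time step.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(embeddings)
        return self.fc(hidden)                    # (batch, seq_len + 1, vocab_size)
```

In this formulation the image embedding seeds the LSTM's first step and training proceeds with teacher forcing on ground-truth captions; swapping the backbone for Inception-v3 or VGG-16 changes only the encoder while the decoder is unchanged, which is what allows the encoder comparisons surveyed above.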