The interpretation of medical images into natural language is a developing field of artificial intelligence (AI) called image captioning. It integrates two branches of AI: computer vision and natural language processing. The task is challenging because it goes beyond object recognition, segmentation, and classification: it demands an understanding of the relationships between the components of an image and of what those components represent visually. Content-based image retrieval (CBIR) uses an image captioning model to generate captions for a user's query image. The common architecture of medical image captioning systems consists of an image feature extraction subsystem followed by a lingual caption generation subsystem. In this paper, we aim to build an optimized model for captioning histopathological images of stomach adenocarcinoma endoscopic biopsy specimens. For the image feature extraction subsystem, we performed two evaluations. First, we tested five vision models (VGG, ResNet, PVT, SWIN-Large, and ConvNEXT-Large) with three decoders (LSTM, RNN, and bidirectional RNN), and then compared the vision models under three lingual configurations (LSTM without augmentation, LSTM with augmentation, and BioLinkBERT-Large as an embedding layer with augmentation) to find the most accurate combination. Second, we tested three concatenations of pairs of vision models (SWIN-Large, PVT_v2_b5, and ConvNEXT-Large) to determine which pair yields the most expressive image feature vector. For the lingual caption generation subsystem, we compared the pre-trained language embedding model BioLinkBERT-Large against an LSTM in both evaluations to select the more accurate model. Our experiments showed that a captioning system that concatenates ConvNEXT-Large and PVT_v2_b5 as the image feature extractor, combined with the BioLinkBERT-Large language embedding model, produces the best results among the tested combinations.
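To make the dual-backbone encoder described above concrete, the sketch below concatenates pooled features from ConvNeXt-Large and PVT_v2_b5. This is a minimal sketch, not the authors' implementation: it assumes the `timm` library and its `convnext_large` and `pvt_v2_b5` model names, and the feature dimensions in the comments are what those checkpoints typically report.

```python
# Hedged sketch: concatenating two vision backbones into one image feature
# vector, as in the ConvNEXT-Large + PVT_v2_b5 configuration described above.
# Assumes `timm` provides the `convnext_large` and `pvt_v2_b5` model names.
import torch
import timm


class DualBackboneEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # num_classes=0 strips the classification head, leaving pooled features.
        self.convnext = timm.create_model("convnext_large", pretrained=True, num_classes=0)
        self.pvt = timm.create_model("pvt_v2_b5", pretrained=True, num_classes=0)

    def forward(self, images):
        # Each backbone yields a (batch, dim) pooled feature vector; the
        # concatenation is the joint representation fed to the caption decoder.
        f1 = self.convnext(images)          # expected (B, 1536) for ConvNeXt-Large
        f2 = self.pvt(images)               # expected (B, 512) for PVT_v2_b5
        return torch.cat([f1, f2], dim=-1)  # expected (B, 2048)


encoder = DualBackboneEncoder().eval()
with torch.no_grad():
    features = encoder(torch.randn(1, 3, 224, 224))
print(features.shape)
```

The same pattern extends to any pair of backbones; only the concatenated dimension that the decoder consumes changes.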
Automatic image captioning helps identify features of multimedia content and supports the detection of interesting patterns, trends, and occurrences. English image captioning has recently made remarkable progress; however, Arabic image captioning still lags behind, and Arabic caption generation remains a difficult machine learning problem. This paper presents a more accurate model for Arabic image captioning that uses transformer models in both phases: as image feature extractors in the encoder phase and as a pre-trained word embedding model in the decoder phase. All models are implemented, trained, and tested on the Arabic Flickr8k dataset. For the image feature extraction subsystem, we compared three individual vision models (SWIN, XCIT, and ConvNexT) and their concatenations to determine which yields the most expressive image feature vector. For the lingual caption generation subsystem, we tested four pre-trained language embedding models (ARABERT, ARAELECTRA, MARBERTv2, and CamelBERT) to select the most accurate one. Our experiments showed that an Arabic image captioning system that concatenates the three transformer-based models ConvNexT, SWIN, and XCIT as the image feature extractor, combined with the CamelBERT language embedding model, produces the best results among the tested combinations, scoring 0.5980 on BLEU-1; the concatenation of ConvNexT and SWIN with the ARAELECTRA language embedding model scores 0.1664 on BLEU-4. Both scores are higher than the previously reported values of 0.443 and 0.157.
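To illustrate the decoder-side idea of reusing a pre-trained language model's embeddings, the following sketch plugs an Arabic BERT's token embedding layer into a simple LSTM caption decoder. This is an illustrative sketch, not the paper's code: the Hugging Face checkpoint id `CAMeL-Lab/bert-base-arabic-camelbert-mix` is an assumption, and the image feature and hidden dimensions are placeholders.

```python
# Hedged sketch: a pre-trained Arabic BERT's token embeddings reused as a
# frozen embedding layer in an LSTM caption decoder. The checkpoint id is
# assumed; dimensions are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
bert = AutoModel.from_pretrained(ckpt)


class CaptionDecoder(torch.nn.Module):
    def __init__(self, image_dim=2048, hidden=512):
        super().__init__()
        # Reuse BERT's word embeddings instead of training them from scratch.
        self.embed = bert.get_input_embeddings()
        self.embed.weight.requires_grad = False           # keep embeddings frozen
        self.init_h = torch.nn.Linear(image_dim, hidden)  # image -> initial state
        self.lstm = torch.nn.LSTM(self.embed.embedding_dim, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, tokenizer.vocab_size)

    def forward(self, image_features, token_ids):
        # Initialize the LSTM state from the encoder's image feature vector.
        h0 = torch.tanh(self.init_h(image_features)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        x = self.embed(token_ids)        # (B, T, embed_dim)
        y, _ = self.lstm(x, (h0, c0))
        return self.out(y)               # per-step vocabulary logits
```

Freezing the embedding layer is one plausible design choice here; fine-tuning it jointly with the decoder is equally possible and simply a matter of leaving `requires_grad` enabled.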