Since its inception, COVID-19 has spread rapidly across the globe. In this outbreak, early detection and rapid isolation of patients are the most decisive tasks in containing the spread of the virus. Here, artificial intelligence methods such as machine learning and deep learning can come to the aid of clinicians. To that end, we conducted a qualitative investigation of 12 off-the-shelf Convolutional Neural Network (CNN) architectures for classifying COVID-19 from CT scan images. In addition, we analyzed how U-Net, a segmentation architecture for biomedical images, affects the performance of the CNN models. A publicly available dataset (SARS-COV-2 CT-Scan) containing 2481 CT scan images was used for the evaluation. Without the segmentation step, the DenseNet169 architecture achieved an F1 score of 88.60% and an accuracy of 89.31%. With U-Net segmentation, the DenseNet201 model achieved the best accuracy and F1 score, 89.92% and 89.67% respectively. These results indicate that combining a transfer learning architecture with a segmentation technique (U-Net) enhances the performance of the classification model.
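The transfer-learning setup the abstract describes can be sketched roughly as follows. This is a minimal, hypothetical PyTorch example, not the authors' implementation: the 224x224 input size, the grayscale-to-RGB replication, and the two-class head are assumptions, and in the segmented variant a pretrained U-Net would first mask each slice before it reaches this pipeline.

```python
# Hedged sketch: transfer learning with a pretrained DenseNet169, in the
# spirit of the abstract. Preprocessing details and the training loop are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained DenseNet169 backbone; swap the 1000-class head
# for a two-class head (COVID-19 vs. non-COVID-19).
model = models.densenet169(weights=models.DenseNet169_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 2)

# Standard ImageNet preprocessing; CT slices are grayscale, so they are
# replicated to three channels before normalization (an assumption).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def train_step(model, images, labels, optimizer, criterion):
    """One fine-tuning step on a (images, labels) mini-batch."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the U-Net variant reported above, the only structural change would be a segmentation pass that masks out non-lung regions before preprocess is applied; the classifier itself stays the same.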
Video captioning is the automated generation of natural language descriptions that explain the contents of video frames. Owing to the strong performance of deep learning in computer vision and natural language processing in recent years, research in this field has grown rapidly. Numerous approaches, datasets, and evaluation metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this exciting new direction. Through statistical analysis, this survey focuses on state-of-the-art approaches, emphasizing deep learning models, assessing benchmark datasets along several dimensions, and weighing the pros and cons of the various evaluation metrics in light of prior deep learning work. It identifies the most widely used neural network variants for visual and spatio-temporal feature extraction as well as for language generation. The results show that ResNet and VGG are the most common visual feature extractors and 3D convolutional neural networks the most common spatio-temporal feature extractors. Long Short-Term Memory (LSTM) has been the dominant language model, although the Gated Recurrent Unit (GRU) and the Transformer are gradually replacing it. Regarding datasets, MSVD and MSR-VTT remain dominant, as they underpin the outstanding results of many captioning models. From 2015 to 2020, across the major datasets, models such as Inception-ResNet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, and ResNet-152 + ResNeXt-101 (R3D) + (LSTM, GAN) achieved by far the best results in video captioning. Despite rapid advancement, our survey reveals that video captioning research still has far to go in harnessing the full potential of deep learning for classifying and captioning a large number of activities, and in creating large datasets covering diversified training video samples.
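The dominant encoder-decoder pattern the survey identifies (a 2D CNN extracting per-frame visual features that condition an LSTM language model) can be sketched as follows. This is an illustrative sketch under stated assumptions, not any surveyed model's implementation: the feature dimensions, the mean-pooling over frames, and initializing the LSTM state from the video representation are all choices made for brevity.

```python
# Hedged sketch of a CNN-encoder + LSTM-decoder captioner, the pattern
# the survey reports as most common. All sizes are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        resnet = models.resnet152(
            weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the classification head; keep the 2048-d pooled features.
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.feat_proj = nn.Linear(2048, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (batch, num_frames, 3, H, W); captions: (batch, seq_len)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)  # (b*t, 2048)
        # Mean-pool frame features into one video vector (an assumption;
        # attention over frames is a common alternative).
        video = self.feat_proj(feats.view(b, t, -1).mean(dim=1))
        # Seed the LSTM hidden state with the video representation.
        h0 = video.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)  # (batch, seq_len, vocab_size)
```

Swapping the LSTM for a GRU or a Transformer decoder, or the 2D backbone for a 3D network such as C3D or I3D, yields the other model families the survey enumerates.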