MemCap: Memorizing Style Knowledge for Image Captioning

Zhao, Wentian; Wu, Xinxiao; Zhang, Xiaoxun

doi:10.1609/aaai.v34i07.6998

Cited by 61 publications

(38 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Some recent image captioning studies [26,21,31] have constructed variant LSTM language models to learn factual and non-factual knowledge in corpora. Some studies [32,21,33,34] have allowed for learning non-factual knowledge in unpaired corpora via weakly supervised or unsupervised methods. These methods are expected (but not limited) to be used to train better models for autogenerating factual and functional captions of drug paraphernalia.…”

Section: Discussionmentioning

confidence: 99%

“…MSCap [33] was designed to generate multiple stylized descriptions by training a single captioning model on an unpaired non-factual corpus with the help of several auxiliary modules. Zhao et al [34] proposed a new model named MemCap, which resorts to explicitly encoding non-factual knowledge by building a memory module.…”

Section: Related Work a Image Captioning Datasetsmentioning

confidence: 99%

See 1 more Smart Citation

DrunaliaCap: Image Captioning for Drug-Related Paraphernalia With Deep Learning

Zhao

2020

IEEE Access

View full text Add to dashboard Cite

Image captioning is a process of generating textual descriptions of images. In recent years, research on publicly available large-scale datasets and deep learning-based algorithms has promoted the development of this field. However, little research has been conducted on captioning images of drug-related paraphernalia that, despite being an important topic for both drug prevention and police enforcement, is not covered by existing image captioning studies. In this paper, we propose DrunaliaCap-a deep learningbased system for autogenerating both "factual" (what is in the image) and "functional" (the usage of each paraphernalia during drug-taking) descriptions of images of drug-related paraphernalia. We constructed a new dataset containing 20 categories of drug-related items and trained deep learning-based models for the proposed system. We further proposed a method to evaluate and optimize the generation of captions to prevent them from missing important knowledge. Experiments were conducted to validate the performance of the newly proposed dataset and method. We analyzed the experimental results and discussed the significance, limitations, and potential applications of our work. INDEX TERMS image captioning, drug prevention, dataset construction, deep learning VOLUME 4, 2016 This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

show abstract

Section: Discussionmentioning

confidence: 99%

Section: Related Work a Image Captioning Datasetsmentioning

confidence: 99%

DrunaliaCap: Image Captioning for Drug-Related Paraphernalia With Deep Learning

Zhao

2020

IEEE Access

View full text Add to dashboard Cite

show abstract

“…Basic encoder-decoder architecture is composed of CNN (as encoder) and RNN (Recurrent Neural Networks) (as a decoder). The image is fed to CNN for feature conversion while features are fed to RNN for mapping against the annotation words [12,13,14,15]. To make the network more innovative and efficient various additions are done in the model for example incorporation of visual attention mechanisms [16,17], region of interests, and attention behaviors [18,19].…”

Section: Related Workmentioning

confidence: 99%

Image Captioning With Positional and Geometrical Semantics

Haque

Ghani

Saeed³

2021

IEEE Access

View full text Add to dashboard Cite

The last 5 to 6 years have seen tremendous progress in automatic image captioning using deep learning. Initial research focused on the attribute-to-attribute comparison of image features and texts to describe the image as a sentence, the current research is handling issues related to semantics and correlations. However, current state of art research suffers from insufficient concepts when it comes to positional and geometrical attributes. The majority of research relying on CNN's (Convolutional Neural Networks) for object feature extractions has no clue about equivariance and rotational invariance which leads towards the orientation-less understanding of objects for captioning along with longer training time, and larger dataset. Furthermore, CNN's based image captioning encoders also fail to understand the geometrical alignment of object attributes within the image and hence mislabels distorted as correct. To cater to the above issues, we propose ICPS (Captioning with Positional and geometrical Semantics) a capsule network-based image captioning technique along with transformer neural networks as the decoder. The proposed ICPS architecture handles various geometrical properties of image objects with the help of parallelized capsules while the object-to-text decoding is done by Transformer Neural Networks. The inclusion of cluster capsules provides better object understanding in terms of position, equivariance, and geometrical orientation with more augmented object understanding over a small dataset in comparatively less time. The extracted image features provide a better understanding of image objects and help the decoding stage to narrate effectively with positional and geometrical details. We trained and tested our ICPS over the Flickr8k dataset and found ourselves to be better at captioning when it comes to describing the positional and geometrical transitions as compared to other current state-of-the-art research.

show abstract

“…Glove is for representing words. Zhao et al [157] have proposed 'MemCap', a novel stylized image captioning method that explicitly encodes the knowledge about linguistic styles with memory mechanism. The authors have proposed to implement VGG-16 with Faster-RCNN for visual feature extraction.…”

Section: Feature Extractionmentioning

confidence: 99%

“…Jia and Li [48] have proposed LSTM as a sentence generator. Zhao et al [157] have proposed to generate captions such as the SN Computer Science proposed model 'MemCap' first extracts content-relevant style knowledge from the memory module via an attention mechanism and then incorporates the extracted knowledge into a language model. The decoder of the proposed model of Chen and Jin [17] is an RNN model of LSTM cell.…”

Section: Sentence Generationmentioning

confidence: 99%

Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods

et al. 2021

View full text Add to dashboard Cite

Video captioning is an automated collection of natural language phrases that explains the contents in video frames. Because of the incomparable performance of deep learning in the field of computer vision and natural language processing in recent years, research in this field has been exponentially increased throughout past decades. Numerous approaches, datasets, and measurement metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this exciting new direction. Through the statistical analysis, this survey paper focuses mostly on state-of-the-art approaches, emphasizing deep learning models, assessing benchmark datasets in several parameters, and classifying the pros and cons of the various evaluation metrics based on the previous works in the deep learning field. This survey shows the most used variants of neural networks for visual and spatio-temporal feature extraction as well as language generation model. The results show that ResNet and VGG as visual feature extractor and 3D convolutional neural network as spatio-temporal feature extractor are mostly used. Besides that, Long Short Term Memory (LSTM) has been mainly used as the language model. However, nowadays, the Gated Recurrent Unit (GRU) and Transformer are slowly replacing LSTM. Regarding dataset usage, so far, MSVD and MSR-VTT are very much dominant due to be part of outstanding results among various captioning models. From 2015 to 2020, with all major datasets, some models such as, Inception-Resnet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, ResNet-152 + ResNext-101 (R3D) + (LSTM, GAN) have achieved by far best results in video captioning. Despite rapid advancement, our survey reveals that video captioning research-work still has a lot to develop in accessing the full potential of deep learning for classifying and captioning a large number of activities, as well as creating large datasets covering diversified training video samples.

show abstract

MemCap: Memorizing Style Knowledge for Image Captioning

Cited by 61 publications

References 18 publications

DrunaliaCap: Image Captioning for Drug-Related Paraphernalia With Deep Learning

DrunaliaCap: Image Captioning for Drug-Related Paraphernalia With Deep Learning

Image Captioning With Positional and Geometrical Semantics

Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods

Contact Info

Product

Resources

About