A Hierarchical Approach for Visual Storytelling Using Image Description

Nahian, Md. Sultan Al; Tasrin, Tasmia; Gandhi, Sagar; Gaines, Ryan; Harrison, Brent

doi:10.1007/978-3-030-33894-7_30

Cited by 12 publications

(8 citation statements)

References 20 publications


“…
                    B-1    B-2    B-3    B-4    CIDEr  ROUGE-L  METEOR
AREL 2018 [32]      0.536  0.315  0.173  0.099  0.038  0.286    0.352
GLACNet 2018 [14]   0.56   0.321  0.171  0.091  0.041  0.264    0.306
HCBNet 2019 [1]     0.59   0.348  0.191  0.105  0.051  0.274    0.34
HCBNet(w/o prev. sent.…”

Section: Model (unclassified)

“…sent. attention) [1]                  0.59   0.338  0.180  0.097  0.057  0.271    0.332
HCBNet(w/o description attention) [1]  0.58   0.345  0.194  0.108  0.043  0.271    0.337
HCBNet(VGG) 2019 [1]                   0.59   0.34   0.186  0.104  0.051  0.269    0.334
ReCo-RL 2020 [
Story In Sequence (SIS) which is more relevant to storytelling problems and comprises a whole paragraph in precisely five sentences representing a story. In all dataset statements, it is essential to note that the names of the individuals are adjusted by "[male and female]", places by "[location]", and organizations by "[organization]".…”

Section: Model (mentioning)

confidence: 99%
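
The B-1 to B-4 columns in the statement above are most naturally read as BLEU scores for n-grams of length 1 to 4, although the quoted text does not expand the abbreviation. As a rough illustration only, the sketch below computes the clipped ("modified") n-gram precision that BLEU builds on. The function names and the toy sentences are hypothetical, a single reference is assumed, and the brevity penalty and the geometric mean over n-gram orders used by full BLEU are omitted; this is not the evaluation code of any cited paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, so repeating a word cannot inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Toy sentences (hypothetical, not taken from VIST).
candidate = "the the the cat".split()
reference = "the cat sat".split()

unigram = modified_precision(candidate, reference, 1)
bigram = modified_precision(candidate, reference, 2)
print(unigram, bigram)
```

In this toy case the repeated "the" is credited only once, which is the clipping behaviour that separates this measure from plain precision.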


Vision Transformer Based Model for Describing a Set of Images as a Story

Malakan, Hassan, Mian. 2022. Preprint.


Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationship. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, input images are divided into 16×16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a bidirectional LSTM, which is part of the sequence encoder. This captures the past and future image context of all image patches. Then, an attention mechanism is implemented and used to increase the discriminatory capacity of the data fed into the language model, i.e. a Mogrifier-LSTM. The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST), and the results show that our model outperforms the current state-of-the-art models.
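
The patch step described in the abstract above can be pictured with a small sketch. It assumes a single-channel 224×224 image and non-overlapping patches of 16×16 pixels; the resolution, the helper names `split_into_patches` and `project`, and the toy weights are all hypothetical and are not taken from the paper. In a real ViT the projection weights are learned.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def project(patch, weights):
    """Linear projection of a flattened patch: one output per weight row."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


# Hypothetical 224x224 grayscale image filled with synthetic pixel values.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = split_into_patches(image)

# Toy 2-dimensional projection (assumption): first pixel, and sum of all pixels.
weights = [[1] + [0] * 255, [1] * 256]
embedding = project(patches[0], weights)
print(len(patches), len(patches[0]), embedding[0])
```

Under these assumptions each image yields 14 × 14 = 196 patch vectors of length 256, and it is vectors of this kind that a sequence encoder would then consume.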


“…Automatically learning to map from image sequences to output stories is very challenging with no guidance, hence some approaches try to introduce some intermediate representation or data to help. A simple approach is taken by Nahian et al (2019), which encodes images and their associated text captions (from the VIST dataset) by separate encoders, and combines them, before decoding into the story sentences. Otherwise Nahian et al (2019) is a fairly straightforward encoder-decoder architecture.…”

Section: Exploiting Intermediate Data or Structures (mentioning)

confidence: 99%

“…A simple approach is taken by Nahian et al (2019), which encodes images and their associated text captions (from the VIST dataset) by separate encoders, and combines them, before decoding into the story sentences. Otherwise Nahian et al (2019) is a fairly straightforward encoder-decoder architecture. Other works try to extract some semantic information from the images without simply using the caption given in the dataset.…”

Section: Exploiting Intermediate Data or Structures (mentioning)

confidence: 99%

BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

Dai, Guérin, et al. 2021. Computer Speech & Language.


Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under the automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.


“…The colossal success in image recognition was possible with recent advances in artificial intelligence and deep learning [1][2][3][4]. The rudimentary operation involved in such applications is multiply-and-accumulate (MAC).…”

Section: Introduction (mentioning)

confidence: 99%
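
The multiply-and-accumulate (MAC) operation named in the statement above is the elementary step behind a dot product: each input is multiplied by a weight and the product is added to a running sum. A minimal sketch, with a hypothetical function name and toy values chosen only for illustration:

```python
def mac(inputs, weights, accumulator=0.0):
    """Multiply each input by its weight and add the product to a running sum."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    for x, w in zip(inputs, weights):
        accumulator += x * w  # one multiply-and-accumulate step
    return accumulator


# Toy activation and weight values (hypothetical).
result = mac([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
print(result)  # 0.5 - 2.0 + 6.0 = 4.5
```

A fully connected neuron's pre-activation is one such accumulated sum over its inputs, which is why the operation dominates the cost of neural network inference.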

Compact model of retention characteristics of ferroelectric FinFET synapse with MFIS gate stack

Baig et al. 2021. Semicond. Sci. Technol.


In this paper, multiple-fin n- and p-channel HfZrO2 ferroelectric-FinFET devices are manufactured using a gate-first process with post-metallization annealing. The device transfer characteristics upon program and erase operations are measured and modeled. The drift in the transfer characteristics due to the depolarization field and charge injection is captured using the shift in the threshold voltage, along with time-dependent modeling of vertical-field-dependent mobility degradation parameters, to develop a physical, computationally efficient, and accurate retention model for ferroelectric-FinFET devices. The modeled conductance is incorporated into the deep neural network simulation platform CIMulator to analyze the role of conductance drift due to retention degradation, as well as the importance of the gap between high and low conductance states in improving the image recognition accuracy of neural networks.

