Hierarchical Photo-Scene Encoder for Album Storytelling

Wang, Bairui; Ma, Lin; Zhang, Wei; Jiang, Wenhao; Zhang, Feng

doi:10.1609/aaai.v33i01.33018909

Cited by 29 publications

(18 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Yu et al (2017) and Wang et al (2019) additionally tackle the problem of selecting photos from an ordered stream. Yu et al (2017) compute a photo's attention and pick photos with the highest probability of inclusion, while Wang et al (2019) introduce a scene encoder which can determine when the current photo describes the start of a new scene. Apart from this, Wang et al (2019) is an example of an end-to-end trainable encoder-decoder approach, using GRUs.…”

Section: Direct Deep Learning Without Intermediatesmentioning

confidence: 99%

“…2.3.1 which have no intermediate data like topic words, or scene graph. We have some further similarity with Wang et al (2019) because it uses attention on the photos. However, none of the above works separately models the text semantics at sentence-level also and word-level; this is something we are introducing to improve the coherence of the output story.…”

Section: Reinforcement Learningmentioning

confidence: 99%

See 1 more Smart Citation

BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

Dai

Guérin

et al. 2021

Computer Speech & Language

View full text Add to dashboard Cite

Visual storytelling is a creative and challenging task, aiming to automatically generate a storylike description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.

show abstract

Section: Direct Deep Learning Without Intermediatesmentioning

confidence: 99%

Section: Reinforcement Learningmentioning

confidence: 99%

BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

Dai

Guérin

et al. 2021

Computer Speech & Language

View full text Add to dashboard Cite

show abstract

“…HPSR (Wang et al 2019): HPSR is a model includes the hierarchical photo-scene encoder, decoder, and reconstructor.…”

Section: Models For Comparisonmentioning

confidence: 99%

“…Due to the bias can be brought by the hand-coded evaluation metrics, Wang et al (2018b) proposes an adversarial reward learning framework to uncover a reward function from human demonstrations. Wang et al (2019) propose a model with a hierarchical photo-scene encoder and a re-constructor. Huang et al (2019) develops a hierarchically reinforcement learning approach, which introduces a local semantic concept to model.…”

Section: Related Workmentioning

confidence: 99%

Storytelling from an Image Stream Using Scene Graphs

Wang

Wei

et al. 2020

AAAI

View full text Add to dashboard Cite

Visual storytelling aims at generating a story from an image stream. Most existing methods tend to represent images directly with the extracted high-level features, which is not intuitive and difficult to interpret. We argue that translating each image into a graph-based semantic representation, i.e., scene graph, which explicitly encodes the objects and relationships detected within image, would benefit representing and describing images. To this end, we propose a novel graph-based architecture for visual storytelling by modeling the two-level relationships on scene graphs. In particular, on the within-image level, we employ a Graph Convolution Network (GCN) to enrich local fine-grained region representations of objects on scene graphs. To further model the interaction among images, on the cross-images level, a Temporal Convolution Network (TCN) is utilized to refine the region representations along the temporal dimension. Then the relation-aware representations are fed into the Gated Recurrent Unit (GRU) with attention mechanism for story generation. Experiments are conducted on the public visual storytelling dataset. Automatic and human evaluation results indicate that our method achieves state-of-the-art.

show abstract

“…HPSR (Wang et al, 2019a): It introduces an additional RNN stacked on the RNN-based photo encoder to detect the scene change. Information from both RNNs are fed into an RNN for story generation.…”

Section: Models For Comparisonmentioning

confidence: 99%

Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication

Wang

Wei

Cheng

et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

View full text Add to dashboard Cite

Visual storytelling aims to generate a narrative paragraph from a sequence of images automatically. Existing approaches construct text description independently for each image and roughly concatenate them as a story, which leads to the problem of generating semantically incoherent content. In this paper, we propose a new way for visual storytelling by introducing a topic description task to detect the global semantic context of an image stream. A story is then constructed with the guidance of the topic description. In order to combine the two generation tasks, we propose a multi-agent communication framework that regards the topic description generator and the story generator as two agents and learn them simultaneously via iterative updating mechanism. We validate our approach on VIST dataset, where quantitative results, ablations, and human evaluation demonstrate our method's good ability in generating stories with higher quality compared to state-of-the-art methods.

show abstract

Hierarchical Photo-Scene Encoder for Album Storytelling

Cited by 29 publications

References 27 publications

BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling

Storytelling from an Image Stream Using Scene Graphs

Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication

Contact Info

Product

Resources

About