2019
DOI: 10.1109/tpami.2018.2890628

Focal Visual-Text Attention for Memex Question Answering

Abstract: Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photo albums, we have to look at whole collections with sequences of photos. This paper proposes a new multimodal MemexQA task: given a sequence of photos from a user, the goal is to automatically answer questions that help users recover their memory about an event captured…

Cited by 71 publications (25 citation statements)
References 50 publications
“…Apart from answering questions, J. Liang et al. [86] have reported evidential image and text snippets to support the reasoning for the answer. The proposed attention-based neural network, Focal Visual-Text Attention network (FVTA), takes into consideration both visual and text sequence information.…”
Section: Encoder-Decoder Based Methods
confidence: 99%
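To make the idea in the statement above more concrete, the following is a minimal, hypothetical sketch of question-conditioned attention over joint visual and text sequences. It is not the authors' exact FVTA formulation; the function name, feature shapes, and the dot-product scoring are illustrative assumptions. The attention weights are the kind of quantity one could read off to surface evidential image and text snippets.

```python
# Minimal sketch (NumPy) of cross-modal attention over visual and text
# sequences, in the spirit of the quoted statement. NOT the authors' exact
# formulation; shapes and the scoring/fusion steps are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def focal_attention(question_vec, visual_seq, text_seq):
    """Attend over a concatenated visual/text sequence conditioned on the question.

    question_vec: (d,)     pooled question representation
    visual_seq:   (Tv, d)  per-photo visual features
    text_seq:     (Tt, d)  per-token (or per-caption) text features
    Returns a fused context vector (d,) and the attention weights (Tv+Tt,).
    """
    seq = np.concatenate([visual_seq, text_seq], axis=0)  # (Tv+Tt, d)
    scores = seq @ question_vec                           # dot-product relevance
    weights = softmax(scores)                              # attention distribution
    context = weights @ seq                                # weighted sum over both modalities
    return context, weights

# Usage with random features (d = 8, 3 photos, 5 text tokens)
rng = np.random.default_rng(0)
q = rng.normal(size=8)
ctx, w = focal_attention(q, rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
print(ctx.shape, w.shape)  # (8,) (8,)
```

In this toy version, the highest-weight entries of `w` point to the photos or text pieces most relevant to the question, which is the role evidential snippets play in the quoted description.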
“…Multimodal Feature Composition. Another related area is multimodal feature composition, which has been studied more extensively in other problems such as visual question answering [8,30,44,53], visual reasoning [28,60], image-to-image translation [34,83], etc. Specifically, our method is related to feature-wise modulation, a technique to modulate the features of one source by referencing those from the other.…”
Section: Related Work
confidence: 99%
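As a concrete illustration of the feature-wise modulation mentioned above, here is a minimal FiLM-style sketch in which features from one source are scaled and shifted using parameters predicted from the other. The projection sizes and the single linear predictor are assumptions for illustration, not the architecture of any cited paper.

```python
# Minimal sketch (NumPy) of feature-wise modulation: one source's features
# are scaled (gamma) and shifted (beta) using parameters predicted from the
# other source. Dimensions and the linear predictor are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d_cond, d_feat = 16, 32

# Linear "conditioning" network predicting per-channel gamma and beta.
W = rng.normal(scale=0.1, size=(d_cond, 2 * d_feat))
b = np.zeros(2 * d_feat)

def film_modulate(features, condition):
    """features: (N, d_feat) from source A; condition: (d_cond,) from source B."""
    gamma, beta = np.split(condition @ W + b, 2)  # per-channel scale and shift
    return gamma * features + beta                 # broadcast over the N items

visual_feats = rng.normal(size=(10, d_feat))       # e.g. image-region features
text_cond = rng.normal(size=(d_cond,))             # e.g. pooled question features
modulated = film_modulate(visual_feats, text_cond)
print(modulated.shape)                              # (10, 32)
```

The design point is that the conditioning source never mixes directly with the modulated features; it only parameterizes an affine transform applied channel-wise, which is what distinguishes feature-wise modulation from concatenation-based fusion.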
“…The VQA task (Agrawal et al., 2015) requires answering a freeform natural language question about visual content in an image. Previous work has shown that models often do well on the task by exploiting language and dataset biases (Agrawal et al., 2017; Zhang et al., 2015; Ramakrishnan et al., 2018; Guo et al., 2019; Manjunatha et al., 2018). […] (Selvaraju et al., 2019a,b; Qiao et al., 2017; Liang et al., 2019), the multi-modal task of VQA has a language component which cannot always be explained visually, i.e., visual regions can be insufficient to express underlying concepts (Goyal et al., 2016; Hu et al., 2017). Park et al. (2018) and Wu and Mooney (2019) generate textual justifications through datasets curated with human explanations.…”
Section: Related Work
confidence: 99%