This thesis is positioned at the intersection of several research fields, Natural Language Processing, Information Retrieval (IR), and Computer Vision, which have converged around representation learning and pre-training methods. We have defined and studied a new multimodal task: Knowledge-based Visual Question Answering about Named Entities (KVQAE). We were particularly interested in cross-modal interactions and in the different ways of representing named entities. We also focused on the data used to train and, more importantly, to evaluate Question Answering systems through different metrics. To this end, we annotated a dataset, the first in KVQAE to cover various types of entities. We also defined an experimental framework for tackling KVQAE in two stages, using an unstructured knowledge base, and identified IR as the main bottleneck of KVQAE, especially for questions about non-person entities. To improve the IR stage, we studied different multimodal fusion methods, pre-trained through an original task: the Multimodal Inverse Cloze Task. We found that these models leverage a cross-modal interaction that we had not originally considered, and which may address the heterogeneity of visual representations of named entities. These results were strengthened by a study of the CLIP model, which allows this cross-modal interaction to be modeled directly.
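As an illustration of this last point, the sketch below shows how a CLIP-style model embeds a textual question and candidate knowledge-base images in a shared space so that they can be compared directly by cosine similarity; it is a minimal, hypothetical example (the Hugging Face checkpoint name, question, and image paths are assumptions), not the thesis's actual retrieval pipeline.

```python
# Minimal sketch of CLIP-style cross-modal retrieval: embed a textual question
# and candidate images in the same space, then rank the images by similarity.
# The checkpoint, question, and file names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

question = "Which museum houses this painting?"            # hypothetical KVQAE-style question
images = [Image.open(p) for p in ["candidate_1.jpg",       # hypothetical images from an
                                  "candidate_2.jpg"]]      # unstructured knowledge base

inputs = processor(text=[question], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize both embeddings, then take cosine similarity between
# the question and each candidate image.
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print(scores)  # higher score = image judged more relevant to the question
```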
Awarded by: Université Paris-Saclay, Orsay, France, on 8 November 2023.
Supervised by: Olivier Ferret and Camille Guinaudeau.
Available at: https://www.theses.fr/s247993.