Learning to Embed Multi-Modal Contexts for Situated Conversational Agents

Lee, Haeju; Kwon, Oh Joon; Park, Min‐Ho; Han, Ran; Kim, Yoonhyung; Kim, Jinhyeon; Lee, Youngjune; Shin, Haebin; Lee, Kangwook; Kim, Kee-Eung

doi:10.18653/v1/2022.findings-naacl.61

Cited by 4 publications

(4 citation statements)

References 13 publications


“…We use the model of the coreference challenge winner team (Lee et al, 2022) (MultiTaskBART, 74% F1 ↑), a BART-based model (Lewis et al, 2020) trained to handle all challenge tasks. A pretrained ResNet model (He et al, 2016) encodes each object along with its non-visual attributes, a learnable embedding that is later mapped to match the dimension of BART.…”

Section: Language-vision-and-relational (mentioning)

confidence: 99%

‘What are you referring to?’ Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges

Chiyah-Garcia, Suglia, Eshghi et al. 2023

Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue


Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair it using meta-communicative, Clarificational Exchanges (CE): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial to handle complex referential ambiguities across modalities overall.

“…QS Goal Diggers (Kottur et al 2021a) and Kakao Enterprise (Lee and Han 2021) directly insert visual attributes into models input, while Sogang University (Kottur et al 2021b) and A-STAR (Nguyen et al 2021) build a set of visual attributes prediction tasks in pre-training stage. KAIST (Lee et al 2022) designs an auxiliary task to predict visual attributes. However, less attention has been paid to building spatial relations between assets.…”

Section: Do You Have Any Clothes Match My New Bought Jeans? (mentioning)

confidence: 99%

“…JMGPT (Kottur et al 2021b) and JointGM (Nguyen et al 2021) apply language model to predict visual attributes and system response jointly. MMBart (Lee et al 2022) adds embedded box coordinates to textual embedding as Transformer input and designs auxiliary tasks to predict visual attributes according to the output of encoder hidden states. We can find that their utilized spatial information is all from the bounding box.…”

Section: Related Work (mentioning)

confidence: 99%


SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph

Long, Hui, Fulong et al. 2023

AAAI


Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph (SPRING) with abilities of reasoning multi-hops spatial relations and connecting them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Increment Layout Graphs (ILG). QA pair difficulty labels automatically annotated by ILG are used to promote MQA-based Curriculum Learning. Experimental results verify the SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both SIMMC 1.0 and SIMMC 2.0 datasets. We release our code and data at https://github.com/LYX0501/SPRING.
