Findings of the Association for Computational Linguistics: NAACL 2022
DOI: 10.18653/v1/2022.findings-naacl.61
Learning to Embed Multi-Modal Contexts for Situated Conversational Agents

Abstract: The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned …
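The abstract is truncated here, but the citation statements below describe the proposed system as a single BART-based model trained to handle all four challenge tasks. A minimal sketch of that kind of joint multi-task seq2seq training follows; the task prefixes, the target serialization, and the facebook/bart-large checkpoint are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of joint multi-task training for the four SIMMC 2.0 subtasks with a
# single BART model. Task prefixes and target serializations are illustrative
# assumptions, not the paper's exact format.
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

TASKS = ["mm_disamb", "mm_coref", "mm_dst", "response"]  # the four subtasks

def encode_example(task, source_text, target_text):
    """Serialize one subtask example as (inputs, labels) for seq2seq training."""
    assert task in TASKS
    enc = tokenizer(f"<{task}> {source_text}", truncation=True, return_tensors="pt")
    lab = tokenizer(target_text, truncation=True, return_tensors="pt")["input_ids"]
    return enc, lab

# A joint training step mixes examples from all subtasks through the same model.
enc, lab = encode_example("mm_dst",
                          "USER: show me a red jacket like the one on the left",
                          "INFORM:GET [type=jacket, color=red]")
loss = model(input_ids=enc["input_ids"],
             attention_mask=enc["attention_mask"],
             labels=lab).loss
loss.backward()
```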

Cited by 4 publications (4 citation statements) · References 13 publications
“…We use the model of the coreference challenge winner team (Lee et al., 2022) (MultiTaskBART, 74% F1 ↑), a BART-based model (Lewis et al., 2020) trained to handle all challenge tasks. A pretrained ResNet model (He et al., 2016) encodes each object along with its non-visual attributes, a learnable embedding that is later mapped to match the dimension of BART.…”
Section: Language-Vision-and-Relational
Citation type: mentioning (confidence: 99%)
“…QS Goal Diggers (Kottur et al., 2021a) and Kakao Enterprise (Lee and Han, 2021) directly insert visual attributes into the model input, while Sogang University (Kottur et al., 2021b) and A-STAR (Nguyen et al., 2021) build a set of visual-attribute prediction tasks in the pre-training stage. KAIST (Lee et al., 2022) designs an auxiliary task to predict visual attributes. However, less attention has been paid to building spatial relations between assets.…”
Section: Do You Have Any Clothes Match My New Bought Jeans?
Citation type: mentioning (confidence: 99%)
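The first strategy contrasted in the statement above, directly inserting visual attributes into the model input, amounts to flattening each object's attributes into the text stream. A minimal sketch with a made-up attribute schema and object-token format:

```python
# Flatten each object's visual attributes into the textual model input.
# The attribute names, values, and <OBJn> token format are illustrative assumptions.
def serialize_object(obj_id, attributes):
    """Render one object and its visual attributes as a text span."""
    attr_text = ", ".join(f"{k}={v}" for k, v in attributes.items())
    return f"<OBJ{obj_id}> {attr_text}"

scene_objects = {
    0: {"type": "jacket", "color": "red", "pattern": "plain"},
    1: {"type": "jeans", "color": "blue", "pattern": "denim"},
}
scene_text = " ".join(serialize_object(i, a) for i, a in scene_objects.items())
model_input = f"{scene_text} USER: do you have any clothes that match my new jeans?"
```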
“…JMGPT (Kottur et al., 2021b) and JointGM (Nguyen et al., 2021) apply a language model to predict visual attributes and the system response jointly. MMBart (Lee et al., 2022) adds embedded box coordinates to the textual embeddings as Transformer input and designs auxiliary tasks to predict visual attributes from the encoder hidden states. In all of these approaches, the spatial information used comes solely from the bounding box.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
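The mechanism attributed to MMBart above can be sketched as follows: bounding-box coordinates are embedded and added to the token embeddings of object tokens, and an auxiliary head predicts visual attributes from encoder hidden states. The layer sizes, the linear box embedding, and the masking scheme are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class BoxAndAttributeHeads(nn.Module):
    def __init__(self, hidden=1024, num_attr_classes=20):
        super().__init__()
        self.box_embed = nn.Linear(4, hidden)          # (x1, y1, x2, y2), normalized to [0, 1]
        self.attr_head = nn.Linear(hidden, num_attr_classes)

    def add_box_embeddings(self, token_embeds, boxes, object_token_mask):
        # token_embeds: (batch, seq, hidden); boxes: (batch, seq, 4); mask: (batch, seq)
        # Box embeddings are added only at positions that correspond to object tokens.
        return token_embeds + self.box_embed(boxes) * object_token_mask.unsqueeze(-1)

    def attribute_logits(self, encoder_hidden_states):
        # Auxiliary task: predict each object's visual attributes from its encoder state.
        return self.attr_head(encoder_hidden_states)

heads = BoxAndAttributeHeads()
tok = torch.randn(2, 16, 1024)
box = torch.rand(2, 16, 4)
mask = (torch.rand(2, 16) > 0.5).float()
fused = heads.add_box_embeddings(tok, box, mask)   # fed to the Transformer encoder
enc_out = torch.randn(2, 16, 1024)                 # stand-in for encoder hidden states
aux_logits = heads.attribute_logits(enc_out)       # paired with a cross-entropy loss in training
```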