Multimodal Conversational AI: A Survey of Datasets and Approaches

Sundar, Anirudh; Heck, Larry

doi:10.18653/v1/2022.nlp4convai-1.12

Cited by 20 publications

(9 citation statements)

References 131 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…hallucinate facts is well documented [24]. As they are primarily text-based, their integration with non-linguistic knowledge sources may also be challenging, although there exists a number of approaches integrating neural models with knowledge bases [17,20] or image/videos [32,49]. Finally, even though few-shots learning approaches may be employed to mitigate problems of data scarcity [42,54], their portability to scenarios with no or limited data remains difficult.…”

Section: Modular Vs End-to-end Sdsmentioning

confidence: 99%

Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

Lison¹,

Kennington²

2023

Preprint

View full text Add to dashboard Cite

Software architectures for conversational robots typically consist of multiple modules, each designed for a particular processing task or functionality. Some of these modules are developed for the purpose of making decisions about the next action that the robot ought to perform in the current context. Those actions may relate to physical movements, such as driving forward or grasping an object, but may also correspond to communicative acts, such as asking a question to the human user. In this position paper, we reflect on the organization of those decision modules in human-robot interaction platforms. We discuss the relative benefits and limitations of modular vs. end-to-end architectures, and argue that, despite the increasing popularity of end-to-end approaches, modular architectures remain preferable when developing conversational robots designed to execute complex tasks in collaboration with human users. We also show that most practical HRI architectures tend to be either robot-centric or dialogue-centric, depending on where developers wish to place the "command center" of their system. While those design choices may be justified in some application domains, they also limit the robot's ability to flexibly interleave physical movements and conversational behaviours. We contend that architectures placing "action managers" and "interaction managers" on an equal footing may provide the best path forward for future human-robot interaction systems.

show abstract

Section: Modular Vs End-to-end Sdsmentioning

confidence: 99%

Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

Lison¹,

Kennington²

2023

Preprint

View full text Add to dashboard Cite

show abstract

“…So, unexploited potential exists in the study of multimodal conversational agents, which let users and conversational agents converse using both human language and visual information to be more realistic, human-like, and engaging. Sunder and Heck [10] has defined and mathematically formulated the goal of the multimodal conversational study. they suggested four basic problems in multimodal conversational systems: disambiguation, response generation, coreference resolution, and dialogue state tracking.…”

Section: Emoji Representation and Approaches Plays A Key Role In Mult...mentioning

confidence: 99%

“…So, they cannot understand the mood or tone of the user [9]. One more drawback of current conversational agents is that they converse only with language (text) whereas humans communicate with different modalities or senses [10]. Figure 1 shows an overview of the current state of conversational agents.…”

Section: Introductionmentioning

confidence: 99%

AI-Based Conversational Agents: A Scoping Review From Technologies to Future Directions

et al. 2022

View full text Add to dashboard Cite

Artificial intelligence is changing the world, especially the interaction between machines and humans. Learning and interpreting natural languages and responding have paved the way for many technologies and applications. The amalgam of machine learning, deep learning, and natural language processing helped Conversational Artificial Intelligence (AI) to change the face of Human-Computer Interaction (HCI). A conversational agent is an excellent example of conversational AI, which imitates the natural language. This article presents a sweeping overview of conversational agents that includes different techniques such as pattern-based, machine learning, and deep learning used to implement conversational agents. It also discusses the panorama of different tasks in conversational agents. This study also focuses on how conversational agents can simulate human behavior by adding emotions, sentiments, and affect to the context. With the advancements in recent trends and the rise in deep learning models, the authors review the deep learning techniques and various publicly available datasets used in conversational agents. This article unearths the research gaps in conversational agents and gives insights into future directions.

show abstract

“…At the core of these efforts, the ability to understand language and vision, as well as integrate both representations to align the linguistic expressions in the dialogue with the relevant visual concepts or perceived objects, is the key to multimodal dialogue understanding (Landragin, 2006;Loáiciga et al, 2021b,a;Kottur et al, 2018;Utescher and Zarrieß, 2021;Sundar and Heck, 2022;Dai et al, 2021).…”

Section: Introductionmentioning

confidence: 99%

Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue

Lovenia¹,

Cahyawijaya²,

Fung³

2023

Preprint

View full text Add to dashboard Cite

The demand for multimodal dialogue systems has been rising in various domains, emphasizing the importance of interpreting multimodal inputs from conversational and situational contexts. One main challenge in multimodal dialogue understanding is multimodal object identification, which constitutes the ability to identify objects relevant to a multimodal user-system conversation. We explore three methods to tackle this problem and evaluate them on the largest situated dialogue dataset, SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by ∼20% F1-score compared to the SIMMC 2.1 baselines. We provide analysis and discussion regarding the limitation of our methods and the potential directions for future works. Our code is publicly available at https://github.com/holylovenia/ multimodal-object-identification.

show abstract

Multimodal Conversational AI: A Survey of Datasets and Approaches

Cited by 20 publications

References 131 publications

Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

AI-Based Conversational Agents: A Scoping Review From Technologies to Future Directions

Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue

Contact Info

Product

Resources

About