Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3462806

Towards Multi-Modal Conversational Information Seeking

Abstract: Recent research on conversational information seeking (CIS) mostly focuses on uni-modal interactions and information items. This perspective paper highlights the importance of moving towards developing and evaluating multi-modal conversational information seeking (MMCIS) systems, as they enable us to leverage richer context, overcome errors, and increase accessibility. We bridge the gap between multi-modal and CIS research and provide a formal definition for MMCIS. We discuss potential opportunities and res…

Cited by 29 publications (25 citation statements) · References 55 publications
“…Section 2 will focus on this formulation, starting with a brief introduction on conversational information seeking (Section 2.3). This includes a discussion of different modalities' (that is, text, speech, or multi-modal) impact on the seeking process, as for instance studied by Deldjoo et al (2021). We then continue with the topic of conversational search and its various proposed definitions (Section 2.5), culminating with one that relates CIS to many other related settings (Anand et al, 2020).…”
Section: Applications (mentioning)
confidence: 99%
“…Users can interact with a conversational system through a range of input devices, including keyboards for typing, microphones for speech, smartphones for touch, or through a mixture of these and other input devices (Deldjoo et al, 2021). Using a mixture of modalities offers numerous benefits.…”
Section: Interaction Modality and Language in Conversation (mentioning)
confidence: 99%
“…Vision Embedding. Different from extracting object-level features as vision features [9,10,11], we use ViT [17] as a backbone network to process images, which is faster than object detectors. Following ViT, the patch embedding splits the input image I ∈ ℝ^{H×W×C} into N = HW/P² patches according to the patch size P, and then flattens and reshapes the patches into v ∈ ℝ^{N×(P²·C)} through a linear transformation.…”
Section: Embedding (mentioning)
confidence: 99%
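To make the patch-embedding step quoted above concrete, the following PyTorch snippet is a minimal sketch of the ViT-style operation it describes: splitting an image into N = HW/P² non-overlapping patches, flattening each to a P²·C vector, and applying a linear projection. This is an illustration only, not the cited paper's implementation; the class name PatchEmbedding and the default sizes (P = 16, embed_dim = 768) are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT-style patch embedding (assumed names and defaults)."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection of each flattened P*P*C patch to the embedding dimension.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W); assumes P divides both H and W.
        B, C, H, W = images.shape
        P = self.patch_size
        # Split into N = (H*W)/P^2 patches, then flatten each to length P^2 * C,
        # matching v in R^{N x (P^2 * C)} from the quoted description.
        patches = images.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5)        # (B, H/P, W/P, C, P, P)
        patches = patches.reshape(B, -1, C * P * P)        # (B, N, P^2 * C)
        return self.proj(patches)                          # (B, N, embed_dim)

# Example: a 224x224 RGB image yields N = (224*224)/16^2 = 196 patch embeddings.
x = torch.randn(1, 3, 224, 224)
tokens = PatchEmbedding()(x)
print(tokens.shape)  # torch.Size([1, 196, 768])
```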
“…So motivated, multimodal tasks have recently gained increasing popularity, especially in the fields of vision and language. At present, popular visual and language tasks include Visual Caption (VC) [4,5], Visual Grounding [6,7], Visual Question Answering (VQA) [4,7,8] and Visual Dialog (VD) [9,10,11]. VQA attempts to predict a correct answer to questions given some background texts and images.…”
Section: Introduction (mentioning)
confidence: 99%