Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3462806

Towards Multi-Modal Conversational Information Seeking

Abstract: Recent research on conversational information seeking (CIS) mostly focuses on uni-modal interactions and information items. This perspective paper highlights the importance of moving towards developing and evaluating multi-modal conversational information seeking (MMCIS) systems, as they enable us to leverage richer context, overcome errors, and increase accessibility. We bridge the gap between multi-modal and CIS research and provide a formal definition for MMCIS. We discuss potential opportunities and res…

Cited by 29 publications (25 citation statements) · References 55 publications
“…Section 2 will focus on this formulation, starting with a brief introduction on conversational information seeking (Section 2.3). This includes a discussion of different modalities' (that is, text, speech, or multi-modal) impact on the seeking process, as for instance studied by Deldjoo et al (2021). We then continue with the topic of conversational search and its various proposed definitions (Section 2.5), culminating with one that relates CIS to many other related settings (Anand et al, 2020).…”
Section: Applications (mentioning)
confidence: 99%
“…Users can interact with a conversational system through a range of input devices, including keyboards for typing, microphones for speech, smartphones for touch, or through a mixture of these and other input devices (Deldjoo et al, 2021). Using a mixture of modalities offers numerous benefits.…”
Section: Interaction Modality and Language in Conversation (mentioning)
confidence: 99%
“…Vision Embedding. Different from extracting object-level features as vision features [9,10,11], we use ViT [17] as a backbone network to process images, which is faster than object detectors. Following ViT, the patch embedding splits the input image I ∈ ℝ^{H×W×C} into N = HW/P² patches according to the patch size P, and then flattens and reshapes the patches into v ∈ ℝ^{N×(P²·C)} through a linear transformation.…”
Section: Embedding (mentioning)
confidence: 99%
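To make the patch-embedding step quoted above concrete, the following PyTorch snippet is a minimal sketch of the ViT-style operation it describes: splitting an image into N = HW/P² non-overlapping patches, flattening each to a P²·C vector, and applying a linear projection. This is an illustration only, not the cited paper's implementation; the class name PatchEmbedding and the default sizes (P = 16, embed_dim = 768) are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT-style patch embedding (assumed names and defaults)."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection of each flattened P*P*C patch to the embedding dimension.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W); assumes P divides both H and W.
        B, C, H, W = images.shape
        P = self.patch_size
        # Split into N = (H*W)/P^2 patches, then flatten each to length P^2 * C,
        # matching v in R^{N x (P^2 * C)} from the quoted description.
        patches = images.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5)        # (B, H/P, W/P, C, P, P)
        patches = patches.reshape(B, -1, C * P * P)        # (B, N, P^2 * C)
        return self.proj(patches)                          # (B, N, embed_dim)

# Example: a 224x224 RGB image yields N = (224*224)/16^2 = 196 patch embeddings.
x = torch.randn(1, 3, 224, 224)
tokens = PatchEmbedding()(x)
print(tokens.shape)  # torch.Size([1, 196, 768])
```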
“…So motivated, multimodal tasks have recently gained increasing popularity, especially in the fields of vision and language. At present, popular visual and language tasks include Visual Caption (VC) [4,5], Visual Grounding [6,7], Visual Question Answering (VQA) [4,7,8] and Visual Dialog (VD) [9,10,11]. VQA attempts to predict a correct answer to questions given some background texts and images.…”
Section: Introduction (mentioning)
confidence: 99%