2023
DOI: 10.15625/1813-9663/18157

EVJVQA Challenge: Multilingual Visual Question Answering

Ngan Luu-Thuy Nguyen,
Nghia Hieu Nguyen,
Duong T.D. Vo
et al.

Abstract: Visual Question Answering (VQA) is a challenging task of natural language processing (NLP) and computer vision (CV), attracting significant attention from researchers. English is a resource-rich language that has witnessed various developments in datasets and models for visual question answering. Visual question answering in other languages also needs to be developed in terms of resources and models. In addition, there is no multilingual dataset targeting the visual content of a particular country with its own objects an…

Cited by 6 publications (3 citation statements)
References 22 publications

“…Consistent with the preceding research conducted by [7,19,22,34], elucidating the evaluation metrics utilized for gauging the model's efficacy is crucial before delving into the analysis of the experimental outcomes. The appraisal in this research encompasses four pivotal performance metrics: F1 score, Precision, Recall, and Accuracy.…”
Section: Evaluation Metrics
confidence: 79%
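The four metrics named in this citation statement are standard classification scores. As a minimal, purely illustrative sketch (not taken from the cited works, whose exact evaluation protocol is not shown in this excerpt), they can be computed with scikit-learn on toy binary labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy binary labels standing in for per-example correctness judgements;
# a real VQA evaluation would derive these from model answers vs. ground truth.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```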
“…Building on these foundational steps, Nguyen et al in 2022 [34] broke new ground by unveiling a multilingual dataset through a shared task. Notably, this dataset incorporates the Vietnamese language, broadening the scope of VQA research to delve into the Vietnamese linguistic setting.…”
Section: Related Work
confidence: 99%
“…In our experiment, we used UIT-EVJVQA [26], the first mVQA dataset with three languages, including English and Vietnamese released by VLSP-2022 Organizers for EVJVQA challenge (https://vlsp.org.vn/vlsp2022/eval/evjvqa). This dataset includes question-answer pairs created by humans on a set of images taken in Vietnam, with the answer created from the input question and the corresponding image.…”
Section: Dataset
confidence: 99%
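For illustration only, a single example in a multilingual VQA dataset of this kind could be represented roughly as below; the field names are hypothetical and are not taken from the actual UIT-EVJVQA release.

```python
import json

# Hypothetical record layout for a multilingual VQA example;
# the real UIT-EVJVQA schema may use different field names.
sample = {
    "image_id": 1234,  # an image taken in Vietnam
    "language": "vi",  # the excerpt names English and Vietnamese among the three languages
    "question": "Có bao nhiêu chiếc xe máy trong ảnh?",  # "How many motorbikes are in the picture?"
    "answer": "ba chiếc xe máy",                          # "three motorbikes"
}

print(json.dumps(sample, ensure_ascii=False, indent=2))
```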