Yuhao Cui scite author profile

Visual Question Answering (VQA) requires a finegrained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness.Experimental results demonstrate that MCAN significantly outperforms the previous state-ofthe-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.

show abstract

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Cui

Wang

et al. 2021

View full text Add to dashboard Cite

Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates the cROSs-and InTrA-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy to use the scene graph structure as a priori to perform masked language (region) modeling, which enhances the semantic alignments by eliminating the interference information within and across modalities. Extensive ablation studies and comprehensive analysis verifies the effectiveness of ROSITA in semantic alignments. Pretrained with both in-domain and out-ofdomain datasets, ROSITA significantly outperforms existing stateof-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets. CCS CONCEPTS• Computing methodologies → Multi-task learning.

show abstract

Deep Multimodal Neural Architecture Search

Cui

et al. 2020

View full text Add to dashboard Cite

Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradientbased NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-ofthe-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding. CCS CONCEPTS • Computing methodologies → Multi-task learning; Neural networks.

show abstract

Research on Image Colorization Algorithm Based on Residual Neural Network

Qin

Cheng

Cui

et al. 2017

View full text Add to dashboard Cite

A framework combining DNN and level-set method to segment brain tumor in multi-modalities MR image

Qin¹,

Zhang²,

Zeng³

et al. 2019

Soft Comput

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yuhao Cui

Deep Modular Co-Attention Networks for Visual Question Answering

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Deep Multimodal Neural Architecture Search

Research on Image Colorization Algorithm Based on Residual Neural Network

A framework combining DNN and level-set method to segment brain tumor in multi-modalities MR image

Contact Info

Product

Resources

About