Ludan Ruan scite author profile

Inspired by the success of transformer-based pre-training methods on natural language tasks and, subsequently, on computer vision tasks, researchers have begun to apply transformers to video processing. This survey aims to give a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical paradigm of pre-training and fine-tuning for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performance. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.
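As a rough illustration of the two background components the abstract names (the attention mechanism and position encoding), the following is a minimal sketch of the standard scaled dot-product attention and sinusoidal position encoding. It is not taken from the survey itself; the function names and the plain-list representation are assumptions chosen for readability.

```python
import math


def sinusoidal_position_encoding(num_positions, dim):
    """Return a num_positions x dim table of sinusoidal position encodings.

    Even columns use sine and odd columns use cosine, with wavelengths
    that grow geometrically with the column index.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

In a video-language model the rows of `queries`, `keys`, and `values` would be projected token or frame embeddings with the position encoding added; those projections are omitted here.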


MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Ruan, Ma, Yang et al. 2022 (Preprint)


Accommodating Audio Modality in CLIP for Multimodal Processing

Ruan, Song et al. 2023 (AAAI)


Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities, in addition to the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different types of audio information for different scenarios, as general audio conveys both verbal and nonverbal heterogeneous information. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and AudioCaps. The corresponding code and checkpoints will be released at https://github.com/ludanruan/CLIP4VLA.
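To make the idea of inter-modal contrastive learning more concrete, here is a minimal sketch of a generic symmetric InfoNCE loss between a batch of audio embeddings and the paired embeddings of another modality. It is not the CLIP4VLA implementation: the function names are hypothetical, the temperature of 0.07 is an assumed default, and the paper's intra-modal term and audio type token are not modelled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def inter_modal_contrastive_loss(audio_embs, other_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a matched pair."""
    a = [l2_normalize(v) for v in audio_embs]
    b = [l2_normalize(v) for v in other_embs]
    n = len(a)
    # Cosine similarity between every audio item and every other-modality item.
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio -> other: each row should pick out its own column.
    a2b = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Other -> audio: each column should pick out its own row.
    b2a = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (a2b + b2a)
```

Under these assumptions the loss is small when matched pairs are the most similar items in the batch and large when they are not, which is the signal that pulls paired audio and text or video embeddings together.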


Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

Ruan, Chen, Song et al. 2021 (Preprint)


Entities Object Localization (EOL) aims to evaluate how grounded or faithful a description is, and consists of caption generation and object grounding. Previous works tackle this problem by jointly training the two modules in a single framework, which limits the complexity of each module. Therefore, in this work, we propose to split the two modules into two stages and improve each separately to boost overall system performance. For caption generation, we propose a Unified Multi-modal Pre-training Model (UMPM) to generate event descriptions with rich objects for better localization. For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful. Our overall system achieves state-of-the-art performance on both sub-tasks of the Entities Object Localization challenge at ActivityNet 2021, with 72.57 localization accuracy on the testing set of sub-task I and 0.2477 F1_all_per_sent on the hidden testing set of sub-task II.



Survey: Transformer based Video-Language Pre-training