Promoting Single-Modal Optical Flow Network for Diverse Cross-Modal Flow Estimation

Zhou, Shili; Tan, Weimin; Yan, Bo

doi:10.1609/aaai.v36i3.20268

Cited by 7 publications

(3 citation statements)

References 33 publications

“…The objective here is to assess the impact of incorporating video frames as conditioning signals in additional to text. For the task of dance-to-music generation, we further compare V2Meow with baseline models D2M-GAN (Zhu et al 2022a), CDCD Step-Intra (Zhu et al 2022b), and CMT (Di et al 2021) on the AIST++ test split, aiming to evaluate V2Meow's understanding of complex dance motion. Detailed results are presented in Table 1 and Table 2.…”

Section: Results (mentioning)

confidence: 99%

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Su, Li, Huang et al. 2024

AAAI

Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.

“…Then, the field made significant progress when RAFT (Teed and Deng 2020) proposed a new recurrent optical flow network to estimate optical flow. Based on this breakthrough architecture, many recurrent networks (Jiang et al 2021b; Luo et al 2022b; Sui et al 2022; Xu et al 2021; Zhang et al 2021; Zheng et al 2022; Zhou et al 2023) have been proposed. For example, GMA (Jiang et al 2021a) suggested combining global motion to solve the problem of estimating occlusion, and KPA-Flow (Luo et al 2022a) designed kernel patch attention to deal with the local relationships of optical flow.…”

Section: Related Work (mentioning)

confidence: 99%

Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation

Cheng, He, Jiang et al. 2024

AAAI

Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of iterations per sample. The policy network achieves this by learning contextual information to realize whether flow improvement is bottlenecked or minimal. On the one hand, we use iteration embedding and historical hidden cell, which include previous iterations information, to convey how flow has changed from previous iterations. On the other hand, we use the incremental loss to make the policy network implicitly perceive the magnitude of optical flow improvement in the subsequent iteration. Furthermore, the computational complexity in our dynamic network is controllable, allowing us to satisfy various resource preferences with a single trained model. Our policy network can be easily integrated into state-of-the-art optical flow networks. Extensive experiments show that our method maintains performance while reducing FLOPs by about 40%/20% for the Sintel/KITTI datasets.

“…Being popular in self-supervised learning, contrastive learning (CL) allows models to learn the knowledge behind data without explicit labels (Xia et al 2022; Zhu et al 2023). It aims to bring an anchor (i.e., data sample) closer to a positive/similar instance and away from many negative/dissimilar instances, by optimizing their mutual information in the embedding space.…”

Section: Contrastive Learning (mentioning)

confidence: 99%

A Multi-Modal Contrastive Diffusion Model for Therapeutic Peptide Generation

Wang, Liu, Huang et al. 2024

AAAI

Therapeutic peptides represent a unique class of pharmaceutical agents crucial for the treatment of human diseases. Recently, deep generative models have exhibited remarkable potential for generating therapeutic peptides, but they only utilize sequence or structure information alone, which hinders the performance in generation. In this study, we propose a Multi-Modal Contrastive Diffusion model (MMCD), fusing both sequence and structure modalities in a diffusion framework to co-generate novel peptide sequences and structures. Specifically, MMCD constructs the sequence-modal and structure-modal diffusion models, respectively, and devises a multi-modal contrastive learning strategy with inter-contrastive and intra-contrastive in each diffusion timestep, aiming to capture the consistency between two modalities and boost model performance. The inter-contrastive aligns sequences and structures of peptides by maximizing the agreement of their embeddings, while the intra-contrastive differentiates therapeutic and non-therapeutic peptides by maximizing the disagreement of their sequence/structure embeddings simultaneously. The extensive experiments demonstrate that MMCD performs better than other state-of-the-art deep generative methods in generating therapeutic peptides across various metrics, including antimicrobial/anticancer score, diversity, and peptide-docking.
