We devise a cascade GAN approach to generating talking-face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to a high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on those landmarks. Compared to a direct audio-to-image approach, the cascade avoids fitting spurious audiovisual correlations that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid such pixel jittering and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate sharper images with well-synchronized facial movements, we propose a novel regression-based discriminator structure that considers sequence-level information along with frame-level information. Thorough experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than by state-of-the-art methods in both quantitative and qualitative comparisons.
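As a rough illustration of the cascade described above, the PyTorch sketch below wires an audio-to-landmark stage into a landmark-conditioned frame generator and applies an attention-weighted pixel-wise loss. All layer sizes, module names, and the exact loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of a two-stage (cascade) talking-face pipeline:
# audio -> landmarks -> frame, with an attention-reweighted pixel loss.
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Stage 1: map an audio feature window to 2D facial landmarks."""
    def __init__(self, audio_dim=128, n_landmarks=68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 2),  # (x, y) per landmark
        )

    def forward(self, audio_feat):
        return self.net(audio_feat).view(-1, 68, 2)

class LandmarksToFrame(nn.Module):
    """Stage 2: generate a frame plus an attention map from landmarks."""
    def __init__(self, n_landmarks=68, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2, 512), nn.ReLU(),
            nn.Linear(512, 4 * img_size * img_size),  # 3 RGB + 1 attention
        )

    def forward(self, landmarks):
        out = self.net(landmarks.flatten(1))
        out = out.view(-1, 4, self.img_size, self.img_size)
        frame = torch.tanh(out[:, :3])         # generated RGB frame in [-1, 1]
        attention = torch.sigmoid(out[:, 3:])  # audiovisual attention map
        return frame, attention

def attention_pixel_loss(fake, real, attention):
    """Pixel-wise L1 loss reweighted by the attention map, so regions
    correlated with speech (e.g., the mouth) dominate the gradient."""
    return (attention * (fake - real).abs()).mean()

audio = torch.randn(8, 128)              # batch of audio features
real = torch.rand(8, 3, 64, 64) * 2 - 1  # ground-truth frames in [-1, 1]
landmarks = AudioToLandmarks()(audio)
frame, attn = LandmarksToFrame()(landmarks)
print(attention_pixel_loss(frame, real, attn))
```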
We have performed first-principles density functional theory (DFT) calculations to investigate how a subsurface transition metal M (M = Ni, Co, or Fe) affects the energetics and mechanisms of the oxygen reduction reaction (ORR) on the outermost Pt monolayer of Pt/M(111) surfaces. We found that subsurface Ni, Co, and Fe down-shift the d-band center of the Pt surface layer and thus weaken the binding of chemical species to the Pt/M(111) surface. Moreover, subsurface Ni, Co, and Fe modify the heat of reaction and the activation energy of the various elementary ORR steps on these surfaces. Our DFT results revealed that, owing to the influence of the subsurface metal, ORR adopts a hydrogen peroxide dissociation mechanism whose rate-determining O2 protonation step has an activation energy of 0.15 eV on Pt/Ni(111), 0.17 eV on Pt/Co(111), and 0.16 eV on Pt/Fe(111). In contrast, ORR follows a peroxyl dissociation mechanism on a pure Pt(111) surface, with an activation energy of 0.79 eV for its rate-determining O protonation step. Our theoretical study thus explains why subsurface Ni, Co, and Fe lead to a multifold enhancement in ORR catalytic activity on the Pt monolayer of Pt/M(111) surfaces.
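As a back-of-the-envelope illustration of how strongly these barriers differ, the short script below compares Arrhenius rate constants for the quoted activation energies under the simplifying assumption of equal pre-exponential factors. Real ORR kinetics also depend on coverage, solvation, and electrode potential, so the ratios should not be read as predicted activity gains.

```python
# Arrhenius comparison of the rate-determining barriers quoted above:
# 0.15-0.17 eV on Pt/M(111) vs. 0.79 eV on pure Pt(111). Assuming equal
# pre-exponential factors, k_alloy / k_Pt = exp((Ea_Pt - Ea_alloy) / kT).
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K
T = 300.0       # temperature, K

ea_pt = 0.79  # eV, rate-determining O protonation on Pt(111)
for surface, ea in [("Pt/Ni(111)", 0.15),
                    ("Pt/Co(111)", 0.17),
                    ("Pt/Fe(111)", 0.16)]:
    ratio = math.exp((ea_pt - ea) / (K_B * T))
    print(f"{surface}: k/k_Pt ~ {ratio:.2e}")

# Note: these idealized ratios come out astronomically large, far above
# the multifold enhancement observed experimentally, which underlines
# how much the equal-prefactor assumption oversimplifies real
# electrode kinetics.
```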
Improving the efficiency of the electrocatalytic reduction of oxygen represents one of the main challenges in the development of renewable energy technologies. Here, we report a systematic evaluation of Pt-ternary alloys (Pt3(MN)1 with M, N = Fe, Co, or Ni) as electrocatalysts for the oxygen reduction reaction (ORR). We first studied the ternary systems on extended surfaces of polycrystalline thin films to establish trends in electrocatalytic activity, and then applied this knowledge to synthesize ternary alloy nanocatalysts by a solvothermal approach. This study demonstrates that ternary alloy catalysts can be compelling systems for further advancing ORR electrocatalysis, reaching higher catalytic activities than bimetallic Pt alloys and improvement factors of up to 4 relative to monometallic Pt.
Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and numerous studies have found strong correlations between human perception of auditory and visual stimuli. Despite prior work in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make a first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets containing paired images and sounds of musical performances on different instruments. Our experiments using both classification and human evaluation demonstrate that our model can generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices, along with the datasets, will facilitate future research in this new problem space.
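For readers unfamiliar with conditional adversarial training, the following minimal PyTorch sketch shows the audio-to-image direction: a generator conditioned on an audio encoding and a discriminator that scores (image, audio) pairs. The encoders, dimensions, and layer sizes are illustrative assumptions rather than the paper's actual networks.

```python
# Minimal conditional GAN for cross-modal (audio -> image) generation.
import torch
import torch.nn as nn

AUDIO_DIM, NOISE_DIM, IMG_SIZE = 128, 100, 64

class Generator(nn.Module):
    """Generate an image from an audio encoding plus a noise vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, 3 * IMG_SIZE * IMG_SIZE), nn.Tanh(),
        )

    def forward(self, audio_code, noise):
        x = torch.cat([audio_code, noise], dim=1)
        return self.net(x).view(-1, 3, IMG_SIZE, IMG_SIZE)

class Discriminator(nn.Module):
    """Score whether an (image, audio) pair is a real correspondence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * IMG_SIZE * IMG_SIZE + AUDIO_DIM, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real/fake logit for the pair
        )

    def forward(self, img, audio_code):
        x = torch.cat([img.flatten(1), audio_code], dim=1)
        return self.net(x)

G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
audio = torch.randn(8, AUDIO_DIM)  # batch of audio encodings
fake = G(audio, torch.randn(8, NOISE_DIM))
g_loss = bce(D(fake, audio), torch.ones(8, 1))  # generator tries to fool D
print(g_loss)
```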