Shengju Qian scite author profile

Make a Face: Towards Arbitrary High Fidelity Face Manipulation

[Figure 1 panels: Input; Face Manipulation Results of Our Model]

Figure 1: Face manipulation results on in-the-wild samples via transferring knowledge learned from the CelebA dataset. The first column shows input images and the remaining columns show images generated by AF-VAE with target expression/rotation boundary maps as the condition. Note that the model is fine-tuned with movie clip frames from YouTube at 256 × 256 resolution. None of the generated poses has been seen before.

Abstract: Recent studies have shown remarkable success in the face manipulation task with the advance of the GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity. In this work, we propose the Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision from reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple architecture design strategies are tailored to, and discussed in light of, the behavior of the Human Visual System (HVS) for the first time, allowing for fine control over model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Fréchet Inception Distance (FID) results demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and extremity of the face manipulation task.

Footnote 1: Work done during an internship at SenseTime Research.
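The abstract names reconstruction and KL divergence losses as the only supervision. As a rough illustration of what such an objective looks like in a standard VAE, the sketch below combines a squared-error reconstruction term with the closed-form KL divergence between a diagonal Gaussian posterior and a unit Gaussian prior. This is a generic VAE loss, not the AF-VAE objective: the additive Gaussian Mixture prior described in the abstract is not modeled, and the function names and the `kl_weight` parameter are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Sum of squared differences, used here as a stand-in reconstruction loss."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Reconstruction term plus weighted KL term (kl_weight is a hypothetical knob)."""
    return squared_error(x, x_hat) + kl_weight * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    # A posterior equal to the prior has zero KL, so only reconstruction error remains.
    print(vae_loss([1.0, 2.0], [1.0, 1.5], mu=[0.0, 0.0], log_var=[0.0, 0.0]))
```

In this form, the KL term pulls each latent code toward the prior while the reconstruction term keeps the decoded image close to the input; the relative weight between the two is a modeling choice that the abstract does not specify.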

On Efficient Transformer-Based Image Pre-training for Low-Level Vision

Li, Lu, Qian et al. 2021. Preprint

Aggregation via Separation: Boosting Facial Landmark Detector With Semi-Supervised Style Translation

Qian, Sun et al. 2019

Facial landmark detection, or face alignment, is a fundamental task that has been extensively studied. In this paper, we investigate a new perspective on facial landmark detection and demonstrate that it leads to further notable improvement. Given that any face image can be factored into a style space that captures lighting, texture and image environment, and a style-invariant structure space, our key idea is to leverage the disentangled style and shape spaces of each individual to augment existing structures via style translation. With these augmented synthetic samples, our semi-supervised model surprisingly outperforms the fully-supervised one by a large margin. Extensive experiments verify the effectiveness of our idea with state-of-the-art results on the WFLW [69], 300W [56], COFW [7], and AFLW [36] datasets. Our proposed structure is general and could be assembled into any face alignment framework. The code is made publicly available at https://github.com/thesouthfrog/stylealign.

Temporal Interlacing Network

Shao, Qian, Liu 2020. AAAI

For a long time, the vision community has tried to learn spatio-temporal representations by combining convolutional neural networks with various temporal models, such as the families of Markov chain, optical flow, RNN and temporal convolution. However, these pipelines consume enormous computing resources because spatial and temporal information are learned in alternation. One natural question is whether we can embed the temporal information into the spatial one so that the information in the two domains can be learned jointly in a single pass. In this work, we answer this question by presenting a simple yet powerful operator: the temporal interlacing network (TIN). Instead of learning the temporal features, TIN fuses the two kinds of information by interlacing spatial representations from the past to the future, and vice versa. A differentiable interlacing target can be learned to control the interlacing process. In this way, a heavy temporal model is replaced by a simple interlacing operator. We theoretically prove that with a learnable interlacing target, TIN performs equivalently to the regularized temporal convolution network (r-TCN), but gains 4% more accuracy with 6x less latency on 6 challenging benchmarks. These results push the state-of-the-art performance of video understanding by a considerable margin. Not surprisingly, the ensemble model of the proposed TIN won 1st place in the ICCV19 - Multi Moments in Time challenge. Code is made available to facilitate further research.
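To make the interlacing idea concrete, here is a minimal sketch that shifts each feature channel along the time axis by its own offset, so that a frame ends up holding some channels from the past and some from the future. It is an illustration under stated assumptions, not the authors' implementation: the per-channel offsets are fixed inputs here rather than learned, spatial dimensions are omitted, out-of-range frames are treated as zero, and linear interpolation for fractional offsets is assumed as one way to keep the shift differentiable. The names `shift_along_time` and `interlace` are hypothetical.

```python
import math


def shift_along_time(series, offset):
    """Shift a per-frame sequence by a (possibly fractional) offset.

    Output frame t reads from position t - offset using linear interpolation;
    positions outside the clip are treated as zero.
    """
    T = len(series)

    def at(i):
        return series[i] if 0 <= i < T else 0.0

    out = []
    for t in range(T):
        pos = t - offset
        lo = math.floor(pos)
        frac = pos - lo
        out.append((1.0 - frac) * at(lo) + frac * at(lo + 1))
    return out


def interlace(features, offsets):
    """Shift every channel of a clip by its own temporal offset.

    features: list of T frames, each a list of C channel values.
    offsets: one temporal offset per channel (fixed here; an assumption).
    """
    T, C = len(features), len(features[0])
    columns = [
        shift_along_time([features[t][c] for t in range(T)], offsets[c])
        for c in range(C)
    ]
    return [[columns[c][t] for c in range(C)] for t in range(T)]


if __name__ == "__main__":
    clip = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
    # Channel 0 moves forward in time, channel 1 backward, channel 2 stays put.
    for frame in interlace(clip, offsets=[1, -1, 0]):
        print(frame)
```

After the shift, an ordinary per-frame (spatial) layer applied to each output frame sees values that originated in neighboring frames, which is the sense in which temporal information is embedded into the spatial representation.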

What Makes for Good Tokenizers in Vision Transformer?

Qian, Zhu et al. 2023. IEEE Trans. Pattern Anal. Mach. Intell.

Make a Face: Towards Arbitrary High Fidelity Face Manipulation
