Virtual try-on has attracted considerable research attention due to its potential applications in e-commerce, virtual reality, and fashion design. However, existing methods struggle to preserve fine-grained details (e.g., clothing texture, facial identity, hair style, skin tone) during generation, owing to non-rigid body deformation and the multi-scale nature of these details. In this work, we propose a multi-stage framework for synthesizing person images in which fine-grained details are well preserved. To address long-range translation and rich-detail generation, we propose a Tree-Block (tree dilated fusion block) to replace the standard ResNet block where applicable. Notably, by incorporating larger spatial context at multiple scales, multi-scale feature maps can be smoothly fused for fine-grained detail generation. With a carefully designed end-to-end training scheme, our whole framework can be jointly optimized to produce results with significantly better visual fidelity and richer details. Moreover, we also explore a potential application in video-based virtual try-on: by harnessing the well-trained image generator and an extra video-level adaptor, a model photo can be animated with a driving pose sequence. Extensive evaluations on standard datasets and a user study demonstrate that our proposed framework achieves state-of-the-art results, especially in preserving visual details of clothing texture and facial identity. Our implementation is publicly available via this URL.
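To make the multi-scale fusion idea concrete, the sketch below is a minimal NumPy illustration (not the paper's actual Tree-Block): each branch applies a 3x3 convolution with a different dilation rate, so branches see progressively larger spatial context, and the branch outputs are fused and added back residually. The function names, the averaging fusion, and the dilation rates (1, 2, 4) are illustrative assumptions.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Same-size 2D convolution of a single-channel map `x` with a
    dilated 3x3 `kernel`; zero-padding keeps the output shape equal
    to the input shape."""
    kh, kw = kernel.shape
    pad = dilation * (kh // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Taps are spaced `dilation` pixels apart: larger dilation,
            # larger receptive field, same kernel cost.
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + h, dj:dj + w]
    return out

def multi_scale_fusion_block(x, kernels, dilations=(1, 2, 4)):
    """Illustrative multi-branch dilated fusion: run one dilated conv
    per branch, average the branches, and add a residual connection.
    This is a toy stand-in for the Tree-Block described in the text."""
    branches = [dilated_conv2d(x, k, d) for k, d in zip(kernels, dilations)]
    fused = sum(branches) / len(branches)
    return x + fused  # residual connection, as in a ResNet-style block
```

In a real network each branch would be a learned convolution over many channels; the point here is only that differently dilated branches capture context at multiple scales before being fused into one detail-preserving feature map.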