“…Given full-body scans as 3D supervision, [75,76,28,29,5] learned SDFs or occupancy fields directly from images, which could predict photo-realistic human avatars in the inference phase. [70,81,49,69,45,31,12,88,99,63,102] leveraged radiance fields to obtain more photo-realistic human avatars from multiview images or single-view videos without any 3D supervision. Although implicit representations improve reconstruction quality over explicit ones, they still have drawbacks, e.g., a heavy computational burden or poor geometry. Moreover, volume rendering is incompatible with graphics hardware, so the outputs cannot be used in downstream applications without further post-processing.…”