We propose a new approach for 3D reconstruction of dynamic indoor and outdoor scenes in everyday environments, leveraging only cameras worn by a user. This approach enables 3D reconstruction of experiences at any location and virtual tours from anywhere. The key innovation of the proposed egocentric reconstruction system is to capture the wearer's body pose and facial expression from near-body views, e.g. cameras on the user's glasses, and to capture the surrounding environment using outward-facing views. The main challenge of egocentric reconstruction, however, is the poor coverage of the near-body views: the user's body and face are observed from vantage points that are convenient to wear but inconvenient for capture. To overcome this challenge, we propose a parametric-model-based approach to user motion estimation. This approach uses convolutional neural networks (CNNs) for near-view body pose estimation, and we introduce a CNN-based approach for facial expression estimation that combines audio and video. For each time-point during capture, the intermediate model-based reconstructions from these estimators are used to re-target a high-fidelity pre-scanned model of the user. We demonstrate that the proposed self-sufficient, head-worn capture system can reconstruct the wearer's movements and their surrounding environment in both indoor and outdoor settings without any additional views. As a proof of concept, we show how the resulting 3D-plus-time reconstruction can be experienced immersively in a virtual reality system (e.g., the HTC Vive). We expect that the proposed egocentric capture-and-reconstruction system will eventually shrink to fit within future AR glasses, and that it will be widely useful for immersive 3D telepresence, virtual tours, and general use-anywhere 3D content creation.
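To make the per-frame retargeting step concrete, the following is a minimal sketch, assuming a parametric user model built from a pre-scanned template mesh with facial blendshapes and driven by the intermediate pose and expression estimates. The estimator outputs, model structure, and names (`PrescannedUserModel`, `estimate_body_pose`, `estimate_expression`) are illustrative placeholders, not the authors' actual system.

```python
# Sketch: per-frame retargeting of a pre-scanned user model from near-view estimates.
import numpy as np

class PrescannedUserModel:
    """High-fidelity pre-scanned model: template vertices plus expression blendshapes."""
    def __init__(self, template, blendshapes):
        self.template = template          # (V, 3) neutral-pose vertices
        self.blendshapes = blendshapes    # (K, V, 3) facial expression offsets

    def retarget(self, body_rotation, expr_weights):
        # Apply facial expression as a linear combination of blendshape offsets,
        # then pose the result with a (here, single global) body rotation.
        expressed = self.template + np.tensordot(expr_weights, self.blendshapes, axes=1)
        return expressed @ body_rotation.T

def estimate_body_pose(near_body_frame):
    # Placeholder for the near-view body pose CNN; returns a 3x3 rotation here.
    return np.eye(3)

def estimate_expression(face_frame, audio_window, num_blendshapes):
    # Placeholder for the audio-visual expression CNN; returns blendshape weights.
    return np.zeros(num_blendshapes)

# Per time-point, the intermediate estimates drive the pre-scanned model.
model = PrescannedUserModel(template=np.random.rand(100, 3),
                            blendshapes=np.random.rand(5, 100, 3) * 0.01)
for frame in range(3):
    pose = estimate_body_pose(near_body_frame=None)
    expr = estimate_expression(face_frame=None, audio_window=None, num_blendshapes=5)
    posed_mesh = model.retarget(pose, expr)   # (100, 3) vertices for this time-point
```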
We present a novel method to generate plausible diffraction effects for interactive sound propagation in dynamic scenes. Our approach precomputes a diffraction kernel for each dynamic object in the scene and combines these kernels with interactive ray tracing algorithms at runtime. A diffraction kernel encapsulates the sound interaction behavior of an individual object in the free field, and we present a new source placement algorithm that significantly accelerates the precomputation. Our overall propagation algorithm can handle highly tessellated or smooth objects undergoing rigid motion. We evaluate our algorithm's performance on different scenarios with multiple moving objects and demonstrate its benefits over prior interactive geometric sound propagation methods. We also conducted a user study to evaluate the perceived smoothness of the diffracted sound field and found that auditory perception with our approach is comparable to that of a wave-based sound propagation method.
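The precompute/runtime split described above can be illustrated with a small sketch: each dynamic object gets a free-field diffraction kernel tabulated over incoming and outgoing directions and frequency bands, and runtime ray paths look up the kernel at each interaction. The kernel values, direction discretization, and function names here are illustrative assumptions rather than the paper's implementation.

```python
# Sketch: per-object diffraction kernel precompute and runtime lookup along a ray path.
import numpy as np

N_DIRS, N_BANDS = 16, 4   # coarse direction / frequency-band discretization

def precompute_diffraction_kernel(object_geometry):
    # Placeholder for the offline free-field solve around the isolated object;
    # here we just fabricate a smooth per-band attenuation table.
    rng = np.random.default_rng(0)
    return rng.uniform(0.1, 1.0, size=(N_DIRS, N_DIRS, N_BANDS))

def direction_bin(direction):
    # Map a unit vector to a coarse azimuth bin (sketch-level discretization).
    azimuth = np.arctan2(direction[1], direction[0]) % (2 * np.pi)
    return int(azimuth / (2 * np.pi) * N_DIRS) % N_DIRS

def apply_kernel(kernel, incoming_dir, outgoing_dir):
    # Runtime lookup: per-band diffraction gain for this object interaction.
    return kernel[direction_bin(incoming_dir), direction_bin(outgoing_dir)]

# Runtime: combine a kernel lookup along a (hard-coded) source -> object -> listener path.
kernels = {"chair": precompute_diffraction_kernel(object_geometry=None)}
incoming = np.array([1.0, 0.0, 0.0])     # source -> object direction
outgoing = np.array([0.0, 1.0, 0.0])     # object -> listener direction
band_gains = apply_kernel(kernels["chair"], incoming, outgoing)   # shape (N_BANDS,)
```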
Figure 1: We synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. We extract the affective cues from the speech, the sentiments from the corresponding text transcripts, the individual speaker styles, and the joint-based affective expressions from the seed poses (shown on the left). We train a generative adversarial network to synthesize gestures aligned with the speech by leveraging the affective information in both the generation and the discrimination phases. We show two such affective gestures on the right, with the affects furious and appalled denoted in italics.
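As a rough illustration of the conditional generator described in the caption, the sketch below concatenates speech features, text sentiment, speaker style, and a seed pose into a conditioning signal that drives a recurrent pose-sequence generator. The layer sizes, feature dimensions, and class name (`AffectiveGestureGenerator`) are assumptions for illustration, not the paper's architecture.

```python
# Sketch: a conditional gesture generator driven by speech, sentiment, style, and seed pose.
import torch
import torch.nn as nn

class AffectiveGestureGenerator(nn.Module):
    def __init__(self, speech_dim=64, sentiment_dim=8, style_dim=16,
                 pose_dim=30, hidden_dim=128):
        super().__init__()
        cond_dim = speech_dim + sentiment_dim + style_dim + pose_dim
        self.rnn = nn.GRU(cond_dim, hidden_dim, batch_first=True)
        self.to_pose = nn.Linear(hidden_dim, pose_dim)   # per-frame joint parameters

    def forward(self, speech, sentiment, style, seed_pose):
        # speech: (B, T, speech_dim); the static cues are tiled across time.
        T = speech.shape[1]
        static = torch.cat([sentiment, style, seed_pose], dim=-1)
        cond = torch.cat([speech, static.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        hidden, _ = self.rnn(cond)
        return self.to_pose(hidden)   # (B, T, pose_dim) upper-body pose sequence

# Usage with random placeholder features; a discriminator conditioned on the same
# affective cues would be trained adversarially against this generator.
gen = AffectiveGestureGenerator()
poses = gen(torch.randn(2, 60, 64), torch.randn(2, 8),
            torch.randn(2, 16), torch.randn(2, 30))
print(poses.shape)   # torch.Size([2, 60, 30])
```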