This paper presents an overview of some of the synthetic visual objects supported by MPEG-4 version-1, namely animated faces and animated arbitrary 2-D uniform and Delaunay meshes. We discuss both the specification and the compression of face animation and 2-D mesh animation in MPEG-4. The face animation tool can animate either a proprietary face model or a face model downloaded to the decoder. We also address integration of the face animation tool with the text-to-speech interface (TTSI), so that face animation can be driven by text input.
Introduction
MPEG-4 is an object-based multimedia compression standard that allows the different audiovisual objects (AVOs) in a scene to be encoded independently. The visual objects may have natural or synthetic content, including arbitrary shape video objects, special synthetic objects such as human face and body, and generic 2-D/3-D objects composed of primitives like rectangles, spheres, or indexed face sets, which define an object surface by means of vertices and surface patches. The synthetic visual objects are animated by transforms and special-purpose animation techniques, such as face/body animation and 2-D mesh animation. MPEG-4 also provides synthetic audio tools such as structured audio tools and a text-to-speech interface (TTSI). This paper presents a detailed overview of synthetic visual objects supported by MPEG-4 version-1, namely animated faces and animated arbitrary 2-D uniform and Delaunay meshes. We also address integration of the face animation tool with the TTSI, so that face animation can be driven by text input. Body animation and 3-D mesh compression and animation will be supported in MPEG-4 version-2, and hence are not covered in this article.
The representation of synthetic visual objects in MPEG-4 is based on the prior VRML standard [13][12][11], using nodes such as Transform, which defines the rotation, scale, or translation of an object, and IndexedFaceSet, which describes the 3-D shape of an object by an indexed face set. However, MPEG-4 is the first international standard that specifies a compressed binary representation of animated synthetic audio-visual objects. It is important to note that MPEG-4 only specifies the decoding of compliant bit streams in an MPEG-4 terminal. Encoders therefore enjoy a large degree of freedom in how they generate MPEG-4 compliant bit streams.
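To make the indexed-face-set idea above concrete, the following minimal Python sketch stores a surface as a shared vertex list plus patches that refer to vertices by index, and applies a Transform-like operation (uniform scale, then rotation about the z axis, then translation) to the vertices only. The class name, field names, and the `transform` function are hypothetical and chosen for illustration; they are not the VRML or MPEG-4 node syntax, which, for example, encodes faces differently and supports arbitrary rotation axes.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedFaceSet:
    """Hypothetical container: shared vertices plus faces given as vertex indices."""
    coords: list  # list of (x, y, z) tuples
    faces: list   # list of tuples of indices into coords


def transform(shape, scale=1.0, angle_z=0.0, translation=(0.0, 0.0, 0.0)):
    """Scale uniformly, rotate about z, then translate; the face indices are reused."""
    c, s = math.cos(angle_z), math.sin(angle_z)
    tx, ty, tz = translation
    moved = []
    for x, y, z in shape.coords:
        x, y, z = scale * x, scale * y, scale * z
        moved.append((c * x - s * y + tx, s * x + c * y + ty, z + tz))
    return IndexedFaceSet(moved, shape.faces)


# A unit square in the z = 0 plane, described by two triangular patches.
square = IndexedFaceSet(
    coords=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
    faces=[(0, 1, 2), (0, 2, 3)],
)
moved = transform(square, scale=2.0, translation=(1.0, 0.0, 0.0))
```

Because the patches reference vertices by index, animating the object only requires updating the coordinates; the connectivity stays fixed.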
Decoded audio-visual objects can be composed into 2-D and 3-D scenes using the Binary Format for Scenes (BIFS) [13], which also supports animating objects and their properties using the BIFS-Anim node. We refer readers to an accompanying article on BIFS for the implementation details of BIFS-Anim. Compression of still textures (images) for mapping onto 2-D or 3-D meshes is also covered in another accompanying article. In the following, we cover the specification and compression of face animation and 2-D mesh animation in Sections 2 and 3, respectively.
Face Animation
MPEG-4 foresees that talking heads will serve an important role in future customer service applications. For examp...