Background

Worldwide interest in using synthetic anatomic models of the human face to generate animation has been increasing in recent years. Multimedia applications that require a synthetic talking human face are in demand in several fields, such as education, culture and entertainment.

Facial Animation: a brief overview

Human facial expression has been the subject of scientific investigation for more than one hundred years. Computer-based facial expression modeling and animation is not a new endeavor: initial efforts in this area go back well over 25 years. The earliest work on computer-based facial representation was done in the early 1970s. Parke created the first three-dimensional facial animation in 1972. In 1973, Gillenson developed an interactive system to assemble and edit line-drawn facial images. In 1974, Parke proposed a parameterized three-dimensional facial model. The innovation of Parke's model was that the appearance of the face could be reshaped simply by specifying a small set of parameters instead of the complete model geometry.
The early 1980s saw the development of the first physically based, muscle-controlled face model by Platt and of techniques for facial caricatures by Brennan. In 1985, the short animated film "Tony de Peltrie" was a landmark for facial animation: for the first time, computer facial expression and speech animation were a fundamental part of telling the story. In the late 1980s, Waters proposed a new muscle-based model in which the animation proceeds through the dynamic simulation of deformable facial tissues, with embedded contractile muscles of facial expression rooted in a skull substructure with a hinged jaw. The same years saw the development of approaches to automatic speech synchronization by Lewis and by Hill.
The 1990s have seen increasing activity in the development of facial animation techniques. At the UC Santa Cruz Perceptual Science Laboratory, Cohen has developed a visual speech synthesizer, a computer-animated talking face that incorporates coarticulation (the interaction between nearby speech segments).
In recent years, the use of computer facial animation as a key storytelling component has been illustrated in the films "Toy Story" and "A Bug's Life", produced by Pixar, and "Antz", produced by PDI and DreamWorks.

MPEG-4 Facial Animation: an overview

If past trends are a valid indicator of future research, the next decade should be a very exciting time to be involved in computer facial animation, and the first step toward future facial animation systems has clearly been taken by MPEG. Although several good-looking facial animation systems have been developed in past years, they all suffered from a major limitation: each relied on a proprietary architecture and syntax to animate the synthetic face. In most cases, a system was developed to solve a specific application, without taking into account the limitations or drawbacks that its architecture could encounter outside the application it was designed for. MPEG-4, the new ISO/IEC international standard defined in 1998, tries to overcome these divisions in the world of facial animation by defining a standard way to deal with synthetic faces. More than two years of work have led to the definition of a set of animation parameters and semantic rules that can be used to drive any synthetic face compliant with the standard.
The first MPEG-4 concept is that of "feature points". A feature point describes a key point of the face (e.g. the corners of the lips, the tip of the nose, and so on). The feature points are used both to define the appearance of the face and to animate it. Almost every Facial Animation Parameter (FAP) defines a one-dimensional displacement of the feature point it is associated with. The exceptions are the FAP that describe rotations (such as those of the eyeballs and of the whole head) and those that can modify the whole appearance of the face (such as reproducing an expression of joy). The second concept is the "neutral face".
The neutral face represents the reference posture of a synthetic face: the mouth is closed, the gaze is directed perpendicular to the screen plane, the eyes are open and the eyelids are tangent to the iris. The concept of the neutral face is fundamental for two reasons: all the FAP describe displacements with respect to the neutral face, and the neutral face is also used to normalize the FAP values.
The normalization of FAP values is the third and last key concept. In order to define FAP that could be applied to, and extracted from, any synthetic or real face, MPEG-4 had to solve the non-trivial problem of normalizing the FAP. The proposed solution makes use of the Facial Animation Parameter Units (FAPU). A FAPU is derived from the distance between key facial points on the neutral face (e.g. the distance between the tip of the nose and the middle of the mouth, the distance between the eyes, etc.). Six FAPU have been defined. The value of a FAP is then expressed as a fraction of one of the FAPU. In this way, the amplitude of the movements described by the FAP is automatically adapted to the actual size and shape of the model that is animated or from which the FAP are extracted.
The 68 standard FAP are divided into two groups: high-level and low-level FAP. The low-level FAP have been briefly described above; the high-level FAP deserve a few more words. MPEG-4 has defined two high-level FAP, namely viseme and expression. Each of these two FAP can move several feature points at once, modifying the whole face posture. In particular, a viseme is defined as the "visual correlate of a phoneme" and defines the mouth posture when pronouncing the corresponding phoneme. The meaning of the expression FAP is straightforward and needs no further explanation. Viseme and expression can be combined to generate animation at an extremely low bit rate (a few hundred bits per second).
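
The FAPU normalization described above can be illustrated with a minimal Python sketch. It is not taken from the standard's text: the class and function names are hypothetical, only two of the six FAPU are computed, and the division of each neutral-face distance by 1024 is an assumption about how the fractions are defined, exposed here as a parameter.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float, float]


@dataclass
class NeutralFace:
    """Key points measured on a face in its neutral posture (hypothetical layout)."""
    nose_tip: Point
    mouth_middle: Point
    left_eye: Point
    right_eye: Point


def compute_fapu(face: NeutralFace, fraction: float = 1024.0) -> Dict[str, float]:
    """Derive two FAPU from neutral-face distances.

    Assumption: each unit is the measured distance divided by `fraction`.
    """
    return {
        "mouth_nose": math.dist(face.nose_tip, face.mouth_middle) / fraction,
        "eye_separation": math.dist(face.left_eye, face.right_eye) / fraction,
    }


def fap_to_displacement(fap_value: int, fapu: float) -> float:
    """Decoder side: turn a normalized FAP value into model-space units."""
    return fap_value * fapu


def displacement_to_fap(displacement: float, fapu: float) -> int:
    """Encoder side: normalize a measured displacement into a FAP value."""
    return round(displacement / fapu)


def scaled(face: NeutralFace, s: float) -> NeutralFace:
    """Return a copy of the face uniformly scaled by s (for the demo only)."""
    def scale(p: Point) -> Point:
        return (p[0] * s, p[1] * s, p[2] * s)
    return NeutralFace(scale(face.nose_tip), scale(face.mouth_middle),
                       scale(face.left_eye), scale(face.right_eye))


small_face = NeutralFace(
    nose_tip=(0.0, -20.0, 10.0),
    mouth_middle=(0.0, -52.0, 10.0),
    left_eye=(-32.0, 0.0, 0.0),
    right_eye=(32.0, 0.0, 0.0),
)
large_face = scaled(small_face, 2.0)

# The same FAP value produces a displacement proportional to each face's size.
for name, face in (("small", small_face), ("large", large_face)):
    unit = compute_fapu(face)["eye_separation"]
    print(name, fap_to_displacement(100, unit))
# small 6.25
# large 12.5
```

The same FAP value of 100 moves the feature point twice as far on the face that is twice as large, which is the adaptation to model size that the normalization is meant to provide.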
Unlike the low-level FAP, whose effect is clearly stated in the standard, MPEG-4 does not specify how to interpret the high-level FAP: any facial animation decoder must therefore apply its own proprietary criteria to map a very limited amount of information (the FAP value) into very complex face postures. Nor does MPEG-4 specify anything about the face model resident in the decoder. Any facial animation decoder must have at least one proprietary face model. Typically, the model is represented as a mesh of polygons connecting vertices in 3D space. The model can be of arbitrary complexity, from a few hundred to a few thousand polygons, and can be driven by very simple or extremely sophisticated animation rules.
Once again, the only requirement imposed by MPEG-4 for interpreting a low-level FAP is that the corresponding feature point is moved to the location specified by the FAP. There are no constraints on the movements of the model vertices in the neighborhood of the feature point. For instance, the movement of the mouth corner during a smile implies that part of the lips and of the cheek are also moved in a natural way. The collection of criteria for interpreting the FAP is proprietary to the decoder and can be referred to as the "animation rules". These rules are what actually distinguishes one facial animation decoder from another, since they determine the final quality of the animation.
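
To make the idea of an animation rule concrete, here is a minimal Python sketch of one possible rule. It is purely illustrative and not prescribed by MPEG-4: the function name, the cosine falloff and the influence radius are all assumptions. The feature point is moved by the full displacement, as the standard requires, while nearby vertices follow with a weight that decreases with their distance from it.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def apply_low_level_fap(
    vertices: List[Vec3],
    feature_index: int,
    direction: Vec3,
    displacement: float,
    radius: float,
) -> List[Vec3]:
    """Move one feature point along `direction` and drag its neighbors along.

    Hypothetical animation rule: a vertex at distance d from the feature
    point gets weight 0.5 * (1 + cos(pi * d / radius)) if d < radius,
    and 0 otherwise. `direction` is assumed to be a unit vector.
    """
    anchor = vertices[feature_index]
    moved = []
    for v in vertices:
        d = math.dist(v, anchor)
        if d >= radius:
            weight = 0.0  # outside the region of influence: vertex stays put
        else:
            weight = 0.5 * (1.0 + math.cos(math.pi * d / radius))
        moved.append((
            v[0] + weight * displacement * direction[0],
            v[1] + weight * displacement * direction[1],
            v[2] + weight * displacement * direction[2],
        ))
    return moved


# A toy "mesh": the feature point, a near neighbor and a far vertex.
mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
result = apply_low_level_fap(mesh, feature_index=0,
                             direction=(0.0, 1.0, 0.0),
                             displacement=2.0, radius=2.0)
for vertex in result:
    print(vertex)
```

In this toy mesh the feature point moves by the full 2.0 units, the neighbor halfway inside the radius moves by roughly half of that, and the far vertex does not move. A real decoder would replace this falloff with its own rules, which is exactly where implementations differ in quality.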

EPTAMEDIA and Facial Animation

EPTAMEDIA is a spin-off of DIST, the Dept. of Telecommunications, Computers and System Science of the University of Genoa (Italy). Through its participation in the MPEG meetings from 1996 to 1999, DIST has been involved in the definition of the MPEG-4 specification in the area of facial animation, making a significant contribution to the standardization activities. EPTAMEDIA recently signed an agreement with DIST for the commercial exploitation of products in the field of Facial Animation.