
A block diagram of the Facial Animation Engine
The core of the Facial Animation Engine (see fig. 2) is the Animation Module, which converts the semantic information associated with a 3D model into animation rules. Each animation rule is a function of a Facial Animation Parameter (FAP) defined by MPEG-4 and is created according to the standard's specifications (Simple FA Profile).

The virtual faces are represented as triangular meshes, which can be created with any 3D authoring tool. A semantic description is then associated with the geometric description and provided to the animation software, which uses it to create the animation rules. Almost any model, either human or cartoon-like, can be animated. The semantic description is very high-level, since it contains just indices of vertices. This means that animating a virtual face is as simple as labeling vertices of a 3D model, and requires almost no expertise in facial animation.
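
The idea of deriving animation rules from labeled vertex indices can be illustrated with a minimal sketch. It is not the FAE's actual implementation: the `AnimationRule` class, the `build_rules` helper, the linear displacement model, and the `open_jaw` label with its vertex indices are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnimationRule:
    """Hypothetical rule: shifts labeled vertices linearly with one FAP value."""
    vertex_indices: list   # indices into the mesh vertex list
    direction: tuple       # (x, y, z) direction of motion, assumed constant
    scale: float           # model-space displacement per FAP unit (assumed)

    def apply(self, vertices, fap_value):
        # Displacement is a plain linear function of the FAP value.
        dx, dy, dz = (c * self.scale * fap_value for c in self.direction)
        moved = list(vertices)  # copy so the neutral face stays untouched
        for i in self.vertex_indices:
            x, y, z = moved[i]
            moved[i] = (x + dx, y + dy, z + dz)
        return moved


def build_rules(semantic_description, directions, scale):
    """Create one rule per labeled vertex group in the semantic description."""
    return {
        label: AnimationRule(indices, directions[label], scale)
        for label, indices in semantic_description.items()
    }


# Toy mesh with three vertices; the labels and indices are illustrative only.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
semantic_description = {"open_jaw": [1, 2]}
rules = build_rules(semantic_description, {"open_jaw": (0.0, -1.0, 0.0)}, scale=0.5)
animated = rules["open_jaw"].apply(neutral, fap_value=2.0)
```

In this sketch the only model-specific input is the dictionary of vertex indices, which mirrors the point that labeling vertices is enough to make a mesh animatable.
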
MPEG-4 codec, audio codec and the Facial Animation Engine
Though the core of the FAE can be fully compliant with the MPEG-4 Simple FA Profile specifications, the software is provided with a proprietary Animation Parameter Decoder. On Windows (TM) platforms the FAE uses the TrueSpeech 8.5 (TM) audio decoder for voice compression at 8 kbps. On other platforms, efficient audio encoding techniques can be supported but are not provided. Contact us if you are interested in using the core of the FAE for developing MPEG-4 compliant products. Through a collaboration between EPTAMEDIA and another Italian company, bSoft, we can provide support and products covering several aspects of the MPEG-4 standard.
Creation of Animation Sequences
There are several ways to create animation sequences; Figure 3 shows three different approaches.

In the first case (top) a Text-to-Speech synthesizer (TTS) is used to create
synthetic audio from plain text. Together with the audio samples, the TTS
provides the sequence of pronounced phonemes and their duration. This
information is used by a proprietary phoneme-to-FAP converter to infer the
mouth movements corresponding to the pronounced text.
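
As a rough illustration of how a phoneme sequence with durations can drive mouth movements, the sketch below expands (phoneme, duration) pairs into one mouth-opening value per animation frame. The actual phoneme-to-FAP converter is proprietary and is not described here; the `MOUTH_TARGETS` table, its values, the `phonemes_to_frames` function, and the 25 fps frame rate are all hypothetical.

```python
# Hypothetical phoneme -> mouth-opening target table (values are made up).
MOUTH_TARGETS = {"a": 1.0, "o": 0.7, "m": 0.0, "sil": 0.0}


def phonemes_to_frames(phonemes, fps=25):
    """Expand (phoneme, duration_ms) pairs into one mouth value per frame."""
    frames = []
    elapsed_ms = 0
    for symbol, duration_ms in phonemes:
        elapsed_ms += duration_ms
        # Index of the first frame that starts after this phoneme has ended.
        end_frame = elapsed_ms * fps // 1000
        target = MOUTH_TARGETS.get(symbol, 0.0)  # unknown phonemes: closed mouth
        while len(frames) < end_frame:
            frames.append(target)
    return frames


# Phoneme timings as a TTS engine might report them (illustrative data).
frames = phonemes_to_frames([("m", 80), ("a", 200), ("sil", 120)])
```

A real converter would also smooth the transitions between consecutive phonemes; this sketch holds each target constant for the duration of its phoneme.
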
In the second case (middle), natural audio is used as input. By processing the audio
with a phoneme recognizer, the sequence of pronounced phonemes can be obtained and, from
that, the mouth movements.
The third case (bottom) makes use of dedicated tracking hardware, capable
of capturing the audio and facial movements of a real actor. In this case, the
captured facial movements are encoded into animation parameters and then used
to drive the virtual face. The quality of this last approach is higher than
that of the other solutions; in addition, facial expressions are also
captured and synthesized, unlike the former cases, where only
mouth movements can be synthesized.
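
Encoding a captured movement into an animation parameter can be sketched as a normalization step. MPEG-4 expresses most FAPs in face-relative units (FAPUs) measured on the neutral face; the example below assumes a unit of 1/1024 of the mouth-nose separation and a sign convention where downward motion is positive. The `displacement_to_fap` function and the sample measurements are hypothetical and do not describe the tracking hardware's actual output format.

```python
def displacement_to_fap(neutral_y, tracked_y, mouth_nose_separation):
    """Convert a tracked vertical displacement into a face-relative integer value."""
    # Assumed unit: 1/1024 of the neutral-face mouth-nose separation.
    fapu = mouth_nose_separation / 1024.0
    # Assumed sign convention: downward motion (smaller y) gives a positive value.
    return round((neutral_y - tracked_y) / fapu)


# Illustrative measurements: a marker that moved 2 units down on a face
# whose mouth-nose separation is 32 units.
fap_value = displacement_to_fap(neutral_y=10.0, tracked_y=8.0, mouth_nose_separation=32.0)
```

Because the value is relative to the actor's own face proportions, the same parameter can drive a virtual face with different proportions.
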