This talk was given at the ECVP meeting in Strasbourg, 13 September 1996.
What distinguishes the face of my friend Oliver from the face of my friend Stanley (slide)? What makes this face look male and that face look female?
A crucial step towards answering such questions is an appropriate description of the faces or, as we want to call it, an appropriate representation. For many purposes, it would be desirable to have at our disposal a representation that makes it possible to treat faces as objects in a linear vector space. Such a "face space" is the basis for developing metrics that correspond to perceptual differences in identity, gender, age, etc.
What should such a face space look like? What requirements should it fulfil? Here are three main issues that I consider important and useful:
1. Convexity: We expect the set of faces in that space to be convex. If two points in the space correspond to faces, any point on the line connecting them should also be a proper face.
2. Efficiency: The representation should provide an efficient coding scheme. Redundancies should be eliminated.
3. Separability: Important attributes such as identity, gender, age, facial expression, and orientation should be easily separable.
In this presentation, I want to contrast the simple and straightforward pixel-based representation with a representation that we developed and call the "correspondence-based representation", for reasons that will become clear later on. I will briefly show how the correspondence-based representation is derived from the image of a face, and it will become immediately evident that it fulfils the requirement of convexity. Although we have done a lot of work in that field (e.g. Vetter & Troje, 1995), I will not speak today about issues concerning the coding abilities of the correspondence-based representation. Instead, I will concentrate on the last requirement, the separability of important attributes, using sex classification as an example.
The simplest way to code the image of a face is simply to concatenate all the intensity values in the image, yielding a long vector with as many entries as the image contains pixels. We call this the "pixel-based representation" of the image of a human face (slide). In the past, several groups (Sirovich & Kirby, 1987; Turk & Pentland, 1991; O'Toole et al., 1993) worked with this kind of representation, performing linear statistics, for instance PCA, as a basis for identification and classification networks.
However, the pixel-based representation has disadvantages. The set of faces in such a space is not closed with respect to linear operations. This becomes clear if we look at the image defined by the mean of two face vectors as an example of a simple linear operation. The mean of two faces is not a single mean face but rather two superimposed faces (slide).
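To make the pixel-based representation concrete, here is a minimal sketch in Python/NumPy (the function names, and the 256x256 image size matching the stimuli described below, are illustrative):

    import numpy as np

    def pixel_based(image):
        # Flatten a 2D intensity image into one long vector with as many
        # entries as the image has pixels (65536 for a 256x256 image).
        return np.asarray(image, dtype=float).ravel()

    def pixel_mean(image_a, image_b):
        # The mean of two face vectors decodes to a double exposure:
        # two superimposed faces rather than a single mean face.
        v = 0.5 * (pixel_based(image_a) + pixel_based(image_b))
        return v.reshape(np.asarray(image_a).shape)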
It might be argued that this is only due to the misalignment of the two faces. This is indeed the problem, but solving it is anything but trivial. In the next slide, I aligned the two images with respect to a single point midway between the eyes.
If I use these aligned images as representations of our faces and try to synthesize the mean face, the result is somewhat better. The eyes match more or less, but all other parts, in particular the mouths, do not (slide). To match the mouths while keeping the eyes matched, a scaling operation is needed in addition to the translation. This scaling has to be done independently in the horizontal and vertical directions, leading not only to a change in size but also to a distortion of the face.
The next slide shows the representation after this improved alignment. The first part of the vector encodes the image resulting from the alignment process. The last four coefficients account for the translation and scaling operations needed for the alignment. Note that we did not store the translation and scaling factors themselves but their inverse values. The original image can thus be reconstructed from the vector representation by drawing the image encoded in the first part of the vector and then performing the translation and scaling operations given in the last part. The simple mean of the two sample faces using this latter representation is much better now. Not only are the eyes aligned, but the mouths are, at least roughly, aligned as well. However, a more careful look still reveals significant errors. Since the shapes of the two mouths were very different, there are still two superimposed mouths rather than one. The noses are not aligned, and other features, including the outline of the face, are still not matched.
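A sketch of how such a vector might be assembled and decoded, assuming scipy for the image transforms (the functions encode/decode and the parameter names tx, ty, sx, sy are illustrative, not taken from the original implementation):

    import numpy as np
    from scipy import ndimage

    def encode(aligned_image, tx, ty, sx, sy):
        # Store the aligned pixels followed by four coefficients holding
        # the INVERSE of the alignment transform.
        return np.concatenate([aligned_image.ravel(), [tx, ty, sx, sy]])

    def decode(vector, shape=(256, 256)):
        pixels, (tx, ty, sx, sy) = vector[:-4], vector[-4:]
        img = pixels.reshape(shape)
        # Scale independently in x and y (affine_transform maps output
        # coordinates back into the input image), then translate.
        img = ndimage.affine_transform(img, np.diag([1.0 / sy, 1.0 / sx]))
        return ndimage.shift(img, (ty, tx))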
Continuing along this path leads to our "correspondence-based representation". Instead of allowing only simple image operations such as translation or scaling, any image deformation can be used to align a sample face with a second face that serves as a common prototype. The first part of the vector again encodes the image resulting from aligning the face to the prototype. The second part of the vector accounts for the deformation that has to be applied to this image in order to recover the original image. The deformation is not encoded in a parameterized form but simply by storing, for each pixel i in the image, the displacement (dx_i, dy_i). The rule for decoding the image from this representation reads as follows: draw the image given by the first part of the vector and apply the deformation field given by the second part. We will refer to the information contained in the first part of the vector as the texture of the face and to the information contained in the second part as the shape of the face.
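In code, this decoding rule might look as follows (a sketch; backward warping with bilinear interpolation via scipy is my choice here, and the original encoding may have used a different warping convention):

    import numpy as np
    from scipy import ndimage

    def decode_face(texture, dx, dy):
        # Draw the texture and apply the per-pixel deformation field:
        # sample the texture at the displaced positions (backward warp).
        h, w = texture.shape
        ys, xs = np.mgrid[0:h, 0:w].astype(float)
        return ndimage.map_coordinates(texture, [ys - dy, xs - dx], order=1)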
The correspondence-based representation is completely smooth and convex in the sense that a linear combination of two or more faces cannot be identified as being synthetic. The mean of two faces results in a face with the mean texture applied to the mean shape (slide). In computer graphics this hybrid face is often referred to as the morph between the two faces.
In fact, any linear combination of existing textures yields a new valid texture, and any linear combination of existing shapes yields a new valid shape. Furthermore, any valid texture can be combined with any valid shape to yield a new face. The next slide shows some simple examples of this. We have already seen the first one, in which we combined the mean texture of the two example faces with the mean shape. We can also combine the entire texture of one face with the entire shape of the other, or vice versa.
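Using the illustrative decode_face from above, these combinations are a few lines (a sketch; alpha and beta are my names for the independent texture and shape weights):

    def combine(tex_a, tex_b, shape_a, shape_b, alpha, beta):
        # alpha = beta = 0.5 gives the morph; alpha = 1, beta = 0 puts
        # the texture of face A onto the shape of face B.
        texture = alpha * tex_a + (1.0 - alpha) * tex_b
        dx = beta * shape_a[0] + (1.0 - beta) * shape_b[0]
        dy = beta * shape_a[1] + (1.0 - beta) * shape_b[1]
        return decode_face(texture, dx, dy)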
A critical point of this approach is establishing pixel-by-pixel correspondence between the sample face and the common prototype. It is beyond the scope of this talk to go into details; here, I just want to say that we established correspondence by employing an adapted optical flow algorithm. For details, I refer to earlier publications (Vetter & Troje, 1995).
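As an illustrative stand-in only (our adapted algorithm is described in Vetter & Troje, 1995), a generic dense optical flow such as OpenCV's Farneback implementation sketches the idea of estimating a per-pixel displacement field between a face and the prototype:

    import cv2
    import numpy as np

    def correspondence_field(prototype, face):
        # Dense flow from the prototype to the sample face; flow[..., 0]
        # and flow[..., 1] are the per-pixel displacements dx and dy.
        flow = cv2.calcOpticalFlowFarneback(
            prototype.astype(np.uint8), face.astype(np.uint8),
            None, 0.5, 3, 15, 3, 5, 1.2, 0)
        return flow[..., 0], flow[..., 1]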
To demonstrate the convexity of the correspondence-based representation, to contrast it with the pixel-based representation, and to give at least a qualitative answer to the question of what the average difference between a man and a woman is, I will show you short movie sequences illustrating the faces along a straight line connecting the average man, made from the 100 male faces of our database, with the average woman, made from the 100 female faces in the database. Each sequence starts at a point three times as far away from the mean face as the real mean male is. It then passes the mean male and the mean of all faces, goes on to the realistic mean woman, and ends at a point three times as far away from the mean as the realistic mean woman (slide).
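The trajectory itself can be written down in a few lines (a sketch; faces are assumed to be row vectors in whichever representation the sequence uses):

    import numpy as np

    def gender_axis_face(male_faces, female_faces, t):
        # t = -1 gives the mean male, t = +1 the mean female, and
        # t = -3 / +3 the exaggerated endpoints of the movie sequences.
        mean_male = male_faces.mean(axis=0)
        mean_female = female_faces.mean(axis=0)
        center = 0.5 * (mean_male + mean_female)     # mean of all faces
        half_diff = 0.5 * (mean_female - mean_male)
        return center + t * half_diff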
The first sequence (MPEG 50 K) does this in the pixel-based space. The images are blurry. The exaggerated man is surrounded by a white ring that would be even brighter if one went further away from the overall mean. The white ring, as well as the white ghost mouth above the man's main mouth, reflects the way the difference between man and woman is expressed in this representation. Since female heads are on average smaller, the pixel intensities are dark in regions in which male faces still have relatively bright pixels. Exaggerating the difference results in the bright halo.
In the second sequence (MPEG 52 K), we move along the same line, from a triple superman to a triple superwoman, in the texture subspace of the correspondence-based representation. The shape remains the same throughout. The faces are perfectly aligned and the images are sharp.
The third sequence (MPEG 55 K) shows the change of the face along the same line within the shape subspace of the correspondence-based representation.
The last sequence (MPEG 58 K), finally, shows how the face changes along a line connecting the superman with the superwoman in the full correspondence-based space.
As an example of the ability to separate important attributes, I will show how a simple linear classifier performs in sex classification using the different representations as input.
The classifications were performed using 200 images of human faces. The images were rendered from three-dimensional head models collected with a 3D laser scanner (Cyberware). This allowed us to orient the head in space and define the viewpoint exactly before rendering the image. The alignment of the faces is important because it strongly affects classification performance, as you will see in a minute.
Our standard set of images was rendered after carefully aligning the head models in 3D by minimizing the sum-squared distances between corresponding locations of a set of selected features such as the pupils, the tip of the nose, and the corners of the mouth (slide). Images were black and white and had a size of 256x256 pixels with 8-bit intensity resolution.
The classification was performed in the following way: the set of 200 images of frontal views of human faces was randomly separated into two sets, each containing 50 male and 50 female faces. One set served as the training set and the other as the test set. We always ran two simulations with the roles of the two sets exchanged. All results shown are the averages of these two reciprocal simulations.
A single simulation consisted of a training phase and a test phase (slide). For the training, we first performed a Karhunen-Loeve transformation of the 100 faces in the training set to obtain the 99 principal components. Note that this is only a linear axis transformation of the original data. Then the training data were projected onto the principal components to obtain the coefficients in the principal component space. These coefficients were used as input to a linear system that was supposed to output +1 if the input face was male and -1 if it was female. We used the full set of coefficients, but also every low-dimensional subset consisting of only the first n coefficients, with n ranging from 1 to 99. The linear system was optimized according to a least-squares criterion, yielding a vector W containing a weight for each coefficient.
For the test, we projected the second set of faces onto the principal components obtained from the training set. The resulting coefficients were multiplied by the weights from training, and the sign of the output was compared to the sign of the desired output (+1 for males, -1 for females). An error was recorded if the signs differed.
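Assuming the faces are mean-centered before the Karhunen-Loeve transform, the whole procedure fits in a few lines (a sketch; variable names are illustrative):

    import numpy as np

    def train_and_test(train_X, train_y, test_X, test_y, n):
        # Karhunen-Loeve transform (PCA) of the 100 training faces.
        mean = train_X.mean(axis=0)
        _, _, Vt = np.linalg.svd(train_X - mean, full_matrices=False)
        pcs = Vt[:n]                        # first n principal components
        # Least-squares linear classifier on the PC coefficients,
        # target +1 for male faces and -1 for female faces.
        coeffs = (train_X - mean) @ pcs.T
        W, *_ = np.linalg.lstsq(coeffs, train_y, rcond=None)
        # Test: project the held-out faces and compare output signs.
        test_coeffs = (test_X - mean) @ pcs.T
        return np.mean(np.sign(test_coeffs @ W) != np.sign(test_y))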
The first series of simulations was performed using our standard set of images showing the carefully 3D-aligned faces. The results of a simulation on the pixel-based representation of the data are shown in the following slide. The curve plots the classification error against the number of coefficients used. If all 99 coefficients are used, an error rate of 3% is reached; using only 50 coefficients, the error rate is 4%.
This is already very good classification performance. Indeed, performance does not get much better using the correspondence-based representation. The next slide shows the same data as before, together with the curves for simulations in which we used only the texture part of the correspondence-based representation, only the shape part, or a combination of both. Using only the texture yields a slightly worse result than using the pixel-based representation. Using only the shape improves performance a bit, and using both texture and shape yields an error rate below 2% with 50 principal components. With more than 80 principal components, the generalization error increases again due to overtraining.
The advantage of the correspondence-based representation is not very impressive here. However, it becomes much more pronounced if we use images of faces that are slightly misaligned with respect to the optimal alignment used before. For the next series of simulations, we used images rendered from our head models after systematically misaligning the faces by adding Gaussian noise with a standard deviation of 3 degrees to the orientation of the head in space.
The next slide shows how this changes the performance. Performance for the correspondence-based representation is virtually unchanged. The error rate for the pixel-based representation, however, almost doubles.
For a third series of simulations, I used a set of images derived by misaligning the faces: Gaussian noise with a standard deviation of 5 pixels (corresponding to a shift of 5 mm for a realistically sized head) was added to the position of the face in the image plane.
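For concreteness, the two perturbation conditions amount to something like the following (a sketch; whether the 3-degree orientation noise was applied to one or to all three rotation axes is not specified above, so the axis count is an assumption):

    import numpy as np

    rng = np.random.default_rng()

    # Second series: orientation noise, applied to the 3D head before
    # rendering (std 3 degrees; one value per rotation axis assumed).
    orientation_noise_deg = rng.normal(0.0, 3.0, size=3)

    # Third series: translation noise in the image plane
    # (std 5 pixels, i.e. about 5 mm on a realistically sized head).
    translation_noise_px = rng.normal(0.0, 5.0, size=2)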
The results are shown in the next slide. Again, the curves showing the error rates for the correspondence-based representation do not change. For the pixel-based representation, however, the error rate is now higher than 12%.
I have shown how knowledge about the correspondence between faces can be used to establish a representation that separates texture and shape information.
This representation is convex and thus serves as a basis for a linear face space.
Sex classification using a linear classifier with the correspondence-based representation as input yields a generalization error of only 2%.
O'Toole, A.J., Abdi, H., Deffenbacher, K.A. and Valentin, D. (1993) Low-dimensional representation of faces in higher dimensions of the face space. Journal of the Optical Society of America A, 10:405-411.
Sirovich, L. and Kirby, M. (1987) Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4:519-524.
Turk, M. and Pentland, A. (1991) Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71-86.
Vetter, T. and Troje, N. (1995) Separation of texture and two-dimensional shape in images of human faces. In: Sagerer, G., Posch, S. and Kummert, F. (eds.), Mustererkennung 1995, Reihe Informatik aktuell, pp. 118-125, Springer Verlag.
For questions, suggestions, and any kind of comments, contact: niko@mpik-tueb.mpg.de