### Face landmark tracking evaluation

I recently came across a quite original paper by Li et al. [1] on face landmark detection. The method differs from others in the field in that it relies on classification, with one-exemplar classes, which is quite surprising. But this blog post is not about this method (maybe later...). What got me started, however, are the results: I thought it was time to compare the Wisimage Landmark Tracker (WLT) to the state of the art.

Face landmark detection is a very important topic in any face-related application, and as such has attracted considerable interest in the computer vision community. The reason is that face landmark detection is usually used as a face alignment method, registering the face in a frontal position so that more specific methods can be applied to this normalized face. Examples of these methods are face recognition, gender recognition, and emotion/expression classification, so quite important and popular applications. Of course, the more precise this first registration step is, the more precise the final classification/recognition is. You can then understand how important it is to get this first step right. And since this step is shared by almost all face applications, well, you get a very hot topic indeed.
Now, not only can they be used for face alignment, but face landmarks are also of interest in their own right. Think of augmented reality applications that modify your face or add funny accessories, like those Snapchat filters that you probably know. If you don't, ask a teenager around.
At Wisimage, we are interested in slightly more serious stuff than fluffy bunny ears on your co-worker's face. Realistic-looking virtual make-up is what we are after. What is the difference, you may ask? Well, the key difference is precision. It is surely not very damaging if your bunny ears are a little off. However, if your lipstick is not spot-on, it will look smeared and kind of disgusting, not to mention seriously breaking the immersion, since it will look fake.

 Fig.1 Detected landmarks on a test video of the 300VW dataset

Thanks to the popularity of landmark detection, huge test datasets exist. One of the more prominent datasets, and one especially relevant to real-time augmented reality applications, is the 300VW dataset by Shen et al. [3]. An example of detection by the Wisimage Landmark Tracker (WLT) is given in Fig.1, on the first image of one of the test videos of the 300VW benchmark. Looks good, right? But is it really? And how good is it? Well, in Fig.2 I am also plotting the ground truth (in red), and as you can see there are quite a few differences.

 Fig.2 detection (green) and ground truth (red)

So, following the methodology proposed in the 300VW benchmark [2], we computed the cumulative error curves for the 3 video categories, ranging from the easiest (1: naturalistic and well-lit) to the hardest (3: unconstrained). The error between the detection $x$ and the ground truth $gt$ in a single frame can be computed using the diagonally normalized point-to-point error $err(x, gt) = \frac{1}{d_{diag}} \|x - gt\|$, where $d_{diag} = \sqrt{w^2 + h^2}$ and $(w, h)$ is the size of the enclosing bounding box. This error is then averaged over all the frames of the category.
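To make the formula concrete, here is a minimal sketch of the per-frame error in plain Python. The function name is mine, and two points are assumptions rather than something stated above: the point-to-point distance is averaged over the landmarks of the frame, and the enclosing bounding box is taken as the tight box around the ground-truth landmarks.

```python
import math


def diag_normalized_error(pred, gt):
    """Diagonally normalized point-to-point error for a single frame.

    pred, gt: sequences of (x, y) landmark coordinates, in the same order.
    Assumption: the per-landmark distances are averaged, and the enclosing
    bounding box is the tight box around the ground-truth landmarks.
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")

    # Size (w, h) of the box enclosing the ground-truth landmarks.
    xs = [p[0] for p in gt]
    ys = [p[1] for p in gt]
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)

    # d_diag = sqrt(w^2 + h^2)
    d_diag = math.hypot(w, h)
    if d_diag == 0:
        raise ValueError("degenerate bounding box")

    # Mean Euclidean distance between predicted and ground-truth points.
    mean_dist = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
    return mean_dist / d_diag
```

Dividing by the diagonal makes the error independent of the face size in pixels, which is what lets frames with small and large faces share one scale.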

There are several normalization schemes for computing this error. This is a bit unfortunate, since it makes comparison with other methods harder. However, Chrysos et al. [2] have done a very thorough job of comparing standard methods in different situations with this measure, and it is the recommended normalization for 300VW evaluations, so we stick to it. Results are presented as Cumulative Error Distribution (CED) curves. These curves are obtained by computing the proportion of images whose error is under a certain threshold, and varying this threshold from 0 to 0.08.
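As an illustration, a CED curve can be computed from a list of per-frame errors along these lines. This is only a sketch: the function name and the number of sampled thresholds are my own choices, and I assume here that a frame with no detection is given an infinite error so that it never counts as being under the threshold.

```python
def ced_curve(errors, max_threshold=0.08, steps=81):
    """Cumulative Error Distribution: proportion of frames under each threshold.

    errors: per-frame normalized errors (use float("inf") for a missed frame;
            that convention is an assumption of this sketch).
    Returns (thresholds, proportions), two lists of length `steps`.
    """
    if not errors:
        raise ValueError("errors must not be empty")
    if steps < 2:
        raise ValueError("steps must be at least 2")

    # Thresholds evenly spaced from 0 to max_threshold.
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]

    # For each threshold, count the frames whose error does not exceed it.
    n = len(errors)
    proportions = [sum(e <= t for e in errors) / n for t in thresholds]
    return thresholds, proportions


# Four frames, one of which is a missed detection.
thresholds, proportions = ced_curve([0.005, 0.015, 0.045, float("inf")], steps=9)
```

With this convention the curve can never reach 1.0 when some frames are missed, which makes face detection failures directly visible in the plot.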

 Fig.3 CED Curves for category 1 (easiest)
Fig.3 shows the results on Category 1, the easiest. I borrowed the same presentation and results as in [1], took the best and worst methods (Yang and Uricar), and then plotted our results. Note that results are computed on all frames, using a detection setup, as stated in [1].
To remove the effect of face detection failures, we built the 300VWCropped dataset, where faces are cropped using a bounding box computed from the ground truth label. Fig.4 is an example of such a cropped image.
 Fig.4 Example of cropped image in 300VWCropped
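For readers who want to reproduce this kind of cropping, here is a rough sketch of how a crop box can be derived from the ground-truth landmarks. It is not necessarily how 300VWCropped was built: the function names and the `margin` parameter are hypothetical, and the margin value in particular is an assumption.

```python
import math


def crop_box_from_landmarks(landmarks, image_size, margin=0.2):
    """Return (left, top, right, bottom) around the landmarks, clamped to the image.

    landmarks: sequence of (x, y) ground-truth points.
    image_size: (width, height) of the original image.
    margin: fraction of the box size added on each side (hypothetical value).
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    # Enlarge the tight box so the crop keeps some context around the face.
    pad_x = margin * (x1 - x0)
    pad_y = margin * (y1 - y0)

    img_w, img_h = image_size
    left = max(0, math.floor(x0 - pad_x))
    top = max(0, math.floor(y0 - pad_y))
    right = min(img_w, math.ceil(x1 + pad_x))
    bottom = min(img_h, math.ceil(y1 + pad_y))
    return left, top, right, bottom


def shift_landmarks(landmarks, box):
    """Express landmarks in the coordinate frame of the cropped image."""
    left, top = box[0], box[1]
    return [(x - left, y - top) for x, y in landmarks]
```

The ground-truth landmarks have to be shifted by the crop origin as well, otherwise the error computed on the cropped image would be meaningless.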

Our own results come in 3 flavors:
• WLTCropped-GTBox: No face detection, the ground truth bounding box is used as initialization.
• WLTCropped-OpenCV: The OpenCV face detector is used on the cropped image.
• WLT-OpenCV: The OpenCV face detector is used on the original image. Tracking with reinitialisation is then used (Experiment 4 of Chrysos et al. [2]).
Of course, WLTCropped-GTBox is a bit of a cheat since the ground truth is used for initialisation (in WLT's case, it is used to place the mean shape at a rather comfortable position). However, this is still an interesting measure to check the robustness against pose and expression. WLTCropped-OpenCV introduces a bit of noise in the bounding box position, so results are a bit worse, but not by much.
It is rather interesting to see how dramatic the difference is with WLT-OpenCV. This shows that the difficulty in 300VW comes from face localization rather than from extreme poses, expressions or appearances.

Fig.5 gives the results on the hardest category. As expected, WLT-OpenCV drops even lower because of the high number of faces missed by the OpenCV face detector.
 Fig.5 CED Curves for category 3 (hardest)

We are rather satisfied with the performance of the WLT, especially since it runs smoothly on low-power mobile platforms. However, in line with the study of Chrysos et al. [2] and other studies that have shown the sensitivity of cascaded regression methods to initialization, we know that we have to be extra careful with face initialization and the choice of the face detector.

In a future blog post, I will address the shortcomings of this type of evaluation, especially regarding virtual make-up.

References

[1] Mengtian Li, Laszlo Jeni, and Deva Kannan Ramanan. Brute-force facial landmark analysis with a 140,000-way classifier. In AAAI 2018, February 2018.
[2] Grigorios G. Chrysos, Epameinondas Antonakos, Patrick Snape, Akshay Asthana, and Stefanos Zafeiriou. A comprehensive performance evaluation of deformable face tracking “in-the-wild”. International Journal of Computer Vision, 126(2):198–232, Apr. 2018.
[3] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In IEEE International Conference on Computer Vision Workshops (ICCVW), 2015.