Crack the secret of human facial part segmentation

In recent years, more and more people have been amusing themselves by taking selfies with various smartphone applications. Each of these apps offers a wide range of visual effects for different facial parts, e.g. bigger eyes, virtual animal ears, or a fantasy background. A Chinese selfie app named Meitu shows us what President Donald Trump would look like in a fairy tale.

You might still be surprised by these fascinating visual effects on your photos, or you might already take them for granted. However, researchers have spent decades teaching machines to identify facial parts. One important task is facial part segmentation, also called face parsing. The objective is to label each pixel with a class indicating which facial part the pixel belongs to.
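
To make the idea of per-pixel labeling concrete, here is a minimal sketch in plain Python. It assumes a model has already produced one score per class for every pixel; the class list and the scores are made up for illustration and do not come from any real dataset or network.

```python
# Hypothetical class list for illustration; real datasets define their own.
CLASSES = ["background", "skin", "eye", "mouth"]


def label_pixels(scores):
    """Turn per-pixel class scores into a label mask.

    scores[y][x] is a list with one score per class. The result has the
    same height and width and holds the index of the best-scoring class.
    """
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in scores
    ]


# A made-up 2x3 "image" of class scores (not real model output).
scores = [
    [[0.9, 0.1, 0.0, 0.0], [0.2, 0.7, 0.1, 0.0], [0.1, 0.2, 0.6, 0.1]],
    [[0.1, 0.8, 0.0, 0.1], [0.0, 0.3, 0.1, 0.6], [0.8, 0.1, 0.1, 0.0]],
]

mask = label_pixels(scores)
names = [[CLASSES[c] for c in row] for row in mask]
```

The resulting `mask` is the segmentation: one class index per pixel, at the same resolution as the input.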

With the development of deep learning, huge leaps have been made in scene parsing and semantic segmentation. A fundamental work by Jonathan Long et al. introduced the fully convolutional network (FCN) [1]. Many others have followed the same path and built further networks to achieve better results. In general, almost all of these networks can be described as a combination of an encoder and a decoder, as shown in the following figure.


The "encoder" part, labeled "convolutional network" in the figure, is designed to learn facial features. Its design follows the traditional convolutional pattern: as the layers go deeper, the number of feature map channels increases while the spatial size of the feature maps decreases. The "decoder" part, labeled "deconvolutional network" in the figure, is designed to be symmetric to the encoder. It generally reconstructs a mask of the same size as the input, indicating the label of each pixel in the photo.
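
The shape bookkeeping behind this symmetry can be sketched in a few lines of plain Python. This is only an illustration of how feature map sizes evolve, not a trainable network; the input size of 256, the 64 base channels, the depth of 4, and the halving/doubling rule are all assumptions chosen for the example.

```python
def encoder_shapes(input_size, base_channels, depth):
    """List (channels, height, width) after each encoder stage."""
    shapes = []
    size, channels = input_size, base_channels
    for _ in range(depth):
        size //= 2        # pooling or a strided convolution halves the size
        shapes.append((channels, size, size))
        channels *= 2     # the next stage learns more feature maps
    return shapes


def decoder_shapes(encoded, num_classes):
    """Mirror the encoder: each stage doubles the spatial size again."""
    shapes = []
    for channels, size, _ in reversed(encoded):
        shapes.append((channels, size * 2, size * 2))
    # The last stage emits one score map per class at input resolution.
    _, size, _ = shapes[-1]
    shapes[-1] = (num_classes, size, size)
    return shapes


# Assumed example values: 256x256 input, 64 base channels, 4 stages, 11 classes.
enc = encoder_shapes(input_size=256, base_channels=64, depth=4)
dec = decoder_shapes(enc, num_classes=11)

for name, shapes in (("encoder", enc), ("decoder", dec)):
    for channels, height, width in shapes:
        print(f"{name}: {channels} channels, {height}x{width}")
```

The final decoder shape has one channel per class and the same height and width as the input, which is what allows every pixel to receive a label.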

Like all deep learning applications, this one needs large amounts of data to train the network so that it learns to segment faces in different situations. Two public datasets, Helen [2] and LFW-PL [3], are available. The following figures show the different annotations of the two datasets. Each color corresponds to a class on the face. The upper figure shows the annotation of the Helen dataset, which provides 11 different classes.
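
Color-coded annotations like these boil down to a lookup from class index to color. The sketch below shows one way to render a label mask that way, assuming 11 class indices as in Helen; the palette is arbitrary and is not the official color scheme of either dataset.

```python
def make_palette(num_classes):
    """Build an arbitrary RGB color for each class index (illustrative only)."""
    return [
        ((c * 67) % 256, (c * 131) % 256, (c * 197) % 256)
        for c in range(num_classes)
    ]


def colorize(mask, palette):
    """Replace every class index in the mask with its RGB color."""
    return [[palette[label] for label in row] for row in mask]


palette = make_palette(11)

# A made-up 2x2 label mask using class indices 0..10.
mask = [[0, 1], [10, 0]]
colored = colorize(mask, palette)
```

Because the mapping is one color per class, the original label mask can be recovered from the colored image by inverting the palette.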

With all the data in hand, you can train a neural network and then segment your own face with your own model. You can even build your own selfie app with your preferred visual effects. Here is a small sample from our model. We hope you enjoy it.

[1] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).

[2] Smith, B. M., Zhang, L., Brandt, J., Lin, Z., & Yang, J. (2013). Exemplar-based face parsing. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Kae, A., Sohn, K., Lee, H., & Learned-Miller, E. (2013). Augmenting CRFs with Boltzmann machine shape priors for image labeling. In Computer Vision and Pattern Recognition (CVPR).

