ARKit and TrueDepth camera precision evaluation on iPhone X
Since the release of ARKit by Apple in June 2017, developers can create augmented reality applications for iOS. This framework allows us to build a wide range of augmented reality experiences such as world tracking, face tracking, scene understanding and measurements. Combined with the brand new TrueDepth front camera built into the iPhone X, ARKit can be used to do precise face tracking, detect more than 50 different facial expressions, and determine the topology, position and orientation of a face.
The TrueDepth camera uses a 3D sensor which can measure the depth of an object or a face by projecting and detecting thousands of infrared dots. When an ARFaceTrackingConfiguration session is launched with ARKit, the TrueDepth camera looks for faces and, when a face is detected, creates an ARFaceAnchor object whose mesh is composed of 1220 3D points.
Fig 1 : Face mesh created by ARKit based on the user's face
In order to know if we can use these technologies to improve the user experience in our augmented reality applications, we have done some precision evaluations and comparisons with our current face tracking framework.
Firstly, we took some measures on a dummy head named July and compared them to real measures. To do this, we selected distances between specific points on July, such as the gap between the eyes, the length of the nose and the width of the mouth. ARKit provides coordinates in meters, so it was quite easy for us to take measurements and compare them to the real ones.
For 1000 different frames, we measured each distance and saved the values; once the acquisition was done, we computed the mean, the median and the standard deviation of each set of measures.
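As an illustration, the per-frame computation can be sketched as follows. The vertex positions and the eye-corner pair below are made-up stand-ins for the mesh points we actually tracked; ARKit expresses coordinates in meters, so distances are converted to centimeters before computing the statistics.

```python
import math
from statistics import mean, median, pstdev

def distance_3d(p, q):
    """Euclidean distance between two 3D points (ARKit units are meters)."""
    return math.dist(p, q)

# Hypothetical per-frame samples: one (left eye corner, right eye corner)
# vertex pair per frame, in meters.
frames = [
    ((-0.0310, 0.0270, 0.0420), (0.0312, 0.0268, 0.0421)),
    ((-0.0309, 0.0271, 0.0419), (0.0311, 0.0269, 0.0420)),
    ((-0.0311, 0.0269, 0.0421), (0.0313, 0.0267, 0.0419)),
]

# Distance in centimeters for each frame, then the summary statistics
# reported in Fig 2 (mean, median, standard deviation).
samples_cm = [distance_3d(a, b) * 100 for a, b in frames]
stats = {
    "mean": mean(samples_cm),
    "median": median(samples_cm),
    "std": pstdev(samples_cm),
}
print(stats)
```

In the real acquisition the same loop simply runs over 1000 frames per distance instead of three.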
Fig 2 : Mean, median and standard deviation values for each set of measures (in cm) 
To do the acquisition, we set the iPhone on a tripod, put the dummy in front of it and made it move, changed its orientation, etc.
As ARKit computes and tracks the face 60 times per second, and as we changed the dummy's position during the acquisition, there could be some variations in the measures; that is why we calculated mean, median and standard deviation values. Despite the movements of the dummy, we obtained extremely precise results for each measured distance, with a maximum standard deviation of 0.8 millimeters for the mouth's width. These results show the robustness of ARKit's face tracking, which can be very precise for real-time measurements on the user's head, even when the head is moving.
Now that we know the embedded technologies in the iPhone X can do very precise face tracking, we need to know how close the measures are to real-world measures.
To do that, we simply took measures on July with a ruler and compared them to the mean value of each set of ARKit measures.
Fig 3 : Measured distances on July's face and differences with mean ARKit's measures (in cm) 
We can see here that ARKit's measures and the real ones approximately match. Only the face's height shows a noticeable gap (a 5 mm difference), but this could be due to the measurement being done directly on the face with a ruler, which is not the most precise method for face measurements.
Once this precision evaluation was done, we had to compare ARKit's face tracking efficiency with that of the Wisimage Landmarks Tracker (WLT).
To compare these two frameworks, we need the same initial conditions and the same types of values. While WLT provides 68 points in 2D coordinates describing the detected face, ARKit provides 1220 points in 3D coordinates. We first need to project ARKit's values into 2D coordinates and select only 68 points to match the values provided by WLT. The ARKit framework provides functions to transpose 3D coordinates into 2D coordinates. We arbitrarily selected ARKit points to approximately match the landmarks described by the WLT points.
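The projection and subset-selection step can be sketched like this. The pinhole model below stands in for ARKit's own projection helpers, and both the camera intrinsics and the chosen vertex indices are illustrative assumptions, not the values we actually used:

```python
import numpy as np

def project_to_2d(points_3d, focal, principal_point):
    """Pinhole projection of camera-space 3D points onto the image plane.

    Simplified stand-in for ARKit's own projection utilities; focal and
    principal_point are camera intrinsics in pixels.
    """
    pts = np.asarray(points_3d, dtype=float)
    # Perspective divide by depth (z), then scale/shift by the intrinsics.
    x = principal_point[0] + focal * pts[:, 0] / pts[:, 2]
    y = principal_point[1] + focal * pts[:, 1] / pts[:, 2]
    return np.stack([x, y], axis=1)

# Hypothetical indices of the 68 mesh vertices (out of 1220) chosen to
# approximate the WLT landmark layout; the real selection was made by hand.
SELECTED_INDICES = list(range(0, 1220, 18))[:68]

# Random stand-in for a face mesh in camera space (meters, z > 0).
mesh = np.random.default_rng(0).uniform([-0.08, -0.10, 0.25],
                                        [0.08, 0.10, 0.45],
                                        size=(1220, 3))
landmarks_2d = project_to_2d(mesh[SELECTED_INDICES],
                             focal=1500.0,
                             principal_point=(540.0, 960.0))
print(landmarks_2d.shape)
```

The output is a 68x2 array of pixel coordinates, directly comparable to the 68 WLT points.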
Fig 4 : ARKit points (in red) and WLT points (in green) 
To provide ground truth, we acquired a small database of 180 photos of nine different (real) faces, and we manually labeled them with 68 points. After that, for each photo, we compared the points provided by ARKit to the reference points, and did the same for the WLT points. To measure the difference between the reference points and the points provided by the face tracking systems, we compute the average Euclidean distance between each point and its reference, normalized by the diagonal of the enclosing bounding box. For a set of points $P=(p_1\ldots p_n), p_i \in \mathbb{R}^2$ and their corresponding ground truth $GT=(gt_1\ldots gt_n), gt_i \in \mathbb{R}^2$:
$$error(P, GT) = \frac{1}{d_{diag} \cdot n} \sum_{i=1}^n \lVert p_i - gt_i \rVert = \frac{1}{d_{diag} \cdot n} \sum_{i=1}^n \sqrt{(x_{p_i}-x_{gt_i})^2+(y_{p_i}-y_{gt_i})^2}$$
where $d_{diag}=\sqrt{w^2+h^2}$, with $(w,h)$ the size of the enclosing bounding box of $P$.
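A direct implementation of this error metric, as a sketch (the toy points below are illustrative, not real tracker output):

```python
import numpy as np

def normalized_error(points, ground_truth):
    """Mean Euclidean distance between predicted and reference points,
    normalized by the diagonal of the bounding box enclosing the points."""
    p = np.asarray(points, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    w, h = p.max(axis=0) - p.min(axis=0)      # bounding box size (w, h) of P
    d_diag = np.hypot(w, h)                   # sqrt(w^2 + h^2)
    dists = np.linalg.norm(p - gt, axis=1)    # per-point Euclidean error
    return dists.mean() / d_diag

# Toy example: predictions offset from the ground truth by a constant shift.
gt = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 50.0], [0.0, 50.0]])
pred = gt + [2.0, 0.0]
print(normalized_error(pred, gt))
```

Normalizing by the bounding-box diagonal makes the error comparable across faces of different sizes in the image.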
After computing this mean error value, we look at the cumulative error distribution for ARKit and WLT, based on 116 detected faces.
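A minimal sketch of how such a cumulative error distribution is computed; the per-face error values below are invented for illustration and are not our measured results:

```python
import numpy as np

def cumulative_error_distribution(errors, thresholds):
    """Fraction of faces whose normalized error is below each threshold;
    plotting this curve for each tracker gives a cumulative error chart."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])

# Hypothetical per-face normalized errors for the two trackers.
arkit_errors = [0.03, 0.05, 0.04, 0.08, 0.06]
wlt_errors = [0.02, 0.03, 0.03, 0.05, 0.04]

thresholds = [0.02, 0.04, 0.06, 0.08, 0.10]
print(cumulative_error_distribution(arkit_errors, thresholds))
print(cumulative_error_distribution(wlt_errors, thresholds))
```

A curve that rises faster means the tracker reaches a low error on a larger fraction of the faces.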

We can see in this chart that WLT seems more precise and closer to reality than ARKit. However, since we selected 68 ARKit points to match the WLT points, we did not measure the real precision of ARKit but only that of those points, which are not necessarily the best for each photo. Indeed, ARKit points are not fixed to the face: they move with the detected facial expressions in order to adapt the mesh to the user's face. That is why the points we selected to compare with the WLT points are not always the closest ones to the reference points. If, for each photo, we select only the ARKit points closest to the reference points, we get far more precise results for ARKit and less scattered values.

With this selection of points, ARKit looks closer to reality than WLT. After these comparisons, we can say that ARKit provides far more information than WLT and can be really precise. The fact that ARKit provides points in 3D coordinates, and that there are more of them than WLT points, could also help us improve the precision, detail and user experience of our augmented reality apps. The main drawback of this framework is that the mesh does not have a coherent position on the face: it is not possible to reliably associate a point of the mesh with a facial feature.
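The closest-point selection described above can be sketched as a nearest-neighbour match between each reference landmark and the projected mesh vertices (the coordinates below are toy values):

```python
import numpy as np

def closest_points(candidates, references):
    """For each reference landmark, pick the nearest candidate point
    (e.g. the nearest of ARKit's 1220 projected mesh vertices)."""
    cand = np.asarray(candidates, dtype=float)
    refs = np.asarray(references, dtype=float)
    # Pairwise distances (n_refs x n_candidates), then argmin per reference.
    diff = refs[:, None, :] - cand[None, :, :]
    idx = np.argmin(np.linalg.norm(diff, axis=2), axis=1)
    return cand[idx], idx

refs = np.array([[10.0, 10.0], [50.0, 40.0]])
cand = np.array([[0.0, 0.0], [11.0, 9.0], [49.0, 41.0], [80.0, 80.0]])
matched, idx = closest_points(cand, refs)
print(idx)
```

Note that this per-photo matching is exactly why the resulting mesh points cannot be tied to a fixed facial feature: a different vertex may win on each photo.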
