One of the coolest features of the iPhone X is its unlocking method: FaceID. This article explains how this technology works.
The image of the user's face is captured using an infrared camera, which is more resistant to changes in ambient light and colour. Using deep learning, the smartphone is able to recognise the user's face in minute detail, thereby "recognising" the owner every time they pick up their phone. Surprisingly, Apple has claimed that this method is even safer than TouchID, with an error rate of 1 in 1,000,000.
This article breaks down the principle of a FaceID-like algorithm using Keras. It also presents some final experiments, built with a Kinect depth camera.
"... the neural networks on which FaceID technology is based do more than just perform classification."
The first step is to analyse how FaceID works on the iPhone X. Apple's technical documentation can help us here. With TouchID, the user had to first register their fingerprints by tapping the sensor several times. After 10-15 different touches, registration was complete.
Similarly with FaceID: the user has to register their face. The process is quite simple: the user simply looks at the phone as they would on a daily basis and then slowly turns their head in a circle, thus registering their face in different poses. This completes the registration, and the phone is ready to be unlocked. This incredibly fast registration procedure can tell us a lot about the underlying learning algorithms. For example, the neural networks on which FaceID technology is based don't just perform classification.
Performing classification, for a neural network, means being able to predict whether the face it "sees" at a given moment is the user's face. It would have to use some training data to predict "true" or "false", but unlike many other deep learning applications, this approach will not work here.
Firstly, the network would have to be trained from scratch using new data from the user's face. This would require a lot of time and energy, as well as data from many different individuals (other than the user) to serve as negative examples. It would also prevent Apple from training a more complex network "offline", i.e. in its labs, and then shipping it already trained and ready to use in its phones. Instead, FaceID is said to be based on a Siamese convolutional neural network that is trained "offline" to map faces into a low-dimensional latent space, shaped to maximise the distance between faces of different people using a contrastive loss. The result is an architecture capable of one-shot learning, as mentioned in the keynote.
From faces to numbers
A Siamese neural network is basically two identical neural networks that share all their weights. This architecture can learn to discriminate distances between specific data such as images. The idea is that you pass pairs of data through the Siamese networks (or just pass the data in two different steps through the same network), the network maps each input into a low-dimensional feature space as an n-dimensional vector, and then you train the network so that data points from different classes end up as far apart as possible, while data points from the same class end up as close as possible.
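As a concrete illustration, here is the contrastive loss for a single pair of embeddings, sketched in plain NumPy. The margin value of 1.0 is an assumption made for illustration, not Apple's or any specific implementation's choice:

```python
import numpy as np

def contrastive_loss(d, same_person, margin=1.0):
    """Contrastive loss for one pair of embeddings at distance d.

    same_person = 1 pulls the pair together (loss = d^2);
    same_person = 0 pushes it apart until the distance exceeds
    `margin` (loss = max(margin - d, 0)^2).
    """
    if same_person:
        return d ** 2
    return max(margin - d, 0.0) ** 2

# A matched pair that is already close costs almost nothing...
print(contrastive_loss(0.1, same_person=1))
# ...while a mismatched pair that is too close is penalised heavily.
print(contrastive_loss(0.1, same_person=0))
# A mismatched pair beyond the margin costs nothing at all.
print(contrastive_loss(2.0, same_person=0))
```

Minimising this loss over many pairs is exactly what shapes the latent space described above.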
Eventually, the network learns to extract the most relevant features from the data and compress them into a compact vector, creating a meaningful representation. To understand this, imagine how you would describe dog breeds using a small vector so that similar dogs have similar vectors. You would probably use one number to encode the colour of the dog, another to encode its size, a third for the length of its coat, and so on. In this way, dogs that are similar to each other end up with similar vectors. A Siamese neural network can learn to do this for you, similar to what an autoencoder does.
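The dog analogy can be made concrete with a tiny sketch; the vectors below are invented purely for illustration:

```python
import numpy as np

# Hand-made "dog vectors": [colour code, size, coat length].
# These numbers are invented purely for this illustration.
labrador  = np.array([0.8, 0.7, 0.2])
golden    = np.array([0.9, 0.7, 0.6])
chihuahua = np.array([0.3, 0.1, 0.1])

def dist(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Two similar large breeds sit closer together in this space
# than either sits to a tiny breed.
print(dist(labrador, golden))
print(dist(labrador, chihuahua))
```

A Siamese network learns such a vector representation automatically, rather than having it hand-crafted feature by feature.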
Training such an architecture to recognise the most similar faces requires a large number of faces. With the right budget and processing power (as Apple has), you can also use harder examples to make the network robust to cases such as twins, masks, etc.
What is the final advantage of this approach? You get a plug-and-play model that can recognise different users without any additional training: it simply computes where the user's face lies on the latent face map formed during FaceID setup. In addition, FaceID is able to adapt to changes in your appearance, both sudden (e.g. glasses, hat, make-up) and gradual (growing hair). This is done by adding reference face embeddings, calculated from your new appearance, to the map.
FaceID implementation with Keras
As with all machine learning projects, the first thing we need is data. Creating your own dataset would take time and the cooperation of many people, so it can be difficult to manage. There is a website with a dataset of RGB-D faces. It consists of a series of RGB-D photos of people standing in different poses and making different facial expressions, as would happen with an iPhone X. To see the final implementation, here's a link to GitHub.
A convolutional network based on the SqueezeNet architecture is created. The network takes pairs of RGBD face images, i.e. 4-channel images, as input, and outputs the distance between the two embeddings. The network is trained with a contrastive loss that minimises the distance between images of the same person and maximises the distance between images of different people.
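A minimal sketch of such a Siamese setup in Keras might look as follows. The small CNN here stands in for the SqueezeNet-based network described above, and the 64×64 input resolution is an assumption; the key points are the single shared embedder applied to both inputs, the 4-channel (RGB + depth) input, and the distance output trained with a contrastive loss:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMG_SHAPE = (64, 64, 4)  # RGB + depth; the resolution is an assumption

def build_embedder():
    """Small stand-in CNN (the article uses a SqueezeNet-like network)."""
    inp = keras.Input(shape=IMG_SHAPE)
    x = layers.Conv2D(16, 3, strides=2, activation="relu")(inp)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(128)(x)  # 128-dimensional embedding
    return keras.Model(inp, out)

embedder = build_embedder()  # ONE network: both branches share its weights

img_a = keras.Input(shape=IMG_SHAPE)
img_b = keras.Input(shape=IMG_SHAPE)
emb_a, emb_b = embedder(img_a), embedder(img_b)

# Euclidean distance between the two embeddings.
dist = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True))
)([emb_a, emb_b])

siamese = keras.Model([img_a, img_b], dist)

def contrastive_loss(y_true, d, margin=1.0):
    """y_true = 1 for the same person, 0 otherwise."""
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(
        y_true * tf.square(d)
        + (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0))
    )

siamese.compile(optimizer="adam", loss=contrastive_loss)
```

After `siamese.fit(...)` on labelled pairs, only `embedder` is needed at inference time: each face is mapped to its 128-dimensional embedding once, and distances are computed directly between embeddings.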
Once trained, the network is able to convert faces into 128-dimensional embeddings, so that pictures of the same person are grouped together. This means that to unlock the device, the neural network simply calculates the distance between the photo captured during unlocking and the embeddings stored during the registration phase. If the distance is below a certain threshold, the device unlocks.
To visualise the 128-dimensional embedding space, the t-SNE algorithm is used. Each colour corresponds to a person: as you can see, the network has learned to group these photos quite tightly. An interesting graph also emerges when using the PCA algorithm to reduce the dimensionality of the data.
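For reference, such a t-SNE projection can be produced with scikit-learn. The snippet below uses synthetic stand-in embeddings (three fake "people", ten photos each) rather than real network outputs, and the perplexity value is an assumption suited to this tiny sample:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Fake 128-d embeddings: 3 "people", 10 photos each, clustered
# around a per-person centre (a stand-in for real network outputs).
centres = rng.normal(size=(3, 128)) * 5.0
embeddings = np.vstack(
    [c + rng.normal(scale=0.1, size=(10, 128)) for c in centres]
)

# Project to 2-D for plotting; perplexity must be < n_samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (30, 2)
```

Plotting `coords` with one colour per person would reproduce the kind of tightly grouped clusters described above.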
Now let's see how this model works by simulating a normal FaceID loop. The first thing we will do is register a face. Then we'll attempt unlocks both with the user's face and with other people's faces, which should not unlock the device. As mentioned earlier, the distance between a face the phone "sees" and the registered face must fall below a certain threshold.
Let's start with registration. Let's take a series of photos of the same person from the dataset and model the registration phase. The device now calculates the embeddings for each of these poses and stores them locally.
Let's see what happens if the same user tries to unlock the device. Different poses and facial expressions of the same user yield low distances, around 0.30 on average.

On the other hand, images from different people give an average distance of about 1.1.

Thus, a threshold of about 0.4 should be sufficient to prevent strangers from unlocking the phone.
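The whole registration-and-unlock loop can be sketched in plain NumPy, using the 0.4 threshold found above. The embeddings here are synthetic stand-ins for real network outputs (the owner's photos cluster tightly around one point, a stranger's do not):

```python
import numpy as np

THRESHOLD = 0.4  # from the experiment above

def register(pose_embeddings):
    """Store the embeddings computed during enrolment."""
    return [np.asarray(e) for e in pose_embeddings]

def try_unlock(candidate, stored, threshold=THRESHOLD):
    """Unlock if the candidate embedding is close enough to ANY stored pose."""
    distances = [float(np.linalg.norm(candidate - s)) for s in stored]
    return min(distances) < threshold

# Synthetic stand-in embeddings (real ones come from the trained network).
rng = np.random.default_rng(42)
owner_centre = rng.normal(size=128)
stored = register(owner_centre + rng.normal(scale=0.01, size=(5, 128)))

owner_photo = owner_centre + rng.normal(scale=0.01, size=128)
stranger_photo = rng.normal(size=128)

print(try_unlock(owner_photo, stored))     # True
print(try_unlock(stranger_photo, stored))  # False
```

Keeping the minimum distance over all stored poses is what lets different head positions from registration all count as a match.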
In this post, I showed how to implement a proof of concept for FaceID's unlock mechanism, based on face embeddings and Siamese convolutional networks. I hope the information was useful to you. You can find all the relevant Python code here.