If you haven't looked at the demo yet, please check it out here. A lot of time was spent on this part of the project and I think it really demonstrates the rest of the project well.
Tested on the latest version of Chrome. Requires a device with a front-facing camera, such as a laptop or smartphone.
If you have questions, find any bugs, or if a part of the system isn't working properly, please feel free to contact me here or send an email to firstname.lastname@example.org and I'll work on it as soon as I can.
Facial recognition is increasingly becoming the leading method for user authentication on mobile devices. Apple now has Face ID on its latest iPhone, and many Android phones offer something similar. Outside of phones, however, facial recognition is rarely used for authentication. This kind of authentication could one day take the place of credit card numbers, usernames and passwords, and all of the other bits of personal information that we use to identify ourselves.
This project builds on David Sandberg's implementation of FaceNet. FaceNet is a recent development in computer vision and facial recognition from a trio of researchers at Google. Working with a massive dataset of more than 260 million face images, they reported over 99% accuracy on the Labeled Faces in the Wild benchmark, better than any previous model. A link to the original paper can be found here. What's special about their work is that it can not only classify a face but also cluster that face with the faces that look most similar. This works because each face is represented as a point in a 128-dimensional space, which makes it possible to find a face's nearest neighbors.
FaceNet is a one-shot model that learns to map a face to a compact Euclidean space. Once that space has been learned, standard machine learning algorithms like k-nearest neighbors and k-means can accomplish tasks like face recognition, verification, and clustering. One-shot learning can be implemented with a Siamese network, in which two identical neural networks take in two different inputs. If the inputs are similar, the distance between the two networks' outputs should be small, and so should the loss. If the inputs are not similar, the loss will be large. The benefit of this kind of model is that it doesn't need a huge dataset of images of a specific person in order to learn to recognize them. Instead, one-shot learning can work with just a few training images. The underlying model is still trained on millions of images, but there are many publicly available datasets and pretrained models that have done that part already. To train on a single person, the model uses a triplet loss.
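To make the idea of recognition in a Euclidean space concrete, here is a minimal sketch of a nearest-neighbor lookup. It is not taken from this project's code: the function names, the toy 3-dimensional vectors, and the one-embedding-per-person gallery are all assumptions made for illustration (real FaceNet embeddings have 128 dimensions).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_identity(query, gallery):
    """Return (name, distance) for the enrolled embedding closest to `query`.

    `gallery` is a hypothetical dict mapping a person's name to one
    reference embedding for that person.
    """
    return min(
        ((name, euclidean_distance(query, emb)) for name, emb in gallery.items()),
        key=lambda pair: pair[1],
    )


# Toy embeddings standing in for the output of the network.
gallery = {
    "alice": [0.0, 0.0, 1.0],
    "bob": [1.0, 0.0, 0.0],
}
name, distance = nearest_identity([0.0, 0.6, 0.8], gallery)
print(name, round(distance, 3))
```

With several reference embeddings per person, the same distance function could feed a k-nearest-neighbors vote instead of a single closest match.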
Triplet loss works on a triplet of images: an anchor, a second image of the same identity, and a third image of a different identity. In the 128-dimensional space where faces are represented, triplet loss adjusts the embeddings so that the distance between the two faces of the same identity is minimized and the distance from the anchor to the face of the different identity is maximized.
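As a rough illustration, the loss for one triplet can be written in a few lines of plain Python. This is a sketch rather than the training code used here: the function names and the 2-dimensional toy vectors are made up for the example, and the margin of 0.2 is an assumed value. In this common formulation the loss drops to zero once the different-identity face is farther from the anchor than the same-identity face by at least the margin.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss for a single (anchor, positive, negative) triplet.

    `positive` shares the anchor's identity and `negative` does not.
    `margin` is an assumed hyperparameter, not a value from this project.
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + margin, 0.0)


anchor = [0.0, 1.0]
same_person = [0.1, 0.9]
other_person = [1.0, 0.0]

# Already well separated: the loss is zero and nothing needs to change.
well_separated = triplet_loss(anchor, same_person, other_person)

# Roles swapped: the "positive" is far away and the "negative" is close,
# so the loss is large and training would pull the embeddings apart.
badly_separated = triplet_loss(anchor, other_person, same_person)
```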
There were a number of areas in this project that I experimented with when trying to make the best facial recognition system I could. Something that was important to me in this project was how quickly a face could be found and identified, and how well it could run on different hardware.
Detection and alignment of faces: There are already a number of options for detecting faces and their features. I worked with Dlib's facial landmark detector, but I found that it didn't perform as well in poor lighting and, more importantly for me, it couldn't detect a face from as far away as the option I went with, a multi-task CNN (MTCNN). More information on MTCNNs can be found here. The MTCNN was also very quick to align the faces, meaning that each face is transformed so that the eyes and bottom lip are in about the same location in every image.
This step must be done for every image before training and testing, so it was important that I had a system that did it quickly and accurately. During my experiments, I found that on an identical set of 48 images, Dlib was unable to correctly align 4 images whereas the MTCNN only failed on 1.
System optimization: Another main focus for this project was ensuring that the system runs well on older or less powerful hardware. While GPU acceleration gave the best results by far, I wanted to optimize the system so that it could also run on something like a Raspberry Pi or a similar single-board computer. I found that the bottleneck was grabbing a frame from the camera, which took a lot of time and was an I/O-blocking operation. I used multithreading to move frame capture to a separate thread so that reading a frame and running the calculations could happen at the same time. This increased the number of times per second that I could check for a face and identify it from around 4 to almost 14. This was a massive increase, and it made a huge difference in the smoothness of the system.
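The general pattern can be sketched as follows. This is not the project's actual code: `ThreadedFrameReader` and `fake_camera_read` are hypothetical names, and the fake camera simply sleeps to imitate a blocking read. In a real system the `read_frame` callable would wrap whatever camera API is in use.

```python
import threading
import time


class ThreadedFrameReader:
    """Reads frames in a background thread and keeps only the newest one."""

    def __init__(self, read_frame):
        self._read_frame = read_frame  # blocking callable that returns one frame
        self._lock = threading.Lock()
        self._stop_event = threading.Event()
        self._latest = None
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _loop(self):
        while not self._stop_event.is_set():
            frame = self._read_frame()  # blocks here, not in the main thread
            with self._lock:
                self._latest = frame

    def latest(self):
        """Return the most recent frame, or None if none has arrived yet."""
        with self._lock:
            return self._latest

    def stop(self):
        self._stop_event.set()
        self._thread.join()


def fake_camera_read():
    """Stand-in for a blocking camera read (assumption for this example)."""
    time.sleep(0.01)
    return time.monotonic()


reader = ThreadedFrameReader(fake_camera_read).start()
time.sleep(0.2)          # the main thread is free to run detection here
frame = reader.latest()  # returns immediately with the newest frame
reader.stop()
```

Keeping only the newest frame means the recognition loop never works through a backlog of stale frames, at the cost of skipping some frames entirely.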
Certainty threshold: The model outputs a certainty for each face that it's given, based on the distance between that face and the faces the model has been trained on. During my experiments, I worked on finding the threshold that worked best for my dataset. With a threshold that was too high, there were too many cases where the model wouldn't output an identity even though the face was one it had been trained on. Conversely, a threshold that was too low would output an identity for a face the model hadn't been trained on. I found that the sweet spot was around 0.75. I think this number could go even higher because of the relatively small number of classes in this project, but so far it has done well enough.
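The thresholding step itself is simple and might look something like the sketch below. The `identify` function and the dictionary of per-identity certainties are assumptions for illustration; how the certainties are computed from the distances is not shown here.

```python
def identify(certainties, threshold=0.75):
    """Return the most likely identity, or None if the model isn't sure enough.

    `certainties` is a hypothetical dict mapping each known identity to the
    certainty the model assigned to it for one face.
    """
    if not certainties:
        return None
    best = max(certainties, key=certainties.get)
    return best if certainties[best] >= threshold else None


known_face = identify({"alice": 0.91, "bob": 0.06})    # confident match
unknown_face = identify({"alice": 0.52, "bob": 0.41})  # below the threshold
```

Returning None for low-certainty faces is what lets the system treat them as unknown instead of forcing a match to the closest trained identity.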
There are a number of things that still need to be done for this implementation of facial recognition to be ready for the real world. Technologies like Face ID and Windows Hello use different hardware than this project does. They take advantage of additional sensors to create a 3D map of the user's face. This is important because the current implementation of this project can easily be fooled by a simple photo of the person you wish to impersonate. This is an obvious flaw that could be fixed in a couple of different ways. One idea is to require some sort of facial movement before the user is authenticated. A blink or a smile would be enough to show the system that what it's recognizing is not just an image.