Live face recognition is a problem that automated security systems still face. Thanks to advances in convolutional neural networks and region-based CNNs (R-CNN), it is well established that with current technology we can choose supervised learning approaches such as FaceNet or YOLO for fast, live face recognition in a real environment.
To train a supervised model, we need datasets with target labels, and building them is still a tedious task. We need an efficient, automated approach to dataset generation that keeps the manual tagging effort required from the user to a minimum.
Proposed solution
Introduction: We propose a dataset generation pipeline that takes a video clip as its source, extracts all the faces, and groups them into small, accurate sets of images, each representing one individual. Each set can then be easily tagged by a human.
Technical details:
We use the opencv library to extract one frame per second from the input video clip. A one-second interval seems sufficient to cover the relevant data while keeping the number of frames to process small.
We use the face_recognition library (backed by dlib) to detect faces in the frames and align them for feature extraction.
We then extract human-observable features and group them using the DBSCAN clustering provided by scikit-learn.
Finally, we crop out all the faces, create labels, and group them into folders that users can adapt as datasets for their training use cases.
Implementation issues: To reach a wider audience, we implement the solution to run on the CPU rather than requiring an NVIDIA GPU, although an NVIDIA GPU can improve pipeline efficiency.
The CPU implementation of face embedding extraction is very slow (30+ seconds per image). To deal with this, we run the extraction as a parallel pipeline (bringing it down to ~13 seconds per image) and then combine the results for the subsequent clustering step. We use tqdm along with PyPiper to report progress, and we resize the frames extracted from the input video so that the pipeline runs smoothly.
Input: Footage.mp4 Output:
Required Python 3 modules:
os, cv2, numpy, tensorflow, json, re, shutil, time, pickle, pyPiper, tqdm, imutils, face_recognition, dlib, warnings, sklearn
The FaceClusteringLibrary.py file contains all the class definitions. Below are snippets from it, with an explanation of how they work.
The ResizeUtils class provides two functions. rescale_by_width takes "image" and "target_width" as input and scales the image up or down so that its width matches target_width. The height is calculated automatically so that the aspect ratio stays the same.
rescale_by_height does the same, but targets the height instead of the width.
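The aspect-ratio arithmetic described above can be sketched as follows. This is a minimal illustration, not the library's actual code: the helper names width_scaled_size and height_scaled_size are hypothetical, and the use of cv2.INTER_AREA is an assumption.

```python
def width_scaled_size(width, height, target_width):
    """Return (new_width, new_height) with the width set to target_width."""
    scale = target_width / float(width)
    return target_width, max(1, int(round(height * scale)))


def height_scaled_size(width, height, target_height):
    """Return (new_width, new_height) with the height set to target_height."""
    scale = target_height / float(height)
    return max(1, int(round(width * scale))), target_height


def rescale_by_width(image, target_width):
    """Resize an OpenCV image (NumPy array) to target_width, keeping the aspect ratio."""
    import cv2  # imported lazily so the size helpers work without OpenCV

    height, width = image.shape[:2]
    new_size = width_scaled_size(width, height, target_width)
    # cv2.resize expects the size as (width, height).
    # INTER_AREA is an assumed choice; it generally suits shrinking.
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)


def rescale_by_height(image, target_height):
    """Resize an OpenCV image (NumPy array) to target_height, keeping the aspect ratio."""
    import cv2

    height, width = image.shape[:2]
    new_size = height_scaled_size(width, height, target_height)
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
```

For example, a 1400 x 1000 image rescaled to a width of 700 ends up 700 x 500.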
The following is the definition of the FramesGenerator class. This class provides functions to extract JPG images by reading the video sequentially. A typical input video has a frame rate of about 30 frames per second, so 1 second of video contains 30 images. Even for a 2-minute video, the number of images to process is 2 * 60 * 30 = 3600. That is too many images, and running the full pipeline on them could take hours.
However, the faces and people in a scene are unlikely to change within one second. So generating 30 images for every second of a 2-minute video is cumbersome and repetitive. Instead, we can take just 1 snapshot per second. The FramesGenerator implementation saves only 1 image per second from the video clip.
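To make the saving concrete, here is a small sketch of the sampling arithmetic. The function name frames_to_keep and its seconds_per_shot parameter are hypothetical and only illustrate the idea of keeping one frame per second.

```python
def frames_to_keep(total_frames, fps, seconds_per_shot=1.0):
    """Return the indices of the frames kept when sampling one frame per interval."""
    # Number of source frames between two kept frames (at least 1).
    step = max(1, int(round(fps * seconds_per_shot)))
    return list(range(0, total_frames, step))


total = 2 * 60 * 30          # 2 minutes at 30 frames per second
kept = frames_to_keep(total, fps=30)
print(total, len(kept))      # 3600 120
```

Sampling once per second reduces the 3600 frames of the example to 120, a 30-fold reduction in the number of images the rest of the pipeline has to process.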
Because the saved images go through face_recognition / dlib processing for face extraction, we keep the height at no more than 500 and the width at no more than 700. This limit is enforced by the AutoResize function, which in turn calls rescale_by_width to shrink the image when a limit is exceeded, while still maintaining the aspect ratio.
The AutoResize function tries to limit the size of the given image. If the width is greater than 700, we scale the image down to a width of 700 and keep the aspect ratio. The other limit set here is that the height must not exceed 500.
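The two limits can be sketched as a pure size calculation. This is an assumption-laden illustration: the function name auto_resize_size is hypothetical, and the way the height limit is applied (a second proportional scale-down) is an assumption, since the text only states that the height must not exceed 500.

```python
def auto_resize_size(width, height, max_width=700, max_height=500):
    """Return the (width, height) an image would have after applying both limits."""
    if width > max_width:
        # Shrink so the width equals max_width, keeping the aspect ratio.
        scale = max_width / float(width)
        width, height = max_width, int(round(height * scale))
    if height > max_height:
        # Assumed behavior: shrink again so the height equals max_height.
        scale = max_height / float(height)
        width, height = int(round(width * scale)), max_height
    return width, height


print(auto_resize_size(1920, 1080))  # (700, 394)
print(auto_resize_size(640, 480))    # (640, 480)
```

Images that already fit within 700 x 500 are left unchanged, so small frames are never enlarged by this step.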
Below is a snippet of the GenerateFrames function. It queries the fps of the video to determine out of how many frames 1 image should be saved. We clear the output directory and start looping through the frames. Before saving any image, we pass it through AutoResize, which shrinks it if it exceeds the size limits.
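A frame extraction loop along these lines might look like the sketch below. It is not the original GenerateFrames code: the function names, the output file naming, and the optional auto_resize callback are all hypothetical choices made for illustration.

```python
import os
import shutil


def reset_output_dir(path):
    """Remove the output directory if it exists, then recreate it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)


def generate_frames(video_path, output_dir, auto_resize=None):
    """Save roughly one JPG per second of video and return the number saved."""
    import cv2  # imported lazily; requires the opencv-python package

    reset_output_dir(output_dir)
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps)))  # keep 1 frame out of every `step` frames

    index = saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video or a read error
            break
        if index % step == 0:
            if auto_resize is not None:
                frame = auto_resize(frame)  # hypothetical resize callback
            name = "frame_{:05d}.jpg".format(saved)
            cv2.imwrite(os.path.join(output_dir, name), frame)
            saved += 1
        index += 1

    capture.release()
    return saved
```

Calling generate_frames("Footage.mp4", "Frames") would then fill a "Frames" folder (an assumed name) with one image per second of footage.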
Below is a snippet for the FramesProvider class. It inherits from "Node", which can be used to build an image processing pipeline. We implement the "setup" and "run" functions. Any arguments defined in the "setup" function become parameters that the constructor expects when the object is created. Here we can pass the sourcePath parameter to the FramesProvider object. The "setup" function runs only once. The "run" function is then called repeatedly and keeps emitting data into the pipeline by calling the emit function, until the close function is called.
Here in "setup" we take sourcePath as an argument and loop through all the files in the given frames directory. Every file with the .jpg extension (these are generated by the FramesGenerator class) is added to "filesList".
On each call to the run function, a jpg image path from "filesList" is packed into an object with a unique "id" and an "imagePath" attribute and sent into the pipeline for processing.
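The file collection and packing logic can be sketched without the pipeline library. This is a standalone analogue written as a plain generator, not the PyPiper-based class itself; the function names and the choice to start ids at 1 are assumptions.

```python
import os


def list_jpg_files(source_path):
    """Collect the paths of all .jpg files in source_path, in sorted order."""
    return sorted(
        os.path.join(source_path, name)
        for name in os.listdir(source_path)
        if name.lower().endswith(".jpg")
    )


def frame_payloads(source_path):
    """Yield one {"id", "imagePath"} object per frame, as the provider would emit."""
    for index, image_path in enumerate(list_jpg_files(source_path), start=1):
        yield {"id": index, "imagePath": image_path}
```

In the pipeline version, the listing would happen once in "setup", and each "run" call would emit the next payload and close the node once the list is exhausted.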
Below is an implementation of the "FaceEncoder" class, which inherits from "Node" and can be plugged into the image processing pipeline. In the "setup" function, we accept the "detection_method" value that tells the "face_recognition / dlib" face detector which model to use. It can be a cnn or a hog based detector.
The "run" function unpacks the incoming data into "id" and "imagePath".
It then reads the image from "imagePath" and runs "face_locations", defined in the "face_recognition / dlib" library, to crop the aligned face image, which is our region of interest. An aligned face is a rectangular cropped image in which the eyes and lips are aligned to a specific location in the image (Note: the implementation may differ from other libraries such as opencv).
Next, we call the "face_encodings" function defined in "face_recognition / dlib" to extract the facial embeddings from each box. These embeddings are floating-point values that help pinpoint the location of features in the aligned face image.
We define the variable "d" as an array of boxes and their associated embeddings. We then wrap the "id" and the embeddings array, under the "encodings" key, into an object and send it into the image processing pipeline.
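The encoding step could be sketched roughly as below. It uses the face_recognition calls face_locations and face_encodings, but the function names encode_faces and pack_encodings and the dictionary keys "loc" and "encoding" are hypothetical; treat the whole block as a sketch, not the original class.

```python
def pack_encodings(frame_id, image_path, boxes, encodings):
    """Pair each face box with its embedding and wrap them with the frame id."""
    d = [
        {"imagePath": image_path, "loc": box, "encoding": enc}
        for box, enc in zip(boxes, encodings)
    ]
    return {"id": frame_id, "encodings": d}


def encode_faces(frame_id, image_path, detection_method="hog"):
    """Detect faces in one frame and compute an embedding for each of them."""
    import cv2                # imported lazily; requires opencv-python
    import face_recognition   # requires the face_recognition package and dlib

    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # face_recognition expects RGB

    # Each box is a (top, right, bottom, left) tuple.
    boxes = face_recognition.face_locations(rgb, model=detection_method)
    # One embedding vector per detected box.
    encodings = face_recognition.face_encodings(rgb, boxes)
    return pack_encodings(frame_id, image_path, boxes, encodings)
```

The "hog" detector is the lighter option for CPU-only runs, while "cnn" is typically more accurate but much slower without a GPU.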
Below is an implementation of DatastoreManager, which again inherits from "Node" and can be plugged into the image processing pipeline. The purpose of this class is to dump the "encodings" array as a pickle file, using the "id" parameter to give the pickle file a unique name. We want the pipeline to be multithreaded.
To benefit from multithreading, we need to split the work into independent asynchronous tasks and avoid any need for synchronization. So, for maximum performance, we let each thread in the pipeline write its data to its own separate file, without interfering with the operations of other threads.
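The one-file-per-id idea can be sketched with the standard library alone. The function names and the "encodings_<id>.pickle" file naming are assumptions made for illustration, as is the helper that merges the files again for the later clustering step.

```python
import os
import pickle


def dump_encodings(output_dir, payload):
    """Write one payload's encodings to its own pickle file, named after its id."""
    os.makedirs(output_dir, exist_ok=True)
    file_name = "encodings_{}.pickle".format(payload["id"])  # assumed naming
    path = os.path.join(output_dir, file_name)
    with open(path, "wb") as handle:
        pickle.dump(payload["encodings"], handle)
    return path


def load_all_encodings(output_dir):
    """Merge the per-frame pickle files back into a single list."""
    merged = []
    for name in sorted(os.listdir(output_dir)):
        if name.endswith(".pickle"):
            with open(os.path.join(output_dir, name), "rb") as handle:
                merged.extend(pickle.load(handle))
    return merged
```

Because every worker writes to a file that no other worker touches, no locks are needed; the results only come together when the files are read back in a single pass.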
If you thin