ML | Unsupervised face clustering pipeline


Live face recognition is a problem that automated security units still face. Thanks to advances in convolutional neural networks and specialized architectures such as Region-CNN, it is already established that, with current technologies, we can choose supervised learning options such as FaceNet or YOLO for fast, live face recognition in real-world environments.
To train a supervised model, we need datasets with target labels, which is still a tedious task. We need an efficient, automated dataset-generation solution that minimizes the tagging effort required from the user.

Proposed solution —

Introduction. We propose a dataset-generation pipeline that takes a video clip as input, extracts all faces, and groups them into distinct, accurate sets of images, each representing a single individual. Each set can then easily be labeled by a human.

Technical details: We are going to use the opencv library to extract one frame per second from the input video clip. One frame per second seems sufficient to cover the relevant data while keeping the number of frames to process limited.
We will use the face_recognition library (backed by dlib) to extract faces from the frames and align them for feature extraction.
We will then extract human-observable features and group them using the DBSCAN clustering provided by scikit-learn.
Finally, we will crop out all the faces, create labels, and group them into folders so users can adapt them as a dataset for their own training use cases.
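As a rough sketch of the clustering step, grouping 128-dimensional vectors with scikit-learn's DBSCAN might look like this (the embedding values below are synthetic stand-ins for real face embeddings, and eps/min_samples are illustrative, not tuned values):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-ins for face embeddings: two tight groups
# of 128-dimensional vectors, one group per "person"
rng = np.random.default_rng(42)
person_a = rng.normal(loc=0.0, scale=0.01, size=(5, 128))
person_b = rng.normal(loc=1.0, scale=0.01, size=(5, 128))
encodings = np.vstack([person_a, person_b])

# Cluster the embeddings; points that fit no cluster get label -1
clt = DBSCAN(eps=0.5, min_samples=3, metric="euclidean")
clt.fit(encodings)

# Every label other than -1 (noise) represents one individual
labels = np.unique(clt.labels_)
num_people = len(np.where(labels > -1)[0])
print("Number of unique faces found:", num_people)
```

With real face embeddings, eps would need tuning against the distance scale of the dlib embeddings.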

Implementation notes: To reach a wider audience, we implement the solution to run on the CPU rather than on an NVIDIA GPU. Using an NVIDIA GPU can improve pipeline efficiency.
The CPU implementation of face-embedding extraction is very slow (30+ seconds per image). To deal with this, we run it with parallel, pipelined execution (bringing it down to ~13 seconds per image) and then merge the results for the subsequent clustering task. We use tqdm along with PyPiper to report progress, and we resize the frames extracted from the input video for smooth pipeline execution.
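The parallel-execution idea can be sketched with Python's standard concurrent.futures; the extract_embedding function below is a hypothetical stand-in for the slow per-image dlib call, not part of the pipeline itself:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the expensive per-image
# face-embedding extraction (30+ seconds on a CPU)
def extract_embedding(image_path):
    return {"imagePath": image_path, "encoding": [0.0] * 128}

image_paths = ["frame_%d.jpg" % i for i in range(1, 6)]

# Process several frames concurrently, then merge the
# per-frame results for the downstream clustering step
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_embedding, image_paths))

print("Collected", len(results), "embeddings")
```

PyPiper wires the same idea into named pipeline nodes, as shown in the FramesProvider and FaceEncoder snippets later in this article.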

Input: Footage.mp4  Output: folders of cropped face images, one per individual

Required Python3 modules:
os, cv2, numpy, tensorflow, json, re, shutil, time, pickle, pyPiper, tqdm, imutils, face_recognition, dlib, warnings, sklearn

Snippets section:
Below are snippets from the file containing all the class definitions, with an explanation of how they work.

The ResizeUtils class provides the rescale_by_height and rescale_by_width functions.
rescale_by_width is a function that takes "image" and "target_width" as input. It scales the image up or down in width to fit target_width. The height is calculated automatically so the aspect ratio stays the same. rescale_by_height works the same way, but it targets the height instead of the width.

'''
ResizeUtils provides resizing functionality,
keeping the aspect ratio unchanged.
Credits: AndyP on StackOverflow
'''

class ResizeUtils:

    # Given the target height, adjust the image
    # by calculating the width and resizing
    def rescale_by_height(self, image, target_height,
                          method=cv2.INTER_LANCZOS4):
        # Scale `image` to `target_height`
        # (maintain aspect ratio)
        w = int(round(target_height * image.shape[1] / image.shape[0]))
        return cv2.resize(image, (w, target_height),
                          interpolation=method)

    # Given the target width, adjust the image
    # by calculating the height and resizing
    def rescale_by_width(self, image, target_width,
                         method=cv2.INTER_LANCZOS4):
        # Scale `image` to `target_width`
        # (maintain aspect ratio)
        h = int(round(target_width * image.shape[0] / image.shape[1]))
        return cv2.resize(image, (target_width, h),
                          interpolation=method)

Following is the definition of the FramesGenerator class. This class provides functions to extract JPG images by reading the video sequentially. An input video file may have a frame rate of ~30 frames per second, which means 30 images per second of video. Even for a 2-minute video, the number of images to process would be 2 * 60 * 30 = 3600. This is too many images to process, and the full pipeline could take hours to run.

But faces and people are unlikely to change within a single second. So, for a 2-minute video, processing 30 images per second is cumbersome and repetitive. Instead, we can take just 1 frame per second. The FramesGenerator implementation therefore saves only 1 image per second from the video clip.

Given that the extracted images go through face_recognition/dlib processing for face extraction, we try to keep the height at no more than 500 pixels and the width limited to 700 pixels. These limits are enforced by the AutoResize function, which in turn calls rescale_by_height or rescale_by_width to shrink the image when a limit is exceeded, while still maintaining the aspect ratio.

Moving to the next snippet: the AutoResize function tries to limit the size of the given image. If the width is greater than 700, we shrink the image to a width of 700 while keeping the aspect ratio. The other limit set here is that the height must not exceed 500.
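The resizing arithmetic can be checked without opencv; the helper below is a pure-Python sketch of the dimension limits just described (the function name is ours, not part of the pipeline):

```python
# Compute the final (height, width) that AutoResize-style
# limits (height <= 500, width <= 700) would produce,
# keeping the aspect ratio at every step
def limit_dimensions(height, width, max_h=500, max_w=700):
    if height > max_h:
        width = int(round(max_h * width / height))
        height = max_h
    if width > max_w:
        height = int(round(max_w * height / width))
        width = max_w
    return height, width

# A 1080x1920 frame first shrinks to height 500 (width 889),
# which still exceeds the width limit, so it shrinks again
print(limit_dimensions(1080, 1920))  # -> (394, 700)
```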

# FramesGenerator extracts image frames
# from the given video and resizes them
# for face_recognition / dlib processing

class FramesGenerator:

    def __init__(self, VideoFootageSource):
        self.VideoFootageSource = VideoFootageSource

    # Resize the input frame to fit the specified
    # limits before extracting faces
    def AutoResize(self, frame):
        resizeUtils = ResizeUtils()

        height, width, _ = frame.shape

        if height > 500:
            frame = resizeUtils.rescale_by_height(frame, 500)
            return self.AutoResize(frame)

        if width > 700:
            frame = resizeUtils.rescale_by_width(frame, 700)
            return self.AutoResize(frame)

        return frame

Below is a snippet of the GenerateFrames function. It queries the fps to determine after how many frames 1 image should be saved. We clear the output directory and start looping through the frames. Before saving any image, we pass it through AutoResize, which resizes it if it exceeds the limits described above.

# Extract 1 frame from every second of the video
# and save the frames to a specific folder

def GenerateFrames(self, OutputDirectoryName):
    cap = cv2.VideoCapture(self.VideoFootageSource)
    _, frame = cap.read()

    fps = cap.get(cv2.CAP_PROP_FPS)
    TotalFrames = cap.get(cv2.CAP_PROP_FRAME_COUNT)

    print("[INFO] Total Frames", TotalFrames, "@", fps, "fps")
    print("[INFO] Calculating number of frames per second")

    CurrentDirectory = os.path.curdir
    OutputDirectoryPath = os.path.join(
        CurrentDirectory, OutputDirectoryName)

    if os.path.exists(OutputDirectoryPath):
        shutil.rmtree(OutputDirectoryPath)
        time.sleep(0.5)
    os.mkdir(OutputDirectoryPath)

    CurrentFrame = 1
    fpsCounter = 0
    FrameWrittenCount = 1
    while CurrentFrame < TotalFrames:
        _, frame = cap.read()
        if frame is None:
            continue

        if fpsCounter > fps:
            fpsCounter = 0

            frame = self.AutoResize(frame)

            filename = "frame_" + str(FrameWrittenCount) + ".jpg"
            cv2.imwrite(os.path.join(
                OutputDirectoryPath, filename), frame)

            FrameWrittenCount += 1

        fpsCounter += 1
        CurrentFrame += 1

    print('[INFO] Frames extracted')

Below is a snippet for the FramesProvider class. It inherits from "Node", which is used to build the image-processing pipeline. We implement the "setup" and "run" functions. Arguments declared in the "setup" function are expected as parameters by the constructor during object creation; here we pass the sourcePath parameter to the FramesProvider object. The "setup" function runs only once. The "run" function is called repeatedly and keeps emitting data into the pipeline, via the emit function, until the close function is called.

Here in "setup" we take sourcePath as an argument and loop through all the files in the given frames directory. Whatever file extension is .jpg (which will be generated by the FrameGenerator class), we add it to the "filesList".

During "run" calls, each jpg image path from "filesList" is packed, along with attributes specifying a unique "id" and an "imagePath", into an object and sent into the pipeline for processing.

# Below are the nodes for the pipeline construction.
# They create and execute threads asynchronously
# to read images, extract facial embeddings and
# store them independently in different threads

# Keep emitting filenames into the
# pipeline for processing

class FramesProvider(Node):

    def setup(self, sourcePath):
        self.sourcePath = sourcePath
        self.filesList = []
        for item in os.listdir(self.sourcePath):
            _, fileExt = os.path.splitext(item)
            if fileExt == '.jpg':
                self.filesList.append(os.path.join(item))
        self.TotalFilesCount = self.size = len(self.filesList)
        self.ProcessedFilesCount = self.pos = 0

    # Emit each filename into the parallel pipeline
    def run(self, data):
        if self.ProcessedFilesCount < self.TotalFilesCount:
            self.emit({'id': self.ProcessedFilesCount,
                       'imagePath': os.path.join(self.sourcePath,
                           self.filesList[self.ProcessedFilesCount])})
            self.ProcessedFilesCount += 1

            self.pos = self.ProcessedFilesCount
        else:
            self.close()

Below is the implementation of the FaceEncoder class, which inherits from "Node" and can be plugged into the image-processing pipeline. In the "setup" function, we accept the "detection_method" value for the "face_recognition/dlib" face detector to use. It can be a cnn- or hog-based detector.
The "run" function unpacks the incoming data into "id" and "imagePath".

It then reads the image from "imagePath" and runs "face_locations", defined in the "face_recognition/dlib" library, to crop out the aligned face image, which is our region of interest. An aligned face is a rectangular cropped image in which the eyes and lips are aligned to specific locations in the image (Note: the implementation may differ from other libraries such as opencv).

Next, we call the "face_encodings" function defined in "face_recognition/dlib" to extract the facial embeddings from each box. These embeddings are floating-point vectors that capture the features of the aligned face image.
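face_recognition produces one 128-dimensional embedding per face, and embeddings of the same person sit close together in Euclidean distance (dlib's conventional match threshold is 0.6). A minimal numpy sketch with synthetic vectors:

```python
import numpy as np

# Synthetic embeddings: two nearby vectors ("same person")
# and one distant vector ("different person")
emb_a = np.zeros(128)
emb_b = np.full(128, 0.01)   # close to emb_a
emb_c = np.full(128, 0.10)   # far from emb_a

def is_same_person(e1, e2, threshold=0.6):
    # dlib's conventional face-match threshold is 0.6
    return bool(np.linalg.norm(e1 - e2) < threshold)

print(is_same_person(emb_a, emb_b))  # distance ~0.113 -> True
print(is_same_person(emb_a, emb_c))  # distance ~1.131 -> False
```

DBSCAN's eps parameter plays a role analogous to this threshold when the embeddings are clustered later in the pipeline.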

We define the variable "d" as an array of boxes and their associated embeddings. We then wrap the "id" and the embeddings array (under the "encodings" key) into an object and send it into the image-processing pipeline.

# Encode the face embedding, link the path
# and location, and emit to the pipeline

class FaceEncoder(Node):

    def setup(self, detection_method='cnn'):
        self.detection_method = detection_method
        # detection_method can be cnn or hog

    def run(self, data):
        id = data['id']
        imagePath = data['imagePath']
        image = cv2.imread(imagePath)
        rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        boxes = face_recognition.face_locations(
            rgb, model=self.detection_method)

        encodings = face_recognition.face_encodings(rgb, boxes)
        d = [{"imagePath": imagePath, "loc": box, "encoding": enc}
             for (box, enc) in zip(boxes, encodings)]

        self.emit({'id': id, 'encodings': d})

Below is the implementation of DatastoreManager, which again inherits from "Node" and can be plugged into the image-processing pipeline. The purpose of this class is to dump the "encodings" array as a pickle file, using the "id" parameter to name the pickle file uniquely. We want the pipeline to be multithreaded.
To exploit multithreading for better performance, we need to partition the asynchronous tasks properly and avoid any need for synchronization. So, for maximum performance, we let each thread in the pipeline independently write its data to its own separate file, without interfering with the other threads' operations.
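The per-thread, per-file pickle idea can be sketched as follows (the filenames, directory, and encoding values are illustrative, not the pipeline's actual names):

```python
import os
import pickle
import tempfile

# Each pipeline thread dumps its encodings into its own
# uniquely named pickle file, so no locking is needed
outdir = tempfile.mkdtemp()
batches = {0: [{"encoding": [0.1] * 128}],
           1: [{"encoding": [0.2] * 128}]}

for frame_id, encodings in batches.items():
    path = os.path.join(outdir, "encodings_%d.pickle" % frame_id)
    with open(path, "wb") as f:
        pickle.dump(encodings, f)

# Later, a single reader merges all the per-frame files
# into one list for the clustering step
merged = []
for name in sorted(os.listdir(outdir)):
    with open(os.path.join(outdir, name), "rb") as f:
        merged.extend(pickle.load(f))

print("Merged", len(merged), "encodings")
```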
