In a general sense, Person Re-identification (aka Person ReID) belongs to a larger class of “tracking-by-detection” problems.
To understand it, let’s take a small example. Suppose we are recording a football match in a stadium, and the broadcaster is using 8 cameras. We are tasked with developing a computer vision model that not only detects the football players (which is relatively easy), but also assigns an ID to each player. For example, if a player is detected in the first minute of the game, the player is given an ID. When the player is detected again in another frame or from another camera, say during the 10th minute, the model should recognize that this person was already identified in the first minute and assign the same ID to the detection.
An abstract overview of how this system would work is as follows:
- Each time an object (say, a person) is detected, it is compared with previously detected objects (if any) using a similarity metric. If the object was already detected, the detection is given the same ID.
- If the person or object wasn’t already detected, this new person is given a new ID and stored in the database for comparison with future detections.
If the “object” mentioned above is a person, then it is person re-identification.
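The matching loop described above can be sketched as follows. The similarity threshold, the embedding vectors, and the dictionary "database" are all illustrative assumptions; a real system would use embeddings produced by a trained model and a tuned threshold.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v)

def match_or_register(embedding, gallery, threshold=0.7):
    """Compare a new detection's embedding against the gallery of known people.

    Returns the matched ID, or registers the embedding under a new ID
    when no gallery entry is similar enough.
    """
    embedding = l2_normalize(embedding)
    best_id, best_sim = None, -1.0
    for person_id, stored in gallery.items():
        sim = float(np.dot(embedding, stored))  # cosine similarity
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    if best_sim >= threshold:
        return best_id          # seen before: reuse the existing ID
    new_id = len(gallery)       # new person: register and assign a fresh ID
    gallery[new_id] = embedding
    return new_id
```

In practice the gallery is usually updated over time (e.g. by averaging embeddings of confirmed matches), but the core logic is this compare-then-register loop.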
Person Re-Identification Use Cases
The most frequently mentioned use case for Person Re-Identification in the research literature is tracking people in a retail setting. Person Re-Identification can be used to track the movement of individuals in a retail store to obtain analytics such as which sections of the supermarket people spend the most time in, which person is more likely to shoplift, etc.
According to a study by Checkpoint Systems and Euromonitor, retail theft alone costs retailers an estimated $112 billion. This is where Person Re-Identification can play a major role. By using Person Re-Identification, retail stores can track the activities of their customers across multiple cameras and identify shoplifting activities.
Person Re-Identification can also be used in industrial monitoring. A report from the International Labour Organization states that 340 million occupational accidents happen globally every year. Real-time tracking of staff and their interactions with objects can help reduce unauthorized usage of machinery and can also help identify and potentially reduce accidents. Using person tracking footage, we can also derive insights into why such accidents occur and frame policies to avoid them in the future.
One of the earliest use cases of Person Re-Identification is video surveillance in law enforcement. Person Re-Identification can play a major role in identifying and tracking people in police databases, and if applied correctly it can also be used to track vehicles. There are valid ethical concerns when it comes to mass surveillance. However, with the right regulations and enforcement, these systems can play a huge role in person search and tracking.
Challenges in Person Re-identification
This problem comes with a set of challenges that researchers and practitioners have been working on for years. Even though it might sound like a cliché, deep learning models have produced much better results in the past few years (I know, you read that in virtually every paper on deep learning, but it is true).
Yet, despite these strides, some challenges still persist.
The most challenging scenario in person re-identification arises when we know little in advance about the people we’ll have to re-identify. If we are to detect and track a set of people whose images or feature vectors we already have, the problem becomes relatively easy. In most cases, however, that isn’t so, and images of the people have to be added to the “re-identification database” in real time, while those very people are being detected.
But why is that a problem? One major reason is variation in viewpoints.
If the camera first encounters a person from a specific angle, it stores the embeddings/features of that person in the database. When the camera later encounters the same person from a different angle, those initial embeddings may not be enough to re-identify them, so there is a high chance the new detection is treated as a different person.
Another problem stems from the use case itself. In many person ReID deployment scenarios, the input images show people at a long distance, or the images (usually from CCTV cameras) are of low resolution. When people are far from the camera, there is very little information about them that deep learning models can use to perform re-identification. With low-resolution footage, we face the same issue: the model cannot extract enough information about a person during their first identification.
Person Re-Identification Methods
All Person ReID methods consist of two parts: a model that detects people, and another model that provides feature embeddings, which we use to calculate similarity and re-identify people.
Let’s look at some methods proposed over the past couple of years that work on open settings.
DeepSORT uses a combination of a Kalman filter and a CNN to perform re-identification and tracking. For detecting humans, any object detection model can be used, though YOLOv5 (pre-trained on the COCO dataset) usually works really well.
The CNN provides the embeddings for each detection, which can be used to make associations with existing detections. Kalman filters are used to maintain “tracks” across frames, which makes it possible to create or discard associations as people enter or leave the frame.
The authors of the paper used Mahalanobis distance and cosine distance to calculate the distance between feature embeddings. In the original paper, the CNN was trained on the MARS dataset.
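A minimal sketch of the two distance metrics is shown below. In DeepSORT, the Mahalanobis distance is applied to Kalman-predicted motion states while the cosine distance is applied to the CNN's appearance embeddings; the vectors here are placeholders for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mahalanobis_distance(x, mean, cov):
    """Distance of a measurement x from a track's predicted state
    distribution, described by its mean and covariance."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

With an identity covariance the Mahalanobis distance reduces to the ordinary Euclidean distance; the covariance term is what lets the tracker weight uncertain state dimensions less heavily.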
The prediction of bounding boxes is based on an eight-dimensional state (u, v, γ, h, u^hat, v^hat, γ^hat, h^hat). The states u and v denote the bounding box center, γ the aspect ratio, and h the height. The states u^hat, v^hat, γ^hat, and h^hat form the velocity vector that the Kalman filter uses to make estimations.
We use Kalman filters because the model may sometimes miss an association in a single frame even though the person is re-identified in the previous and subsequent frames. The Kalman filter helps us recover such missed re-identifications.
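The prediction step over this eight-dimensional state can be sketched with a constant-velocity motion model. This is only the predict half of the filter, with made-up noise values; the full DeepSORT tracker also performs measurement updates and track management.

```python
import numpy as np

def make_transition_matrix(dt=1.0):
    """Constant-velocity model for the state [u, v, gamma, h, and their
    velocities]: each position component advances by its velocity per step."""
    F = np.eye(8)
    for i in range(4):
        F[i, i + 4] = dt
    return F

def predict(state, covariance, F, process_noise):
    """One Kalman predict step: propagate the state mean and covariance."""
    state = F @ state
    covariance = F @ covariance @ F.T + process_noise
    return state, covariance
```

Because the covariance grows at each predict step, a track that goes unmatched for a few frames becomes progressively more tolerant when gated by the Mahalanobis distance, which is what lets brief missed detections be recovered.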
You can read the paper here.
The basis of Siamese networks is one-shot learning. One-shot learning is a way to perform classification on classes the model hasn’t seen before, but that are related in some form to the training data.
So let’s say we need to build a facial biometrics system for a company with 10,000 employees. Initially, we can collect the images of the 10,000 employees and build a classifier model. Now, what if 200 more employees join next week? We could re-train the model on all 10,200 employees. But the problem is, every time a new employee joins, we need to retrain the model to recognize that person.
Siamese networks can help in such situations. A Siamese network doesn’t learn to classify directly; rather, it learns the differences between the different labels in the training data. In the facial biometrics case, the labels are the people working in the company.
Since the Siamese network learns only the differences between classes (here, differences between faces), the model generalizes to computing differences between faces of people it has never seen before.
Two popular loss functions used to train Siamese Networks are the Triplet Loss and the Contrastive loss.
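Both losses can be sketched in a few lines. The margin values below are illustrative defaults, and these scalar versions operate on single embedding vectors rather than training batches.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull the anchor toward the positive embedding and push
    it at least `margin` farther (in squared distance) from the negative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

def contrastive_loss(a, b, same_person, margin=1.0):
    """Contrastive loss: pull matching pairs together, push non-matching
    pairs apart until they are at least `margin` away from each other."""
    d = np.linalg.norm(a - b)
    if same_person:
        return d ** 2
    return max(margin - d, 0.0) ** 2
```

The triplet loss needs triplets (anchor, same-person positive, different-person negative) mined from the training set, while the contrastive loss only needs labeled pairs; both drive embeddings of the same person closer together than embeddings of different people.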
Just like with DeepSORT, any object detection model can be used to retrieve the people in the frame; the Siamese network can then be used to compute similarity scores to identify people.
Learn more about Siamese Networks here.
Variational Autoencoders (VAEs) belong to a class of generative models that can model a dataset’s distribution. Once a VAE is trained to model a distribution, we can use it to generate new samples that match the distribution, even though those samples weren’t in the training data.
The structure of a VAE is quite simple; it consists of two parts, an encoder and a decoder. The encoder takes an image I as input and produces feature embeddings. In autoencoders, these embeddings are known as latent variables. The job of the decoder is to take the latent variables from the encoder and reconstruct the original image I.
But, how does this help us with person ReID?
Well, the latent variables from the encoder can serve as good embeddings for comparing similarity between new detections and existing ones. A VAE trained on a sufficient amount of data will learn to produce latent vectors that carry a lot of information about the person of interest.
So, we can train a VAE on data containing people, then discard the decoder, because we only need the encoder to produce the latent variables.
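The encoder-only workflow can be sketched as below. The linear "encoder" and its random weights are purely illustrative assumptions (a real VAE encoder is a trained convolutional network); the point is that the encoder outputs a mean and log-variance, and at inference time the mean alone can serve as the ReID embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Toy linear 'encoder': map an input vector to the mean and
    log-variance of the latent Gaussian."""
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, the reparameterization trick used
    while training a VAE."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# During training, z feeds the decoder for reconstruction; for ReID
# inference we skip sampling and use mu directly as the embedding.
```

Using the mean rather than a sampled z gives a deterministic embedding for each detection, which keeps similarity comparisons stable across frames.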
Just like in the other models, we can use any object detection model to detect people, while using the VAE to calculate the similarity for person ReID.
Learn more about VAE here.