Background of the invention
Person re-identification (person re-id for short) is the problem of matching people across disjoint camera views in a multi-camera system. It is useful for a number of public security applications, such as intelligent camera surveillance systems. In a typical real-world application, a single person or a watch-list of a handful of known people is provided as the target set for searching through a large volume of video surveillance footage in which the people on the watch-list are likely to reappear. A list of cropped images of people is retrieved as candidates (also called the "filtered result"), which ideally contains people with the same identity as the search target.
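The retrieval step described above can be illustrated with a minimal sketch. It assumes that every cropped image has already been reduced to a feature vector and that candidates are ranked by Euclidean distance to the target; the function names, the crop identifiers and the toy two-dimensional features are hypothetical and serve only as illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(query, gallery, top_k=3):
    """Return the ids of the top_k gallery crops closest to the query.

    gallery: dict mapping a crop id to its feature vector
    (a hypothetical layout chosen for this sketch).
    """
    scored = sorted(gallery.items(), key=lambda item: euclidean(query, item[1]))
    return [crop_id for crop_id, _ in scored[:top_k]]


# Toy 2-D features; real features would come from a trained network.
gallery = {
    "cam2_crop17": [0.9, 0.1],
    "cam3_crop04": [0.2, 0.8],
    "cam5_crop31": [0.8, 0.2],
}
candidates = rank_candidates([1.0, 0.0], gallery, top_k=2)
```

Here `candidates` plays the role of the filtered result: the crops most similar to the target, ordered from most to least similar.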
Despite increasing attention from both academia and industry, person re-identification remains an extremely challenging task, especially in practical environments. The reasons include the following. (1) The target and the persons in the search space are seen from different views (frontal, side, back, etc.) because the angle and distance between the camera and the person vary. (2) The target person is usually captured at a very low frame rate, which is typical of most existing recorded public-space CCTV footage. (3) The footage often comes from very crowded places, such as the exit of a subway station. (4) There are many occlusions, so the target is visible only intermittently. (5) The human detection algorithm applied to the surveillance video may not perform perfectly, especially in real time, and a large number of non-human objects are mistakenly detected and cropped, becoming disruptive inputs to the person re-id system. (6) Real-world person re-id is an open-set problem: the number of classes (persons with different identities) is unbounded, so typical classification methods trained on a limited number of classes do not work.
To address these difficulties, we use a deep Siamese neural network to learn a metric that maps the features extracted from the input images into another space. The mapped space is discriminative, so the identity similarity between a pair of input images can be computed effectively in it. In addition, non-person distractor images are fed into the network at the training stage so that it becomes aware of the noise introduced by crops that the human detector produced in error. Because the network learns pairwise similarity rather than a fixed set of classes, no class-based example training is needed and the open-set problem is addressed. We design a comprehensive data preparation strategy involving a sequence of data augmentation operations. A network trained with such data proves more robust to changes in the recording environment, to changes in view and distance, and to occlusions of the persons. To further improve robustness, we train on seven public person re-id datasets and make the network more adaptive to domain variations with a novel cross-domain dropout strategy.
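As an illustration of pairwise metric learning, the sketch below uses the contrastive loss, a common choice for Siamese networks. This disclosure does not name the loss function, so the loss, the margin value and the function names are assumptions made for this example only. A distractor crop paired with a person crop is treated simply as a "different" pair.

```python
import math


def embedding_distance(f1, f2):
    """Euclidean distance between two embeddings in the mapped space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def contrastive_loss(f1, f2, same_identity, margin=1.0):
    """Contrastive loss for one image pair (assumed loss, not from the source).

    Same-identity pairs are pulled together; different pairs, including
    person/distractor pairs, are pushed apart until they are at least
    `margin` away from each other.
    """
    d = embedding_distance(f1, f2)
    if same_identity:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# A same-person pair that is far apart is penalised heavily.
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=True)
# A person/distractor pair that is too close is also penalised.
loss_distractor = contrastive_loss([0.0, 0.0], [0.5, 0.0], same_identity=False)
# A different pair already beyond the margin contributes no loss.
loss_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=False)
```

Because the loss depends only on whether the two images share an identity, an identity never seen in training can still be compared at test time, which is what makes the pairwise formulation suitable for the open-set setting.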
Novelty of the invention
In this invention, we provide a technique for human re-id that tackles the aforementioned problems. We are the first to train an improved deep Siamese neural network (DSNN) on as many as seven public datasets to seek the most discriminative projections of the image features extracted by the preceding neural network layers. Around one million image pairs, labeled as "the same person" or "different persons" and drawn from distinct public datasets, are fed into the network during the training phase. More importantly, around 100 thousand further pairs are used during the fine-tuning phase to eliminate the domain-specific problem, that is, the typical over-fitting that occurs on one specific training set. A joint domain dropout strategy is invented to achieve this goal: it mutes the neurons that are significant to the re-id results of only a small portion of the datasets (e.g. fewer than five of the seven datasets). The remaining neurons are active across a wide range of camera configurations and are therefore more robust when applied to a new surveillance system (e.g. a real-world intelligent surveillance system).
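The muting rule can be sketched as follows. The sketch assumes that a per-dataset significance score is available for every neuron and that a neuron counts as significant on a dataset when its score exceeds a threshold; how significance is actually measured is not specified here, so the score matrix, the threshold and the function names are hypothetical.

```python
def joint_domain_mask(scores, threshold=0.0, min_domains=5):
    """Build a 0/1 mask over neurons.

    scores[d][n] is a hypothetical significance score of neuron n on
    dataset d. A neuron is kept (1) only if it is significant on at
    least min_domains datasets; otherwise it is muted (0).
    """
    num_neurons = len(scores[0])
    mask = []
    for n in range(num_neurons):
        support = sum(1 for domain_scores in scores if domain_scores[n] > threshold)
        mask.append(1 if support >= min_domains else 0)
    return mask


def apply_mask(activations, mask):
    """Zero out the activations of muted neurons."""
    return [a * m for a, m in zip(activations, mask)]


# Seven datasets (rows), three neurons (columns); values are made up.
scores = [
    [0.4, 0.3, 0.2],
    [0.5, 0.1, 0.3],
    [0.2, -0.1, 0.1],
    [0.3, -0.2, 0.4],
    [0.6, -0.3, 0.2],
    [0.1, -0.1, -0.2],
    [0.2, -0.4, -0.1],
]
mask = joint_domain_mask(scores)  # neuron 1 is significant on only 2 datasets
muted = apply_mask([0.7, 0.9, 0.5], mask)
```

In this toy example the second neuron is muted because it matters on only two of the seven datasets, while the other two neurons, significant on seven and five datasets respectively, stay active.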
Advantages and improvements over existing methods, devices or materials
Existing works in the literature focus on small-scale, domain-specific datasets. The robustness of these algorithms is not guaranteed when they face much more complicated real-world, open-set person re-id problems. Our experiments confirm this point: none of the existing re-id algorithms works consistently well when given streaming footage from an increasing number of cameras with large variations in the surveillance environment. This technique is the first to train a human re-id algorithm jointly on multiple datasets and then refine it by fine-tuning to prevent domain-specific over-fitting. The fine-tuning step is shown to increase performance and robustness substantially. Several difficulties in human re-id are well addressed in our experiments, including:
- Low input resolution and illumination
- Erroneous human detection results
- Different viewpoints
The technique can easily be extended to real-time re-id applications while offering a user-friendly interface. Users can operate a system embedding this technique for surveillance with great ease. They also have the flexibility to input their own labeled data (image pairs with binary identity labels) as complementary samples for further training.
Commercial applications of the invention
This technique is applicable as one module of an intelligent surveillance system, for efficiently searching for persons across different surveillance cameras.