Deep Feature Learning with Relative Distance Comparison for Person Re-identification
Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao
Pattern Recognition, 2015

Abstract


Fig. 1 Typical examples of pedestrians captured by different cameras. Each column corresponds to one person. Large variations exist due to changes in lighting, pose and viewpoint.

Identifying the same individual across different scenes is an important yet difficult task in intelligent video surveillance. Its main difficulty lies in how to preserve the similarity of the same person against large appearance and structure variations while discriminating different individuals. In this paper, we present a scalable, distance-driven feature learning framework based on a deep neural network for person re-identification, and demonstrate its effectiveness in handling these challenges. Specifically, given training images with class labels (person IDs), we first produce a large number of triplet units, each of which contains three images: one person with a matched reference and a mismatched reference. Treating these units as the input, we build a convolutional neural network to generate the layered representations, followed by an $L_2$ distance metric. Through parameter optimization, our framework tends to maximize the relative distance between the matched pair and the mismatched pair for each triplet unit. Moreover, a nontrivial issue arising with this framework is that the triplet organization cubically enlarges the number of training triplets, as one image can be involved in several triplet units. To overcome this problem, we develop an effective triplet generation scheme and an optimized gradient descent algorithm, so that the computational load depends mainly on the number of original images rather than the number of triplets. On several challenging databases, our approach achieves very promising results and outperforms other state-of-the-art approaches.
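To make the training objective concrete, below is a minimal sketch in PyTorch of such a relative distance criterion, written as a hinge on squared $L_2$ distances between the learned features; the margin value and the exact loss form are illustrative assumptions rather than the paper's precise formulation.

import torch
import torch.nn.functional as F

def relative_distance_loss(anchor, positive, negative, margin=1.0):
    # anchor / positive / negative: (batch, dim) feature vectors produced by the CNN
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 distance to the matched reference
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2 distance to the mismatched reference
    # penalize triplets whose matched pair is not closer than the mismatched pair by `margin`
    return F.relu(margin + d_pos - d_neg).mean()

# usage with a hypothetical feature network `net`:
# loss = relative_distance_loss(net(x), net(x_matched), net(x_mismatched))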

Fig. 2 Illustration of deep feature learning via relative distance maximization. The network is trained with a set of triplets to produce effective feature representations under which the matched images are closer than the mismatched images.

 

 

Deep Architecture


Fig. 3 Illustration of maximizing the relative distance for person re-identification. In each triplet, the $L_2$ distance in the feature space between the matched pair should be smaller than that between the mismatched pair.
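As a rough sketch of how such an architecture can be realized (the layer sizes, pooling, and feature normalization below are illustrative assumptions, not the paper's exact configuration), a single shared-weight CNN embeds every image of a triplet into the space where the $L_2$ comparison is made:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureNet(nn.Module):
    # Small convolutional feature extractor (hypothetical configuration).
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)              # (batch, 64)
        return F.normalize(self.fc(h), dim=1)    # unit-length features for L2 comparison

# The same network (shared weights) is applied to all three images of a triplet:
# f_anchor, f_matched, f_mismatched = net(x_a), net(x_p), net(x_n)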

 

 

Comparisons with state-of-the-art methods



Fig. 5 Performance comparison using CMC curves on the i-LIDS (left) and VIPeR (right) datasets.
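For reference, a CMC curve of this kind can be computed from the learned features roughly as follows (NumPy sketch; a single-shot protocol with exactly one true match per probe is assumed, which may differ from the exact evaluation protocol used in these experiments):

import numpy as np

def cmc_curve(probe_feats, gallery_feats, probe_ids, gallery_ids, max_rank=20):
    # probe_feats: (P, d), gallery_feats: (G, d); *_ids are NumPy arrays of person labels
    hits = np.zeros(max_rank)
    for feat, pid in zip(probe_feats, probe_ids):
        dists = np.sum((gallery_feats - feat) ** 2, axis=1)    # squared L2 to every gallery image
        order = np.argsort(dists)                               # gallery ranked by distance
        rank = int(np.where(gallery_ids[order] == pid)[0][0])   # position of the true match
        if rank < max_rank:
            hits[rank] += 1
    return np.cumsum(hits) / len(probe_ids)                     # rank-k matching rate, k = 1..max_rank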


Table 1. Performance of different models on the i-LIDS (left) and VIPeR (right) datasets.


Fig. 6 Search examples on the i-LIDS (left) and VIPeR (middle) datasets, and visualization of feature maps generated by our approach (right). Each column represents a ranking result, with the top image being the query and the remaining images being the returned list. The image with the red bounding box is the matched one. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

 

Conclusion


In this paper, we present a scalable deep feature learning model for person re-identification via relative distance comparison. In this model, we construct a CNN that is trained with a set of triplets to produce features satisfying the relative distance constraints organized by that triplet set. To cope with the cubically growing number of triplets, we present an effective triplet generation scheme and an extended network propagation algorithm that train the network efficiently in an iterative manner. Our learning algorithm ensures that the overall computational load depends mainly on the number of training images rather than the number of triplets. Extensive experiments demonstrate the superior performance of our model compared with state-of-the-art methods. In future research, we plan to extend our model to more datasets and tasks.
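As a rough illustration of such a triplet generation scheme (the paper's exact batching and sampling strategy may differ), triplets can be drawn as index tuples over the training images, grouped by person ID; each image is then forwarded through the network only once, and its feature is reused by every triplet that references it, so the cost of an update grows with the number of images rather than with the number of triplets built on top of them.

import random
from collections import defaultdict

def generate_triplets(labels, triplets_per_anchor=5, seed=0):
    # labels[i] is the person ID of training image i; triplets are returned as image-index tuples
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_person[pid].append(idx)
    triplets = []
    for pid, imgs in by_person.items():
        if len(imgs) < 2:
            continue                                   # need at least one matched reference
        for anchor in imgs:
            for _ in range(triplets_per_anchor):
                positive = rng.choice([i for i in imgs if i != anchor])
                other = rng.choice([p for p in by_person if p != pid])
                negative = rng.choice(by_person[other])
                triplets.append((anchor, positive, negative))
    return triplets  # features of the distinct images are computed once and indexed by these tuples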

 

 
