1. INTRODUCTION

Person re-identification (ReID) aims to retrieve a queried person across cameras and is widely applied in video surveillance, security, and smart cities. Recently, various person ReID methods1-10 have been proposed and achieve good performance on holistic cases. However, a person is often occluded by obstacles (e.g., plants, cars, or other pedestrians), so occlusion is one of the main challenges of ReID. The problem caused by occlusion is called occluded person re-identification11,12. Local features have proved helpful for the occlusion problem4,10,13; the remaining questions are how to segment local features reasonably and how to align them precisely. Simply partitioning human features horizontally into several strips (e.g., PCB4) causes serious misalignment, so some works introduce extra cues (e.g., body key-points10,12 or human parsing8,14) for local feature extraction and achieve semantic alignment. However, extra-cue methods usually depend on an additional model, which greatly increases computation and parameters. What is more, that additional model is trained on datasets other than ReID datasets, which leads to a performance decrease on ReID datasets and harms local feature extraction and alignment. Is there a simple but efficient end-to-end model? Concretely, we seek a model that introduces no extra cue, performs better than extra-cue methods, and is more lightweight than extra-cue methods. The earlier VPM9 is an end-to-end model that is more lightweight than extra-cue methods but performs worse. As in previous works4,15,16, we combine global and local branches, and further employ a fully connected layer for global-local balance. In addition, previous works9,12,16 usually apply a softmax to the masks, which improperly restricts their representation; we remove the softmax and propose a more general weighted pooling for masks without softmax. Finally, in this work, our contributions are as follows:
2. THE PROPOSED METHOD

2.1 Local feature extraction

As shown in Figure 1, we adopt a mask-based manner for local feature extraction. Because most mask-based methods9,12,16 have similar structures for local feature extraction, we emphasize the differences of our method.

Figure 1. The proposed Part-Aware Network. The feature extraction module is a backbone network (e.g., ResNet50) and the mask generator is a 1×1 convolution layer. “WP” is weighted pooling, “GMP” is global maximum pooling. “FC” is for dimension reduction and global-local balance, while “Shared FC” is a share-weight fully connected layer applied to all local features. ⊗ means element-wise matrix multiplication, © denotes concatenation.

In our work, the proposed Part-Aware Network (PAN) transforms pedestrian images into an embedded feature f_cat for discrimination, and this embedded feature f_cat is a concatenation of one global embedded feature and p local embedded features. Given a pedestrian image, we first resize it and input it to PAN. Through a feature extraction module (a backbone network, e.g., ResNet17 or OSNet18), PAN outputs three global feature tensors F_1 ∈ ℝ^(c1×h×w), F_2 ∈ ℝ^(c2×h×w), and F_3 ∈ ℝ^(c3×h×w), where c, h, w are the channel number, height, and width of the feature, respectively. Compared to VPM, where these three tensors are the same tensor, we separate them to discuss whether an additional buffer layer helps to improve performance. Then a mask generator is employed on F_1:

M = G(F_1), (1)

where the masks M ∈ ℝ^(p×h×w) and G(·) is the mask generator, which is a simple 1×1 convolution layer in our work. Though most mask-based works9,12,19 apply a softmax to the masks for a probability explanation, our opinion is that this operation affects the independence of masks and causes gradient vanishing. So we remove the softmax and develop a new weighted pooling method. Based on the masks M, we extract local features from F_2:

F_l,i = m_i ⊗ F_2, (2)

where ⊗ denotes element-wise multiplication, the ith mask m_i ∈ ℝ^(1×h×w), and the ith output local feature F_l,i ∈ ℝ^(c2×h×w).
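The mask generator and local feature extraction described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: the 1×1 convolution is written as a per-pixel linear map, and all shapes and variable names are hypothetical.

```python
import numpy as np

def mask_generator(feat, weight, bias):
    """1x1 convolution as a per-pixel linear map from c1 channels to p masks.
    feat: (c1, h, w); weight: (p, c1); bias: (p,). Returns masks M: (p, h, w)."""
    c1, h, w = feat.shape
    flat = feat.reshape(c1, h * w)            # flatten spatial dims
    masks = weight @ flat + bias[:, None]     # (p, h*w)
    return masks.reshape(-1, h, w)

def extract_local_features(masks, feat):
    """Broadcasted element-wise multiply of each mask with the feature tensor.
    masks: (p, h, w); feat: (c2, h, w). Returns (p, c2, h, w)."""
    return masks[:, None, :, :] * feat[None, :, :, :]

# Toy shapes (assumptions, far smaller than a real backbone's output).
rng = np.random.default_rng(0)
c1, c2, h, w, p = 8, 8, 4, 2, 3
F1 = rng.standard_normal((c1, h, w))   # tensor fed to the mask generator
F2 = rng.standard_normal((c2, h, w))   # tensor used for local features
W = rng.standard_normal((p, c1))
b = np.zeros(p)
M = mask_generator(F1, W, b)
Fl = extract_local_features(M, F2)
print(M.shape, Fl.shape)  # (3, 4, 2) (3, 8, 4, 2)
```

Note that no softmax is applied to `M`, matching the paper's design choice; the mask values may be negative.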
To squeeze the spatial information of local features, the proposed mask-based weighted pooling is applied to the local features:

f_l,i = Σ_(x,y) F_l,i(x, y) / Σ_(x,y) |m_i(x, y)|, (3)

where the output local embedded feature f_l,i ∈ ℝ^(c2). While the elements x_j of masks after a softmax satisfy x_j ∈ [0,1], the elements of our masks satisfy x_j ∈ (–∞, ∞), so the sum of a mask may equal zero; therefore we sum the absolute values of the mask in the denominator. Our weighted pooling is a general form of the weighted pooling in VPM9 and PAT16, which gets rid of the probability explanation and improves gradient transfer.

2.2 Global-local balance strategy

Since we combine the global feature and local features into one discriminative embedded feature, there is a balance between the global feature and the local features. Previous methods balance the global and local branches by loss weights rather than by embedded feature dimensions, so the ranges of global and local features may differ, which introduces redundancy in the final embedded feature. We adopt fully connected layers to reduce the dimension of the embedded feature and achieve global-local balance. Concretely, we apply a fully connected layer to the global feature (dimension c3 → N_g) and a share-weight fully connected layer (“Shared FC” in Figure 1) to the local features (dimension c2 → n_l). The shared weights on all local branches ensure that the difference between local features comes only from the mask-based local feature extraction. We define the balance of the global and local branches as the ratio of local features versus global:

γ = p·n_l / N_g, (4)

where n_l is the dimension of each local embedded feature (256 as default). We exploit the weight-shared fully connected layer to adjust the local feature size and thus change the ratio γ. Since γ measures the importance of local features relative to the global feature, we expect our model to reach its best performance when 1 < γ < p, because local features are more important than the global feature, while the global feature contains more information than a single local feature.
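A minimal numpy sketch of the weighted pooling in equation (3), together with the balance ratio of equation (4). This is an illustration under stated assumptions, not the authors' code: in particular, the values p = 14, n_l = 256, N_g = 512 are assumptions chosen to be consistent with the γ = 7 reported in the ablation study.

```python
import numpy as np

def weighted_pooling(masks, feat):
    """Weighted pooling without softmax (equation (3)): each local embedded
    feature is the mask-weighted spatial sum of the feature map, normalized
    by the sum of |mask| so the denominator cannot vanish even when mask
    values are negative.  masks: (p, h, w); feat: (c2, h, w) -> (p, c2)."""
    num = np.einsum('phw,chw->pc', masks, feat)            # weighted spatial sum
    den = np.abs(masks).sum(axis=(1, 2))[:, None] + 1e-12  # |mask| normalizer
    return num / den

f_local = weighted_pooling(np.ones((2, 3, 3)), np.ones((4, 3, 3)))
print(f_local.shape)  # (2, 4)

# Balance ratio gamma = p * n_l / N_g (equation (4)).  p, n_l, N_g below are
# assumed values consistent with the paper's reported gamma = 7.
p, n_l, N_g = 14, 256, 512
gamma = p * n_l / N_g
print(gamma)  # 7.0
```

With an all-ones mask, the pooling reduces to a spatial average, which matches the intuition that equation (3) generalizes ordinary average pooling.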
3. EXPERIMENT

3.1 Experiment setting

All models are trained and tested on Ubuntu 18.04 with 2 GTX 1080Ti GPUs. We employ random flip and random erasing19 as our data augmentation strategy. Samples are re-scaled to 256×128. We employ the Adam optimizer, and the base learning rate is 3.5×10^-5. During training, there are 20 linear warm-up epochs in which the learning rate rises to 3.5×10^-4; then at milestone epochs 60 and 90, the learning rate is multiplied by 0.1.

3.2 Datasets

Market15011 and DukeMTMC-reID2 are two common holistic ReID datasets. Occluded Duke12 is made from DukeMTMC-reID. Its training, query, and gallery sets contain 9%/100%/10% occluded images (14%/15%/10% for DukeMTMC-reID). In other words, the training set of Occluded Duke contains fewer occluded images than DukeMTMC-reID, and its query set contains only occluded images.

4. PERFORMANCE COMPARISON

In Table 1, the first group is CNN-based occluded methods and the second group is our baseline and the proposed method. The proposed method achieves an improvement on the occluded dataset (Occluded Duke +10.5% R1/+8.9% mAP) and the holistic datasets (Market1501 +1.5% R1/+2.7% mAP, DukeMTMC-reID +2.2% R1/+3.0% mAP) compared to our baseline part-based method. When we replace ResNet with OSNet, a multi-scale network, performance increases further on all datasets (Occluded Duke +4.7% R1/+4.4% mAP, Market1501 +0.7% R1/+1.4% mAP, DukeMTMC-reID +0.7% R1/+2.5% mAP). The computation and parameter count of our method (ResNet50 backbone) are 5.98 GFLOPS and 46.9 M, while those of the baseline method PCB are 5.97 GFLOPS and 42.7 M. It is safe to say we achieve our design goal.

Table 1. Performance comparison on public datasets.
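The learning-rate schedule described in Section 3.1 can be sketched as follows. This is a hedged reconstruction: the exact warm-up interpolation (linear from the base rate to the peak rate) is an assumption consistent with the text.

```python
def learning_rate(epoch, base_lr=3.5e-5, peak_lr=3.5e-4,
                  warmup_epochs=20, milestones=(60, 90), decay=0.1):
    """Warm-up-then-step schedule: linear warm-up from base_lr toward
    peak_lr over the first 20 epochs, then multiply by 0.1 at epochs 60
    and 90.  Returns the learning rate used at the given epoch."""
    if epoch < warmup_epochs:
        t = epoch / warmup_epochs                 # warm-up progress in [0, 1)
        return base_lr + t * (peak_lr - base_lr)  # assumed linear interpolation
    lr = peak_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay                           # step decay at each milestone
    return lr

print(learning_rate(0), learning_rate(20), learning_rate(60), learning_rate(90))
```

In a PyTorch training loop this would typically be realized with a warm-up wrapper around `torch.optim.lr_scheduler.MultiStepLR`.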
5. ABLATION STUDY

5.1 Share-weight fully connection

Comparing shared and non-shared fully connected layers in Table 2, the shared fully connected layer has little influence on the holistic datasets but shows consistent improvement on the occluded dataset regardless of the backbone (ResNet50 +1.0% R1/+0.9% mAP, OSNet +4.5% R1/+3.4% mAP). So this restriction on generating local features is helpful for occluded cases, perhaps leading to more precise local feature alignment.

Table 2. Comparison of shared and non-shared fully connected layers for local features.
5.2 Global-local balance

We vary the local-global feature ratio γ in equation (4) by changing the output channels of the shared FC, as shown in Figure 2. There are two peaks on the curve: the first at γ = 1.75 and the second at γ = 7. When γ increases from 0.875 to 1.75, performance increases by 1.4% Rank-1 and 1.4% mAP, indicating that the local branch is necessary for discrimination. When γ increases from 1.75 to 7, there is a valley at γ = 3.5, so there may be a competition between the global and local branches. When γ increases from 7 to 28, performance continuously drops, which means the global feature cannot be replaced by local features. Finally, we choose γ = 7 (n_l = 256) for our model.

5.3 The usage of mask generator

In Table 3, we discuss the usage of the mask generator, including which tensor it takes as input.

Table 3. The usage of mask generator.
Note: “Conv40” and “Conv50” are the last two ResNet blocks; “Conv41” is a copied block of “Conv40”; “Conv51” is a copied block of “Conv50”.

5.4 Mask number

We further study the influence of the mask number (or parts), as shown in Figure 3. In our model, the best mask number is 14 on Occluded Duke and 16 on Market1501, which means fine-grained masks are helpful not only for occluded cases but also for holistic cases. This observation differs from PAT16 on holistic cases, because PAT applies a softmax to the masks and adopts a loss to supervise the diversity of local features. Moreover, when we apply a softmax to our masks, Rank-1/mAP on Occluded Duke decreases from 62.5%/52.6% to 60.6%/51.5%, so the probability explanation (softmax operation) is not necessary for masks.

5.5 Weighted pooling

In Table 4, we compare the commonly used weighted pooling (which applies a softmax to give the masks a probability explanation) and the proposed weighted pooling without softmax. Our proposed method shows consistent improvement on most datasets, especially on Occluded Duke (Rank-1 +1.9%/mAP +1.1%). The proposed weighted pooling without softmax improves gradient transfer, so performance generally increases. For occlusion cases, our opinion is that without softmax, different local features freely combine into discriminative features (Figure 4), thus improving occlusion performance.

Figure 4. Visualization of masks. The 1st row is the original images, the 2nd row is the fusion of all masks, and the 3rd, 4th, and 5th rows are masks 1, 2, and 3 of each person.

Table 4. The comparison of weighted pooling.
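To make the pooling comparison in Section 5.5 concrete, the following numpy sketch contrasts the two variants: the conventional one turns each mask into a spatial probability map via softmax, while the proposed one keeps raw (possibly negative) mask values and normalizes by their absolute sum. Shapes and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def pool_softmax(masks, feat):
    """Conventional variant: a spatial softmax turns each mask into a
    probability map, then features are averaged under that distribution."""
    p = masks.shape[0]
    flat = masks.reshape(p, -1)
    prob = np.exp(flat - flat.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)          # each row sums to 1
    return prob @ feat.reshape(feat.shape[0], -1).T  # (p, c)

def pool_no_softmax(masks, feat):
    """Proposed variant: raw mask values, normalized by the sum of
    absolute values so the denominator never vanishes."""
    p = masks.shape[0]
    flat = masks.reshape(p, -1)
    num = flat @ feat.reshape(feat.shape[0], -1).T
    return num / np.abs(flat).sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 4, 2))   # p=3 masks over a 4x2 grid
F = rng.standard_normal((8, 4, 2))   # c2=8 feature channels
print(pool_softmax(M, F).shape, pool_no_softmax(M, F).shape)  # (3, 8) (3, 8)
```

The softmax variant forces every mask element into [0, 1] with each mask summing to one, whereas the proposed variant leaves mask values unconstrained, which is the representational freedom the paper argues for.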
5.6 Visualization of masks

Twelve different persons’ masks are shown in Figure 4. The 1st row is the original images; the 2nd row is the fusion of all masks, which all concentrate on the bodies; the 3rd to 5th rows are the 1st to 3rd masks, which are combinations of different parts, as mentioned in Section 5.5. The visualization shows that our method concentrates well on human bodies, but misalignment also exists (e.g., the 1st mask of the last person), which might be caused by the lack of restriction on the masks.

6. CONCLUSION

In this work, we propose a simple but efficient architecture for occluded person re-identification with a weakly supervised mask generator and a share-weight fully connected layer, and extensive experiments show the effectiveness of our architecture. We also discover a conflict between the global and local branches, and a separate buffer layer helps to resolve this conflict.

ACKNOWLEDGMENTS

This work was supported in part by the Science and Technology Program of Guangdong Province under Grant 2021B1101270007.

REFERENCES

1. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J. and Tian, Q.,
“Scalable person re-identification: A benchmark,” in 2015 IEEE Inter. Conf. on Computer Vision (ICCV), 1116–1124 (2015).
2. Ristani, E., Solera, F., Zou, R., Cucchiara, R. and Tomasi, C., “Performance measures and a data set for multi-target, multi-camera tracking,” Computer Vision—ECCV 2016 Workshops, 17–35 (2016).
3. Li, W., Zhao, R., Xiao, T. and Wang, X., “DeepReID: Deep filter pairing neural network for person re-identification,” in 2014 IEEE Conf. on Computer Vision and Pattern Recognition, 152–159 (2014).
4. Sun, Y., Zheng, L., Yang, Y., Tian, Q. and Wang, S., “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” Computer Vision—ECCV 2018, 501–518 (2018). https://doi.org/10.1007/978-3-030-01225-0
5. Zhang, X., Luo, H., Fan, X., Xiang, W., Sun, Y., Xiao, Q., Jiang, W., Zhang, C. and Sun, J., “AlignedReID: Surpassing human-level performance in person re-identification,” in IEEE Conf. on Computer Vision and Pattern Recognition (2017).
6. Luo, H., Gu, Y., Liao, X., Lai, S. and Jiang, W., “Bag of tricks and a strong baseline for deep person re-identification,” in 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), 1487–1495 (2019).
7. He, L., Liang, J., Li, H. and Sun, Z., “Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach,” in 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 7073–7082 (2018).
8. He, L., Wang, Y., Liu, W., Zhao, H., Sun, Z. and Feng, J., “Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification,” in 2019 IEEE/CVF Inter. Conf. on Computer Vision (ICCV) (2019).
9. Sun, Y., Xu, Q., Li, Y., Zhang, C., Li, Y., Wang, S. and Sun, J., “Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification,” in 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 393–402 (2019).
10. Wang, G., Yang, S., Liu, H., Wang, Z., Yang, Y., Wang, S., Yu, G., Zhou, E. and Sun, J., “High-order information matters: Learning relation and topology for occluded person re-identification,” in 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 6448–6457 (2020).
11. Zhuo, J., Chen, Z., Lai, J. and Wang, G., “Occluded person re-identification,” in 2018 IEEE Inter. Conf. on Multimedia and Expo (ICME), 1–6 (2018).
12. Miao, J., Wu, Y., Liu, P., Ding, Y. and Yang, Y., “Pose-guided feature alignment for occluded person re-identification,” in 2019 IEEE/CVF Inter. Conf. on Computer Vision (ICCV), 542–551 (2019).
13. Fan, X., Luo, H., Zhang, X., He, L., Zhang, C. and Jiang, W., “SCPNet: Spatial-channel parallelism network for joint holistic and partial person re-identification,” Computer Vision—ACCV 2018, 19–34 (2019). https://doi.org/10.1007/978-3-030-20890-5
14. Qi, L., Huo, J., Wang, L., Shi, Y. and Gao, Y., “MaskReID: A mask based deep ranking neural network for person re-identification,” (2019).
15. Wang, G., Chen, X., Gao, J., Zhou, X. and Ge, S., “Self-guided body part alignment with relation transformers for occluded person re-identification,” IEEE Signal Processing Letters 28, 1155–1159 (2021). https://doi.org/10.1109/LSP.2021.3087079
16. Li, Y., He, J., Zhang, T., Liu, X., Zhang, Y. and Wu, F., “Diverse part discovery: Occluded person re-identification with part-aware transformer,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2898–2907 (2021).
17. He, K., Zhang, X., Ren, S. and Sun, J., “Deep residual learning for image recognition,” in 2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).
18. Zhou, K., Yang, Y., Cavallaro, A. and Xiang, T., “Omni-scale feature learning for person re-identification,” in 2019 IEEE/CVF Inter. Conf. on Computer Vision (ICCV), 3701–3711 (2019).
19. Zhong, Z., Zheng, L., Kang, G., Li, S. and Yang, Y., “Random erasing data augmentation,” (2017).
20. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., “Rethinking the inception architecture for computer vision,” in 2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2818–2826 (2016).
21. Zhu, K., Guo, H., Liu, Z., Tang, M. and Wang, J., “Identity-guided human semantic parsing for person re-identification,” Computer Vision—ECCV 2020, 346–363 (2020). https://doi.org/10.1007/978-3-030-58580-8