Overview
We offer a benchmark suite together with an evaluation server, so that authors can upload their results and get a ranking. The dataset contains more than 50,000 images: 30,462 for the training set, 10,000 for the validation set, and 10,000 for the test set. If you would like to submit your results, please register, log in, and follow the instructions on our submission page.
Note: We only display the highest-scoring submission of each participant.
Single-Person Human Parsing Track
Metrics
We use four metrics from common semantic segmentation and scene parsing evaluations; they are variations on pixel accuracy and region intersection over union (IoU), following the evaluation protocol of FCN. The four metrics are pixel accuracy (%), mean accuracy (%), mean IoU (%), and frequency-weighted IoU (%). The Details entries below list the per-class IoU (%).
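For reference, the following is a minimal sketch (not the official evaluation code) of how these four scores can be computed from a confusion matrix accumulated over the whole test set; the 20-class layout (background plus 19 clothing and body-part labels) is taken from the per-class tables below, and the function names are placeholders.

```python
import numpy as np

def accumulate(conf, gt, pred, num_classes=20):
    """Add one image's ground-truth/prediction label maps to the confusion matrix."""
    gt, pred = gt.reshape(-1), pred.reshape(-1)
    mask = gt < num_classes                      # ignore any out-of-range labels
    idx = num_classes * gt[mask].astype(int) + pred[mask]
    conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf

def parsing_metrics(conf):
    """Four FCN-style scores from a confusion matrix where conf[i, j] counts
    pixels with ground-truth class i that were predicted as class j."""
    conf = conf.astype(np.float64)
    gt_pixels = conf.sum(axis=1)                 # pixels per ground-truth class
    pred_pixels = conf.sum(axis=0)               # pixels per predicted class
    correct = np.diag(conf)                      # correctly labelled pixels per class

    pixel_acc = correct.sum() / conf.sum()
    mean_acc = np.nanmean(correct / gt_pixels)   # per-class accuracy, averaged over classes
    iou = correct / (gt_pixels + pred_pixels - correct)
    mean_iou = np.nanmean(iou)
    freq = gt_pixels / conf.sum()                # ground-truth class frequencies
    fw_iou = np.nansum(freq * iou)
    return 100 * pixel_acc, 100 * mean_acc, 100 * mean_iou, 100 * fw_iou
```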
Each leaderboard entry below lists the User Id, Method, Pixel accuracy (%), Mean accuracy (%), Mean IoU (%), Frequency-weighted IoU (%), the per-class Details, the Contributors and Description, and the Submit Time.
User Id: 41 | Method: WhiskNet
Pixel accuracy: 86.16 | Mean accuracy: 57.95 | Mean IoU: 47.74 | Frequency-weighted IoU: 76.45

Details (per-class IoU, %):
| background | hat | hair | glove | sunglasses | upper-clothes | dress | coat | socks | pants |
| 86.66 | 62.69 | 70.76 | 36.41 | 15.54 | 68.66 | 37.45 | 55.64 | 40.54 | 72.86 |
| jumpsuits | scarf | skirt | face | left-arm | right-arm | left-leg | right-leg | left-shoe | right-shoe |
| 26.23 | 18.67 | 29.57 | 73.46 | 57.09 | 58.50 | 41.72 | 41.34 | 30.17 | 30.88 |
Contributors: Haoshu Fang, Yuwing Tai, Cewu Lu

Description:
It has been demonstrated that multi-scale features are useful for improving the performance of semantic segmentation. However, without careful design of the network architecture, deep models such as ResNet-101 cannot fully utilize the atrous convolution structure proposed in [1] to leverage the advantage of multi-scale features. In this work, we propose WhiskNet, which uses ResNet building blocks to extract and incorporate very deep multi-scale features into a single network model. Moreover, WhiskNet adds an extra multi-atrous convolution for each scale, which achieves excellent performance when merging multi-scale features.
[1] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to Scale: Scale-aware Semantic Image Segmentation. CVPR 2016.
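As an illustration only (this is not the authors' code), a minimal sketch of the general pattern the description refers to: parallel atrous (dilated) convolutions at several rates applied to a shared ResNet feature map and merged into one output. The dilation rates, channel sizes, and the class name MultiAtrousBlock are assumptions, not the WhiskNet configuration.

```python
import torch
import torch.nn as nn

class MultiAtrousBlock(nn.Module):
    """Parallel atrous (dilated) convolutions over a shared feature map,
    merged by a 1x1 convolution. Illustrative sketch only."""
    def __init__(self, in_ch=2048, out_ch=512, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.merge = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [torch.relu(b(x)) for b in self.branches]  # same spatial size per branch
        return self.merge(torch.cat(feats, dim=1))         # fuse multi-rate context
```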
Submit time: 2017-06-03 14:22:05
User Id: 46 | Method: Self-Supervised Neural Aggregation Networks
Pixel accuracy: 87.29 | Mean accuracy: 63.35 | Mean IoU: 52.26 | Frequency-weighted IoU: 78.25

Details (per-class IoU, %):
| background | hat | hair | glove | sunglasses | upper-clothes | dress | coat | socks | pants |
| 87.91 | 66.54 | 73.14 | 40.23 | 27.30 | 70.47 | 39.09 | 58.03 | 44.25 | 74.52 |
| jumpsuits | scarf | skirt | face | left-arm | right-arm | left-leg | right-leg | left-shoe | right-shoe |
| 29.94 | 24.04 | 32.51 | 75.56 | 58.84 | 60.74 | 51.82 | 52.16 | 39.44 | 38.74 |
Contributors: ZHAO Jian (NUS & NUDT), NIE Xuecheng (NUS), XIAO Huaxin (NUS & NUDT), CHEN Yunpeng (NUS), LI Jianshu (NUS), YAN Shuicheng (NUS & Qihoo360 AI Institute) (the first three authors contributed equally)

Description:
We present a Self-Supervised Neural Aggregation Network (SS-NAN) for human parsing. SS-NAN adaptively learns to aggregate multi-scale features at each pixel "address". To further improve the discriminative capacity of the features, a self-supervised joint loss is adopted as an auxiliary learning strategy, which imposes human joint structure on the parsing results without resorting to extra supervision. The proposed SS-NAN is end-to-end trainable and can be integrated into any advanced neural network to help aggregate features according to their importance at different positions and scales, and to incorporate rich high-level knowledge of human joint structure from a global perspective, which in turn improves the parsing results. Moreover, to further boost the overall performance of SS-NAN for human parsing, we also leverage a robust multi-view strategy with different state-of-the-art backbone models.
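As an illustration only (not the SS-NAN release), a minimal sketch of the per-pixel scale-aggregation idea described above: score maps predicted at several input scales are weighted pixel by pixel with learned attention and summed. The class name, channel sizes, and number of scales are assumptions, and the self-supervised joint loss is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAggregation(nn.Module):
    """Learn per-pixel weights over score maps from several scales and fuse them."""
    def __init__(self, num_classes=20, num_scales=3):
        super().__init__()
        # predicts one attention logit per scale at every pixel
        self.attention = nn.Conv2d(num_classes * num_scales, num_scales,
                                   kernel_size=3, padding=1)

    def forward(self, score_maps):
        # score_maps: list of (N, num_classes, H, W) tensors, resized to a common H x W
        stacked = torch.cat(score_maps, dim=1)
        weights = F.softmax(self.attention(stacked), dim=1)   # (N, num_scales, H, W)
        fused = sum(w.unsqueeze(1) * s                        # weight each scale per pixel
                    for w, s in zip(weights.unbind(dim=1), score_maps))
        return fused
```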
Submit time: 2017-06-04 13:23:59
User Id: 107 | Method: BUPTMM-Parsing
Pixel accuracy: 84.93 | Mean accuracy: 55.62 | Mean IoU: 45.44 | Frequency-weighted IoU: 74.60

Details (per-class IoU, %):
| background | hat | hair | glove | sunglasses | upper-clothes | dress | coat | socks | pants |
| 85.20 | 60.25 | 68.60 | 32.11 | 21.38 | 66.51 | 32.41 | 55.08 | 35.01 | 69.19 |
| jumpsuits | scarf | skirt | face | left-arm | right-arm | left-leg | right-leg | left-shoe | right-shoe |
| 25.37 | 14.77 | 24.92 | 72.12 | 53.75 | 55.82 | 39.82 | 40.16 | 28.33 | 27.97 |
Contributors: Peng Cheng, Xiaodong Liu, Peiye Liu, Wu Liu

Description:
We revised and fine-tuned Attention+SSL [1] and Attention to Scale [2] on the LIP training set, and then combined the two models with different fusion strategies.
[1] "Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing", Ke Gong, Xiaodan Liang, Xiaohui Shen, Liang Lin, CVPR 2017.
[2] Attention to Scale: Scale-aware Semantic Image Segmentation Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu and Alan L Yuille CVPR 2016 |
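The submission does not specify which fusion strategies were used; as one hedged example, here is a sketch of simple probability-level fusion, where the per-pixel class probabilities of the two models are mixed and the argmax taken. The function name and mixing weight are assumptions.

```python
import numpy as np

def fuse_probability_maps(prob_a, prob_b, weight_a=0.5):
    """Mix the per-pixel class probabilities of two parsing models.

    prob_a, prob_b: arrays of shape (num_classes, H, W), e.g. softmax outputs
    of the two fine-tuned networks. weight_a is an assumed mixing weight.
    """
    fused = weight_a * prob_a + (1.0 - weight_a) * prob_b
    return fused.argmax(axis=0)          # final per-pixel label map
```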
Submit time: 2017-06-04 14:54:06
User Id: 52 | Method: VSNet-SLab+Samsung
Pixel accuracy: 87.06 | Mean accuracy: 66.73 | Mean IoU: 54.13 | Frequency-weighted IoU: 77.98

Details (per-class IoU, %):
| background | hat | hair | glove | sunglasses | upper-clothes | dress | coat | socks | pants |
| 87.31 | 66.42 | 72.11 | 43.17 | 31.09 | 69.09 | 41.65 | 56.68 | 42.70 | 74.42 |
| jumpsuits | scarf | skirt | face | left-arm | right-arm | left-leg | right-leg | left-shoe | right-shoe |
| 31.98 | 21.69 | 33.29 | 74.13 | 62.64 | 64.35 | 59.03 | 59.40 | 45.69 | 45.68 |
Contributors: Lejian Ren, Renda Bao, Yao Sun, Si Liu (IIE, CAS); Yinglu Liu, Yanli Li, Junjun Xiong (Beijing Samsung Telecom R&D Center)

Description:
We have proposed a view-specific contextual human parsing method. It has two core contributions. (1) The model has a cascade structure consisting of a view classifier and a corresponding human parsing model for each specific view. The view classifier predicts whether the human is in frontal or back view; the view ground truth is generated automatically by analyzing the parsing ground truth with human knowledge. We observe that the IoUs of the left/right legs and left/right shoes are significantly boosted on the validation set. (2) We train a category classifier to estimate the labels of the images [1]. The classification results serve as context for the parsing and boost the performance. Two human parsing models based on RefineNet [2] and PSPNet [3] are implemented, and the best results were obtained by combining them. No extra datasets were used.
[1] Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, Shuicheng Yan. Human Parsing with Contextualized Convolutional Neural Network. TPAMI, 2016.
[2] Guosheng Lin, Anton Milan, Chunhua Shen, Ian Reid. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. CVPR 2017.
[3] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia. Pyramid Scene Parsing Network. CVPR 2017.
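As an illustration only (not the VSNet implementation), a minimal sketch of the cascade idea in contribution (1): a view classifier decides between frontal and back view and routes the image to the corresponding view-specific parsing model. All three callables are hypothetical placeholders, and the category-classifier context of contribution (2) is omitted.

```python
import numpy as np

def cascade_parse(image, view_classifier, frontal_parser, back_parser):
    """Route an image to a view-specific parsing model.

    view_classifier(image) is assumed to return P(frontal view); the two
    parser callables return (num_classes, H, W) score maps.
    """
    p_frontal = view_classifier(image)
    scores = frontal_parser(image) if p_frontal >= 0.5 else back_parser(image)
    return np.argmax(scores, axis=0)     # per-pixel class labels
```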
Submit time: 2017-06-04 15:14:38