CVPR 2021
Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting
Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, and Liang Lin

Abstract


Crowd counting is a fundamental yet challenging task that requires rich information to generate pixel-wise crowd density maps. However, most previous methods use only the limited information of RGB images and cannot reliably discover potential pedestrians in unconstrained scenarios. In this work, we find that incorporating optical and thermal information greatly helps to recognize pedestrians. To promote future research in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) to fully capture the complementary information of different modalities. Specifically, our IADM incorporates two collaborative information transfers to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and also achieves superior performance on the ShanghaiTechRGBD [22] dataset. Finally, our source code and benchmark have been released at http://lingboliu.com/RGBT_Crowd_Counting.html.
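To make the dual information propagation idea concrete, the following is a minimal conceptual sketch (not the paper's actual IADM; the function name `iadm_step`, the residual update form, and the mixing weight `w` are all hypothetical illustrations of aggregation followed by distribution between branches):

```python
import numpy as np

def iadm_step(f_rgb, f_thermal, f_shared, w=0.5):
    """Hypothetical sketch of one aggregation-distribution step.

    Aggregation: the modality-shared representation absorbs
    complementary cues from both modality-specific branches.
    Distribution: the updated shared representation is propagated
    back to enhance each modality-specific branch (residual form).
    """
    # Information aggregation: shared branch gathers modality cues.
    f_shared = f_shared + w * (f_rgb - f_shared) + w * (f_thermal - f_shared)
    # Information distribution: shared cues flow back to each branch.
    f_rgb = f_rgb + w * (f_shared - f_rgb)
    f_thermal = f_thermal + w * (f_shared - f_thermal)
    return f_rgb, f_thermal, f_shared
```

In the actual framework these transfers are learned modules inside a deep network; the sketch only illustrates the two-way (aggregate, then distribute) flow between the modality-specific and modality-shared branches.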

Framework


Experiment



Conclusion


In this work, we propose to incorporate optical and thermal information to estimate crowd counts in unconstrained scenarios. To this end, we introduce the first RGBT crowd counting benchmark, with 2,030 pairs of RGB-thermal images and 138,389 annotated people. Moreover, we develop a cross-modal collaborative representation learning framework, which utilizes a tailor-designed Information Aggregation-Distribution Module to fully capture the complementary information of different modalities. Extensive experiments on two real-world benchmarks show the effectiveness and universality of the proposed method for multimodal (e.g., RGBT and RGBD) crowd counting.