Crowd counting is a fundamental yet challenging task that requires rich information to generate pixel-wise crowd density maps. However, most previous methods used only the limited information of RGB images and cannot discover potential pedestrians well in unconstrained scenarios. In this work, we find that incorporating optical and thermal information greatly helps to recognize pedestrians. To promote future research in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) that fully captures the complementary information of different modalities. Specifically, our IADM incorporates two collaborative information transfers to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and also achieves superior performance on the ShanghaiTechRGBD dataset. Finally, our source code and benchmark have been released at http://lingboliu.com/RGBT_Crowd_Counting.html.
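The dual information propagation idea (aggregate complementary information from the modality-specific branches into the shared branch, then distribute the enhanced shared representation back) can be sketched with a toy NumPy example. The residual form and the gate `alpha` here are illustrative assumptions; the actual IADM realizes these transfers with learned convolutional operations.

```python
import numpy as np

def iadm_step(f_rgb, f_thermal, f_shared, alpha=0.5):
    """Toy sketch of one aggregation-distribution step.

    Aggregation: the modality-shared feature absorbs the complementary
    residuals of both modality-specific features.
    Distribution: the enhanced shared feature is propagated back to
    refine each modality-specific feature.
    """
    # Aggregation: the shared branch collects residual information
    # from both modality-specific branches.
    f_shared = f_shared + alpha * ((f_rgb - f_shared) + (f_thermal - f_shared))
    # Distribution: each modality-specific branch is refined with
    # the updated shared representation.
    f_rgb = f_rgb + alpha * (f_shared - f_rgb)
    f_thermal = f_thermal + alpha * (f_shared - f_thermal)
    return f_rgb, f_thermal, f_shared

# Toy 2x2 feature maps standing in for branch activations.
f_rgb, f_thermal, f_shared = iadm_step(
    np.ones((2, 2)), 2 * np.ones((2, 2)), np.zeros((2, 2))
)
```

After one step, the shared feature moves toward a fusion of both modalities while each modality-specific feature is nudged toward the shared one, mirroring the dual propagation described in the abstract.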
In this work, we propose to incorporate optical and thermal information to estimate crowd counts in unconstrained scenarios. To this end, we introduce the first RGBT crowd counting benchmark, with 2,030 pairs of RGB-thermal images and 138,389 annotated people. Moreover, we develop a cross-modal collaborative representation learning framework, which utilizes a tailor-designed Information Aggregation-Distribution Module to fully capture the complementary information of different modalities. Extensive experiments on two real-world benchmarks show the effectiveness and universality of the proposed method for multimodal (e.g., RGBT and RGBD) crowd counting.