Crowd counting is a fundamental yet challenging task that requires rich information to generate pixel-wise crowd density maps. However, most previous methods used only the limited information of RGB images and cannot discover potential pedestrians well in unconstrained scenarios. In this work, we find that incorporating optical and thermal information greatly helps to recognize pedestrians. To promote future research in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) that fully captures the complementary information of different modalities. Specifically, our IADM incorporates two collaborative information transfers to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and also achieves superior performance on the ShanghaiTechRGBD dataset. Finally, our source code and benchmark have been released at http://lingboliu.com/RGBT_Crowd_Counting.html.
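The dual information propagation idea (aggregate complementary information from the modality-specific branches into the shared branch, then distribute the enhanced shared representation back) can be sketched with a toy NumPy example. The residual form and the gate `alpha` here are illustrative assumptions; the actual IADM realizes these transfers with learned convolutional operations.

```python
import numpy as np

def iadm_step(f_rgb, f_thermal, f_shared, alpha=0.5):
    """Toy sketch of one aggregation-distribution step.

    Aggregation: the modality-shared feature absorbs the complementary
    residuals of both modality-specific features.
    Distribution: the enhanced shared feature is propagated back to
    refine each modality-specific feature.
    """
    # Aggregation: the shared branch collects residual information
    # from both modality-specific branches.
    f_shared = f_shared + alpha * ((f_rgb - f_shared) + (f_thermal - f_shared))
    # Distribution: each modality-specific branch is refined with
    # the updated shared representation.
    f_rgb = f_rgb + alpha * (f_shared - f_rgb)
    f_thermal = f_thermal + alpha * (f_shared - f_thermal)
    return f_rgb, f_thermal, f_shared

# Toy 2x2 feature maps standing in for branch activations.
f_rgb, f_thermal, f_shared = iadm_step(
    np.ones((2, 2)), 2 * np.ones((2, 2)), np.zeros((2, 2))
)
```

After one step, the shared feature moves toward a fusion of both modalities while each modality-specific feature is nudged toward the shared one, mirroring the dual propagation described in the abstract.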
In this work, we propose to incorporate optical and thermal information to estimate crowd counts in unconstrained scenarios. To this end, we introduce the first RGBT crowd counting benchmark, with 2,030 pairs of RGB-thermal images and 138,389 annotated people. Moreover, we develop a cross-modal collaborative representation learning framework, which utilizes a tailor-designed Information Aggregation-Distribution Module to fully capture the complementary information of different modalities. Extensive experiments on two real-world benchmarks show the effectiveness and universality of the proposed method for multimodal (e.g., RGBT and RGBD) crowd counting.