Abstract
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs already present in existing datasets during continued training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%.
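The abstract does not spell out the training objective; the following is a minimal sketch of how a hard-negative margin term could be added on top of the standard CLIP InfoNCE loss inside an existing training step, which is the kind of "minimal code modification" the method claims. The function name, the K mined captions per image, and the margin value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negative_margin(image_emb, text_emb, hard_text_emb,
                                         temperature=0.07, margin=0.2):
    """Standard CLIP InfoNCE loss plus a hinge-style margin term that pushes
    each image away from pre-mined hard negative captions (a sketch, not the
    paper's exact HNML definition).

    image_emb:      (B, D) L2-normalized image embeddings
    text_emb:       (B, D) L2-normalized paired text embeddings
    hard_text_emb:  (B, K, D) L2-normalized embeddings of K mined hard captions
    """
    # Symmetric image-to-text / text-to-image contrastive loss, as in CLIP.
    logits = image_emb @ text_emb.t() / temperature                    # (B, B)
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Margin term: hard negatives should stay at least `margin` below the
    # similarity of the true pair.
    pos_sim = (image_emb * text_emb).sum(dim=-1, keepdim=True)         # (B, 1)
    hard_sim = torch.einsum('bd,bkd->bk', image_emb, hard_text_emb)    # (B, K)
    margin_term = F.relu(hard_sim - pos_sim + margin).mean()

    return contrastive + margin_term
```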
Framework
Experiment
Conclusion
In this work, we present HELIP, a framework that enhances pre-trained CLIP models by more effectively utilizing their original training datasets. HELIP offers a cost-effective and easily integrable solution for improving existing models without extensive retraining or additional data. Specifically, it treats each text-image pair as a point in the joint vision-language space and identifies hard pairs, i.e., pairs that lie close together in that space, using the Hard Pair Mining (HPM) strategy. Furthermore, to efficiently leverage these hard pairs, we introduce the Hard Negative Margin Loss (HNML). Empirically, we found that HELIP boosts the performance of existing checkpoints within a few epochs of continued training. Evaluations across various benchmarks, including zero-shot classification, image-text retrieval, and linear probing, demonstrate the effectiveness and efficiency of our method. These findings highlight that in the era of large-scale models and datasets, performance improvement can be achieved not only by collecting more data or scaling up models, but also by intelligently maximizing the utility of the data we already have.
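Since the conclusion describes HPM only at a high level, the sketch below illustrates one plausible reading: each text-image pair is mapped to a single point in the joint space, and its hard pairs are the nearest other pairs under cosine similarity. The fusion by concatenation and the choice of k are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_hard_pairs(image_emb, text_emb, k=8):
    """Hypothetical sketch of Hard Pair Mining (HPM): represent every
    text-image pair as one point in the joint vision-language space and,
    for each pair, return the indices of the k closest other pairs.

    image_emb, text_emb: (N, D) L2-normalized embeddings of the N training pairs
    """
    # One joint point per pair; concatenating the two embeddings is an
    # assumption, the actual fusion used by HPM may differ.
    pair_emb = F.normalize(torch.cat([image_emb, text_emb], dim=-1), dim=-1)  # (N, 2D)

    sim = pair_emb @ pair_emb.t()              # (N, N) pairwise cosine similarity
    sim.fill_diagonal_(-float('inf'))          # exclude each pair itself
    return sim.topk(k, dim=-1).indices         # (N, k) indices of hard pairs
```

The mined indices can then be used to fetch hard negative captions for each image when computing a margin-style loss such as the one sketched after the abstract.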
Acknowledgement
References