Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
Ziliang Chen, Xin Huang, Xiaoxuan Fan, Keze Wang, Yuyu Zhou, Quanlong Guan, Liang Lin
CVPR 2025

Abstract


As a milestone of modern multimodal intelligence, Contrastive Language-Image Pre-training (CLIP) has attracted massive research interest in its generalization mechanism. Existing studies, however, remain limited to the scope of pre-training knowledge and hardly explain generalization to the countless open-world concepts absent from the pre-training regime. This paper investigates this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, with regard to OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although the image features of OOP concepts exhibit significant category margins, their zero-shot transfer fails badly due to poor image-text alignment. To address this, we elaborate the "name-tuning" methodology and its theoretical merits for OOP generalization, and then propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms that achieve OOP generalization in a data-efficient manner. Their superiority is further verified in our comprehensive experiments.
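Since the abstract attributes OOP failure to poor image-text alignment rather than weak image features, the minimal sketch below illustrates one plausible reading of few-shot name learning: keep a reproducible OpenCLIP backbone frozen and optimize only a per-class "name" vector in the joint embedding space from a handful of labeled OOP images. The checkpoint tag, class count, temperature, initialization, and the fsnl_step helper are illustrative assumptions, not the paper's actual FSNL implementation.

# Hypothetical sketch (not the paper's exact FSNL algorithm): learn per-class "name"
# vectors for OOP concepts in CLIP's joint embedding space while keeping the
# reproducible OpenCLIP backbone frozen, using a few labeled images per concept.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# A reproducible LAION-trained checkpoint; the tag is chosen for illustration.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model = model.to(device).eval()
for p in model.parameters():            # the pre-trained encoders stay frozen
    p.requires_grad_(False)

num_oop_classes, embed_dim = 10, 512    # 512 = joint embedding width of ViT-B-32 CLIP
# Learnable "names": one text-side vector per OOP concept (assumed parameterization).
name_vectors = torch.nn.Parameter(
    0.02 * torch.randn(num_oop_classes, embed_dim, device=device))
optimizer = torch.optim.AdamW([name_vectors], lr=1e-3)

def fsnl_step(images, labels, temperature=0.01):
    """One few-shot name-learning step on a batch of (image, OOP-label) pairs.
    Images are assumed to be already transformed with `preprocess`."""
    with torch.no_grad():
        img_feats = F.normalize(model.encode_image(images.to(device)), dim=-1)
    names = F.normalize(name_vectors, dim=-1)
    logits = img_feats @ names.t() / temperature    # cosine-similarity classifier
    loss = F.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At inference, an OOP image would be assigned to the learned name vector with the highest cosine similarity, mirroring CLIP's zero-shot protocol but with learned rather than tokenized concept names.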


Framework



Experiment



Conclusion


This research proposes LAION-Beyond, which not only illuminates the abilities and limitations of vision-language models on OOP concepts but also inspires few-shot and zero-shot name learning strategies for OOP generalization, contributing to the advancement of more adaptable multimodal systems.