VideoQA in the Era of LLMs: An Empirical Study

International Journal of Computer Vision 2025

Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua & Angela Yao

International Journal of Computer Vision 2025

Abstract

Video Large Language Models (Video-LLMs) are flourishing and has advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays pivotal role in Video-LLM developing. This work conducts a timely and comprehensive study of Video-LLMs’ behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively - they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs’ QA capability in standard condition yet highlight their severe deficiency in robustness and interpretability, suggesting the urgent need on rationales in Video-LLM developing.

Framework

Experiment

Conclusion

This paper has vetted Video-LLMs’ performance in VideoQA by probing their success and failure modes with well-designed adversarial tests. While Video-LLMs generally show better QA accuracy, the decrease rates and flip rates align with or even higher than nonLLM methods when faced with adversarial challenges. Specifically, Video-LLMs show significant limitation in coping with video temporality, of both reasoning the chronological content order and grounding the temporal moments to substantiate the answers. Also, they are unresponsive towards video perturbation while being susceptible to simple language variations of questions and candidate answers. Additionally, Video-LLMs, after finetuned, are not necessarily generalize better. Understanding these limitations is crucial for developing future Video-LLMs and VideoQA techniques, where we also conclude some promising directions to proceed.

中山大学人机物智能融合实验室 Human Cyber Physical Intelligence Integration Lab

hcp@sysu.edu.cn
广州市广州大学城外环东路132号

Official Account

News: Achievements; Activities; sharings; Talks

People: Faculty; Students; Alumni

Projects: Computer Vision; Multimodal; Robotics

Links: Git-Lab