Energy-guided test-time adaptation for data shifts in multi-modal perception
Yun Pei, Lingbo Liu, Runqing Jiang, Ye Zhang, Pengpeng Yu, Liang Lin, Yulan Guo
The Visual Computer

Abstract


In multi-modal perception tasks, test-phase data often suffers from environmental noise and sensor degradation, which causes distribution shifts from the training phase. Test-time adaptation (TTA) is an emerging unsupervised learning strategy that allows pre-trained models to adapt to new data distributions during testing without access to source-domain data. However, existing TTA methods, primarily designed for single-modal data, often struggle with multi-modal data shifts: they tend to rely on high-confidence pseudo-labels to update model parameters, which can lead to performance worse than the unadapted model when all modalities are corrupted, and they are prone to catastrophic forgetting. To address these issues, we propose an Energy-guided Two-stage Test-time Adaptation (Eng2TTA) framework specifically designed for multi-modal perception. In the first stage, an energy-guided loss function optimizes local model parameters by smoothing class distributions within each batch, thereby reducing overconfidence caused by noisy pseudo-labels. Concurrently, a memory bank stores the most representative high-confidence sample features for each class. In the second stage, predictions for low-confidence samples are refined by querying the memory bank via feature similarity, leveraging reliable high-confidence information without additional parameter updates and thereby effectively mitigating catastrophic forgetting. Our method demonstrates superior robustness in multi-modal tasks, significantly outperforming state-of-the-art methods under varying levels of modality corruption, particularly under severe distribution shifts.
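
To make the first stage concrete, below is a minimal sketch of one test-time update step, assuming a PyTorch-style model that returns per-sample features and logits. The specific energy term (negative log-sum-exp of the logits), the batch-level smoothing term, their equal weighting, and the names `stage1_step` and `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def stage1_step(model, optimizer, batch, memory_bank, tau=0.9):
    """One stage-1 update: energy-guided loss plus memory-bank bookkeeping (sketch)."""
    feats, logits = model(batch)              # assumed interface: (features, logits)

    # Free energy per sample: -logsumexp of the logits (a common energy-based score).
    energy = -torch.logsumexp(logits, dim=1)

    # Batch-averaged class distribution; penalizing its sharpness (negative entropy)
    # smooths the per-batch class distribution and discourages collapse onto
    # a few noisy, overconfident pseudo-labels.
    mean_probs = logits.softmax(dim=1).mean(dim=0)
    smoothing = (mean_probs * mean_probs.clamp_min(1e-8).log()).sum()

    loss = energy.mean() + smoothing          # weighting between terms is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the most confident feature seen so far for each predicted class;
    # stage 2 queries these stored prototypes without further parameter updates.
    probs = logits.softmax(dim=1).detach()
    conf, pred = probs.max(dim=1)
    for f, c, p in zip(feats.detach(), pred, conf):
        cls = int(c)
        if p >= tau and (cls not in memory_bank or float(p) > memory_bank[cls][1]):
            memory_bank[cls] = (f, float(p))
    return logits
```

Keeping at most one high-confidence prototype per class keeps the memory bank compact, and all refinement work is deferred to the second stage, which touches no parameters.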

 

 

Framework


 

 

 

Experiment


 

 

Conclusion


This paper introduces Eng2TTA, a two-stage test-time adaptation framework specifically designed to tackle distribution shifts in multi-modal perception tasks. Recognizing the pitfalls of standard entropy minimization and pseudo-labeling in noisy multi-modal settings, the proposed method introduces an energy-guided loss in the first stage. This smooths batch-wise predictions, fostering adaptation that is less dependent on potentially unreliable high-confidence samples. At the same time, a parameter-free memory bank stores the most confident class features encountered online. The second stage refines uncertain predictions through similarity matching against these stored features, effectively decoupling refinement from parameter updates and thus preventing catastrophic forgetting. Extensive experiments demonstrate Eng2TTA's superior ability to maintain performance under diverse and severe corruptions affecting one or more modalities.
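
As a complement to the stage-1 sketch above, the parameter-free second stage can be pictured as a gradient-free lookup: low-confidence samples are re-labeled by similarity to the per-class features stored in the memory bank. The use of cosine similarity, the confidence threshold, and the function name `refine_predictions` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # no parameter updates: refinement is decoupled from adaptation
def refine_predictions(feats, logits, memory_bank, conf_thresh=0.5):
    """Stage-2 sketch: replace low-confidence predictions by the nearest stored class prototype."""
    probs = logits.softmax(dim=1)
    conf, pred = probs.max(dim=1)
    if not memory_bank:
        return pred

    classes = sorted(memory_bank.keys())
    prototypes = torch.stack([memory_bank[c][0] for c in classes]).to(feats.device)  # (K, D)

    # Cosine similarity between each sample feature and each class prototype.
    sims = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).t()            # (B, K)
    nearest = sims.argmax(dim=1)

    refined = pred.clone()
    low_conf = conf < conf_thresh
    if low_conf.any():
        matched = [classes[i] for i in nearest[low_conf].tolist()]
        refined[low_conf] = torch.tensor(matched, device=pred.device)
    return refined
```

Because this lookup never backpropagates, the adapted parameters from stage 1 are left untouched, which is what keeps the refinement free of catastrophic forgetting.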