Energy-guided test-time adaptation for data shifts in multi-modal perception
Yun Pei, Lingbo Liu, Runqing Jiang, Ye Zhang, Pengpeng Yu, Liang Lin, Yulan Guo
The Visual Computer

Abstract


In multi-modal perception tasks, test-phase data often suffers from environmental noise and sensor degradation, which causes distribution shifts from the training phase. Test-time adaptation (TTA) is an emerging unsupervised learning strategy that allows pre-trained models to adapt to new data distributions during testing without access to source-domain data. However, existing TTA methods, primarily designed for single-modal data, often struggle with multi-modal data shifts: they tend to rely on high-confidence pseudo-labels to update model parameters, which can lead to performance worse than the unadapted model when all modalities are corrupted, and they are prone to catastrophic forgetting. To address these issues, we propose an Energy-guided Two-stage Test-time Adaptation (Eng2TTA) framework specifically designed for multi-modal perception. In the first stage, an energy-guided loss function optimizes local model parameters by smoothing class distributions within each batch, thereby reducing overconfidence caused by noisy pseudo-labels. Concurrently, a memory bank stores the most representative high-confidence sample features for each class. In the second stage, predictions for low-confidence samples are refined by querying the memory bank via feature similarity, leveraging reliable high-confidence information without additional parameter updates and thereby effectively mitigating catastrophic forgetting. Our method demonstrates superior robustness in multi-modal tasks, significantly outperforming state-of-the-art methods under varying levels of modality corruption, particularly under severe distribution shifts.
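
To make the first stage concrete, below is a minimal sketch of one test-time update step, assuming a PyTorch-style model that returns per-sample features and logits. The specific energy term (negative log-sum-exp of the logits), the batch-level smoothing term, their equal weighting, and the names `stage1_step` and `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def stage1_step(model, optimizer, batch, memory_bank, tau=0.9):
    """One stage-1 update: energy-guided loss plus memory-bank bookkeeping (sketch)."""
    feats, logits = model(batch)              # assumed interface: (features, logits)

    # Free energy per sample: -logsumexp of the logits (a common energy-based score).
    energy = -torch.logsumexp(logits, dim=1)

    # Batch-averaged class distribution; penalizing its sharpness (negative entropy)
    # smooths the per-batch class distribution and discourages collapse onto
    # a few noisy, overconfident pseudo-labels.
    mean_probs = logits.softmax(dim=1).mean(dim=0)
    smoothing = (mean_probs * mean_probs.clamp_min(1e-8).log()).sum()

    loss = energy.mean() + smoothing          # weighting between terms is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the most confident feature seen so far for each predicted class;
    # stage 2 queries these stored prototypes without further parameter updates.
    probs = logits.softmax(dim=1).detach()
    conf, pred = probs.max(dim=1)
    for f, c, p in zip(feats.detach(), pred, conf):
        cls = int(c)
        if p >= tau and (cls not in memory_bank or float(p) > memory_bank[cls][1]):
            memory_bank[cls] = (f, float(p))
    return logits
```

Keeping at most one high-confidence prototype per class keeps the memory bank compact, and all refinement work is deferred to the second stage, which touches no parameters.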

 

 

Framework


 

 

 

Experiment


 

 

Conclusion


This paper introduces Eng2TTA, a two-stage test-time adaptation framework specifically designed to tackle distribution shifts in multi-modal perception tasks. Recognizing the pitfalls of standard entropy minimization and pseudo-labeling in noisy multi-modal settings, the proposed method introduces an energy-guided loss in the first stage. This smooths batch-wise predictions, fostering adaptation that is less dependent on potentially unreliable high-confidence samples. At the same time, a parameter-free memory bank stores the most confident class features encountered online. The second stage refines uncertain predictions through similarity matching against these stored features, effectively decoupling refinement from parameter updates and thus preventing catastrophic forgetting. Extensive experiments demonstrate Eng2TTA's superior ability to maintain performance under diverse and severe corruptions affecting one or more modalities.
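
As a complement to the stage-1 sketch above, the parameter-free second stage can be pictured as a gradient-free lookup: low-confidence samples are re-labeled by similarity to the per-class features stored in the memory bank. The use of cosine similarity, the confidence threshold, and the function name `refine_predictions` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # no parameter updates: refinement is decoupled from adaptation
def refine_predictions(feats, logits, memory_bank, conf_thresh=0.5):
    """Stage-2 sketch: replace low-confidence predictions by the nearest stored class prototype."""
    probs = logits.softmax(dim=1)
    conf, pred = probs.max(dim=1)
    if not memory_bank:
        return pred

    classes = sorted(memory_bank.keys())
    prototypes = torch.stack([memory_bank[c][0] for c in classes]).to(feats.device)  # (K, D)

    # Cosine similarity between each sample feature and each class prototype.
    sims = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).t()            # (B, K)
    nearest = sims.argmax(dim=1)

    refined = pred.clone()
    low_conf = conf < conf_thresh
    if low_conf.any():
        matched = [classes[i] for i in nearest[low_conf].tolist()]
        refined[low_conf] = torch.tensor(matched, device=pred.device)
    return refined
```

Because this lookup never backpropagates, the adapted parameters from stage 1 are left untouched, which is what keeps the refinement free of catastrophic forgetting.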