Abstract
Despite promising progress in face swapping, realistic swapped images remain elusive and are often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face swapping by making the following core contributions. (a) We reframe the face-swapping task as a self-supervised, train-time inpainting problem, enhancing identity transfer while blending with the target image. (b) We introduce multi-step Denoising Diffusion Implicit Model (DDIM) sampling during training, reinforcing identity and perceptual similarities. (c) We introduce CLIP feature disentanglement to extract pose, expression, and lighting information from the target image, improving fidelity. (d) We introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping, with the additional capability of head swapping. Beyond traditional face swapping, our method can swap hair and even accessories. Unlike prior works that rely on multiple off-the-shelf models, ours is a relatively unified approach and is therefore resilient to errors in those models. Extensive experiments on the FFHQ and CelebA datasets validate the efficacy and robustness of our approach, showcasing high-fidelity, realistic face swapping with minimal inference time. Our code is available at REFace.
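To make contribution (b) concrete, the following is a minimal sketch of deterministic multi-step DDIM sampling (the standard update with eta = 0), not the REFace implementation. The names `ddim_step`, `ddim_sample`, and `eps_model`, the linear beta schedule, and the choice of four steps are assumptions made for illustration. The idea it illustrates is that a few DDIM steps yield a clean image estimate on which image-level objectives such as identity or perceptual similarity could be evaluated during training.

```python
import numpy as np


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    Returns the less noisy sample and the current estimate of the clean image.
    """
    # Estimate the clean image from the noisy sample and the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (lower) noise level.
    x_prev = np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps
    return x_prev, x0_pred


def ddim_sample(x_start, eps_model, alpha_bars, timesteps):
    """Run a short DDIM trajectory over descending `timesteps`.

    `eps_model(x, t)` is a hypothetical noise predictor; in a real pipeline it
    would be the conditioned denoising network.
    """
    x = x_start
    for i, t in enumerate(timesteps):
        # The final step maps to the noise-free level (alpha_bar = 1).
        last = i + 1 == len(timesteps)
        alpha_bar_prev = 1.0 if last else alpha_bars[timesteps[i + 1]]
        eps = eps_model(x, t)
        x, _ = ddim_step(x, eps, alpha_bars[t], alpha_bar_prev)
    return x


# Assumed linear beta schedule with 1000 training timesteps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
```

With only a handful of steps (for example `timesteps = [999, 749, 499, 249]`), the trajectory stays cheap enough to run inside a training iteration, which is the trade-off this kind of sampling is meant to exploit.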
Framework
Experiment
Conclusion
We proposed a train-time, diffusion-based inpainting pipeline for face swapping that produces realistic swaps. Our disentangled CLIP feature further improves the preservation of pose and expression. Furthermore, we proposed a simple mask shuffling technique that also handles the head-swapping task. While our method significantly improves both performance (in qualitative and quantitative results) and efficiency (i.e., inference time and training cost), there is still room for improvement, especially under extreme pose and expression variations, which we leave for future work.
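One plausible reading of mask shuffling is sketched below; it is an illustration under stated assumptions rather than the procedure used in the paper. It assumes a face-parsing label map is available, and that the inpainting mask for each training sample is built from a core set of face regions plus a random subset of optional regions such as hair and accessories. The label ids, region names, and the function `shuffled_mask` are hypothetical.

```python
import numpy as np

# Hypothetical label ids for a face-parsing map (0 is background).
REGIONS = {
    "skin": 1, "brows": 2, "eyes": 3, "nose": 4,
    "mouth": 5, "hair": 6, "accessories": 7,
}
FACE_CORE = ("skin", "brows", "eyes", "nose", "mouth")


def shuffled_mask(parsing, rng, optional=("hair", "accessories"), p=0.5):
    """Build an inpainting mask from a random combination of regions.

    The core face regions are always masked; each optional region is added
    independently with probability `p` (an assumed sampling rule).
    """
    chosen = list(FACE_CORE) + [r for r in optional if rng.random() < p]
    ids = [REGIONS[name] for name in chosen]
    # True marks pixels to be inpainted.
    return np.isin(parsing, ids), chosen


# Toy 3x3 parsing map: background, skin, hair, and accessories.
parsing = np.array([[0, 6, 6],
                    [1, 1, 7],
                    [0, 1, 0]])
mask, chosen = shuffled_mask(parsing, np.random.default_rng(0))
```

Under this reading, a single model sees both face-only masks and masks that include hair or accessories during training, so the region to swap at inference time can be selected simply by choosing the mask.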