Our proposed ExposeAnyone approach is fully self-supervised and therefore manipulation-agnostic, yet achieves state-of-the-art generalizability and robustness.
Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations because they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. Self-supervised methods, in contrast, offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone.
In this paper, we propose ExposeAnyone, a fully self-supervised approach based on diffusion models that generate expression sequences from audio. Our key finding is that, given a reference set of a specific subject to be authenticated, our personalized audio-to-expression diffusion models distinguish real videos of the subject from deepfakes mimicking them.
Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, on which previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.
The blue and red lines show the per-frame fakeness scores of a real video of the subject and of a deepfake video mimicking the subject, respectively. Our model exposes the talking-identity discrepancy between the personalized subject and the manipulated subject.
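The scoring idea behind such a plot can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes the personalized model reconstructs a (hypothetical) per-frame expression representation from audio, and that a real video of the subject is reconstructed more faithfully than a deepfake. The array shapes, the L2 distance, and the mean aggregation are all illustrative assumptions.

```python
import numpy as np

def frame_fakeness_scores(observed, reconstructed):
    """Per-frame fakeness score: distance between the observed expression
    features and those reconstructed by the personalized model.
    observed, reconstructed: (T, D) arrays (T frames, D feature dims)."""
    return np.linalg.norm(observed - reconstructed, axis=1)

def video_fakeness(observed, reconstructed):
    # Aggregate per-frame scores into a single video-level score
    # (mean here; the paper's actual aggregation may differ).
    return float(frame_fakeness_scores(observed, reconstructed).mean())

# Toy illustration with synthetic data (not real model outputs):
rng = np.random.default_rng(0)
real_expr = rng.normal(size=(100, 64))
# A personalized model reconstructs the enrolled subject closely...
recon_real = real_expr + rng.normal(scale=0.05, size=(100, 64))
# ...but reconstructs an impostor's expression dynamics poorly.
fake_expr = rng.normal(size=(100, 64))
recon_fake = fake_expr + rng.normal(scale=0.5, size=(100, 64))

# Real videos receive lower fakeness scores than deepfakes.
assert video_fakeness(real_expr, recon_real) < video_fakeness(fake_expr, recon_fake)
```

Thresholding or ranking these video-level scores then yields the AUC-style evaluation reported above.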
On our Sora2 Cameo Forensics Preview (S2CFP) dataset
On DF-TIMIT dataset [Korshunov+, arXiv18]
On DFDCP dataset [Dolhansky+, arXiv20]
On KoDF dataset [Kwon+, ICCV21]
On IDForge dataset [Xu+, MM24]
We evaluate generalization to unseen manipulations on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets. Our method achieves the best average AUC.
Our model is also capable of detecting Sora2-generated videos while previous methods perform poorly.
Our method is highly robust to corruptions, especially JPEG and video compression, which detectors encounter most frequently in real-world scenarios.
@article{shiohara2026exposeanyone,
author = {Shiohara, Kaede and Yamasaki, Toshihiko and Golyanik, Vladislav},
title = {ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors},
journal = {arXiv:2601.02359},
year = {2026},
}