ExposeAnyone: Personalized Audio-to-Expression Diffusion Models are Robust Zero-Shot Face Forgery Detectors

1The University of Tokyo, 2Max Planck Institute for Informatics

ExposeAnyone is a fully self-supervised person-of-interest face forgery detector. We first pre-train our audio-to-expression diffusion model on a large-scale unlabeled video collection. Then, we personalize the pre-trained model on one or more reference videos of a subject by inserting a subject-specific adapter. Finally, we expose deepfake videos via the diffusion reconstruction distance. Despite self-supervision, our method generalizes to a wide range of manipulations, from traditional deepfakes to the latest video generation model, Sora2.
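As a rough illustration, the scoring step above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `fakeness_score` is a hypothetical helper, and the exact reconstruction distance used by ExposeAnyone may differ (here, a simple per-frame L2 distance averaged over the sequence).

```python
import numpy as np

def fakeness_score(observed: np.ndarray, reconstructed: np.ndarray):
    """Toy diffusion-reconstruction distance (illustrative only).

    observed      -- (T, D) expression coefficients extracted from the test video
    reconstructed -- (T, D) coefficients re-generated by the personalized
                     audio-to-expression model from the same audio track
    Returns per-frame distances and their mean as a video-level score.
    """
    per_frame = np.linalg.norm(observed - reconstructed, axis=1)  # (T,)
    return per_frame, float(per_frame.mean())
```

Intuitively, a model personalized to the subject reconstructs the subject's genuine expression dynamics well (low score), but poorly reconstructs a manipulated identity's dynamics (high score); the per-frame distances are what the fakeness curves below visualize.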

Video

Why is this work important?

  1. Previous work mainly focuses on supervised learning with real and fake videos. However, it is well known that such methods overfit to manipulation-specific artifacts and fail to detect unknown manipulations. Pseudo-fake augmentation strategies such as Face X-ray [Li+, CVPR20] and SBI [Shiohara+, CVPR22] mitigate the overfitting problem but cannot cover all forgery patterns.
  2. Self-supervised learning, as in AVAD [Feng+, CVPR23] and POI-Forensics [Cozzolino+, CVPRW23], is promising because it does not overfit to any manipulation-specific artifacts. However, existing self-supervised methods fail to learn features that accurately separate real and fake classes from self-supervision alone. As a result, current self-supervised methods are significantly inferior to supervised ones.

Our proposed ExposeAnyone approach is self-supervised, i.e., manipulation-agnostic, yet achieves state-of-the-art generalizability and robustness.

Abstract

Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision.

In this paper, we propose ExposeAnyone, a fully self-supervised approach based on diffusion models that generate expression sequences from audio. Our key finding is that, given reference sets of specific subjects to be authenticated, our personalized audio-to-expression diffusion models distinguish real videos of the subjects from deepfakes mimicking them.

Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability in real-world face forgery detection.

Fakeness Visualization on Real and Fake Videos

The blue and red lines show the frame-wise fakeness scores of a real video of the subject and of a deepfake video mimicking the subject, respectively. Our model exposes the talking-identity discrepancy between the personalized subject and the manipulated subject.

On our Sora2 Cameo Forensics Preview (S2CFP) dataset

On DF-TIMIT dataset [Korshunov+, arXiv18]

On DFDCP dataset [Dolhansky+, arXiv20]

On KoDF dataset [Kwon+, ICCV21]

On IDForge dataset [Xu+, MM24]

Comparison with Previous State-of-the-Art Methods

Generalization to Traditional Deepfakes

We evaluate generalization to unseen manipulations on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets. Our method achieves the best average AUC.


Generalization to Sora2

Our model is also capable of detecting Sora2-generated videos while previous methods perform poorly.


Robustness to Common Corruptions

Our method is highly robust to common corruptions, especially JPEG and video compression, which detectors encounter most frequently in real-world scenarios.

BibTeX

@article{shiohara2026exposeanyone,
  author    = {Shiohara, Kaede and Yamasaki, Toshihiko and Golyanik, Vladislav},
  title     = {ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors},
  journal   = {arXiv:2601.02359},
  year      = {2026},
}