FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection

Reichman University


Video synthesis methods have improved rapidly in recent years, allowing easy creation of synthetic humans. This poses a problem, especially in the era of social media, as synthetic videos of speaking humans can be used to spread misinformation in a convincing manner. Thus, there is a pressing need for accurate and robust deepfake detection methods that can detect forgery techniques not seen during training.

In this work, we explore whether this can be done by leveraging a multi-modal, out-of-domain backbone trained in a self-supervised manner and adapted to the video deepfake domain. We propose FakeOut, a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase. We demonstrate the efficacy and robustness of FakeOut in detecting various types of deepfakes, especially manipulations that were not seen during training. Our method achieves state-of-the-art results in cross-manipulation and cross-dataset generalization. This study shows that, perhaps surprisingly, training on out-of-domain videos (i.e., videos with no speaking humans) can lead to better deepfake detection systems. Code is available on GitHub.

Cross-dataset generalization. Video-level ROC-AUC (%) when fine-tuning on FaceForensics++ and testing on FaceShifter (FSh), DeeperForensics (DFo), Celeb-DF-v2 (CDF) and both versions of the Deepfake Detection Challenge (DFDC) test set, full and filtered. V and A indicate the use of the video and audio modalities, respectively, during FakeOut fine-tuning. Best results are in bold.
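As an illustration of the metric named in this caption, the following is a minimal sketch of how a video-level ROC-AUC (%) could be computed from clip-level scores. It is not taken from the FakeOut code: the helper names, the toy data, and in particular the choice to average clip scores per video are assumptions made here for illustration only.

```python
from collections import defaultdict


def video_level_scores(clip_scores):
    """Average clip-level fake scores per video.

    Mean aggregation is an assumption of this sketch, not a detail
    confirmed by the source.
    """
    grouped = defaultdict(list)
    for video_id, score in clip_scores:
        grouped[video_id].append(score)
    return {vid: sum(s) / len(s) for vid, s in grouped.items()}


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (fake, real) pairs ranked correctly.

    Label 1 marks a fake video, label 0 a real one; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one fake video")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical clip-level outputs: (video id, predicted fake probability).
clip_scores = [("a", 0.9), ("a", 0.7), ("b", 0.2), ("b", 0.4), ("c", 0.6), ("d", 0.1)]
video_labels = {"a": 1, "b": 0, "c": 1, "d": 0}  # hypothetical ground truth

per_video = video_level_scores(clip_scores)
ids = sorted(per_video)
auc_percent = 100.0 * roc_auc(
    [video_labels[v] for v in ids],
    [per_video[v] for v in ids],
)
print(f"video-level ROC-AUC: {auc_percent:.1f}%")
```

The pairwise formulation is quadratic in the number of videos, which is fine for a sketch; a rank-based implementation from a standard library would be preferable on full test sets.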


Cross-manipulation generalization. Video-level ROC-AUC(%) when testing on each manipulation method of FaceForensics++, fine-tuning on the remaining three manipulations. Methods are Deepfakes (DF), FaceSwap (FS), Face2Face (F2F) and NeuralTextures (NT).
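The leave-one-out protocol described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function name is hypothetical, and the handling of real (unmanipulated) videos in each split is not shown because the caption does not specify it.

```python
# Manipulation methods of FaceForensics++, using the caption's abbreviations.
MANIPULATIONS = ["DF", "FS", "F2F", "NT"]


def leave_one_out_splits(methods):
    """Yield (fine_tune_methods, held_out_method) pairs, one per method."""
    for held_out in methods:
        fine_tune = [m for m in methods if m != held_out]
        yield fine_tune, held_out


splits = list(leave_one_out_splits(MANIPULATIONS))
for fine_tune, held_out in splits:
    print(f"fine-tune on {fine_tune}, test on {held_out}")
```

Each of the four splits would yield one ROC-AUC value, so the held-out manipulation is never seen during fine-tuning.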


Results for the dog-mask distractor category. We consider this category of videos to be an aggressive post-processing step, and argue that it should be left out of the filtered DFDC test set. In the table, we report video-level ROC-AUC (%) when fine-tuning on FaceForensics++ and testing on both versions of the Deepfake Detection Challenge (DFDC) test set, filtered and full, leaving out examples heavily post-processed with the dog-mask filter.


Intra-dataset evaluation. We show the video-level performance, in terms of accuracy (%) and ROC-AUC (%), of several approaches, including FakeOut. All models are fine-tuned on the FaceForensics++ train-set and evaluated on the FaceForensics++ test-set.



  title={FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection},
  author = {Knafo, Gil and Fried, Ohad},
  journal={arXiv preprint arXiv:2212.00773},