Few-Shot Deepfake Face Detection Method Based on Vision-Language Model
YANG Hongyu1,2, LI Xinghang1, CHENG Xiang3, HU Ze1
1. School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300; 2. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300; 3. College of Information Engineering, Yangzhou University, Yangzhou 225127
Abstract: Aiming at the limitations of existing deepfake face detection methods in model complexity, sample size requirements, and adaptability to new deepfake techniques, a few-shot deepfake face detection method based on a vision-language model (FDFD-VLM) is proposed. FDFD-VLM is built upon contrastive language-image pre-training (CLIP). Visual features are optimized through a face region extraction and high-frequency feature enhancement module. Prompt adaptability is improved by a class-name-free differentiated prompt optimization module, while the multimodal feature representation is strengthened by a CLIP encoding optimization module. Additionally, a triplet loss function is introduced to improve the model's discriminative capability. Experimental results demonstrate that FDFD-VLM outperforms existing methods on multiple deepfake face datasets and achieves efficient detection with few training samples.
YANG Hongyu, LI Xinghang, CHENG Xiang, HU Ze. Few-Shot Deepfake Face Detection Method Based on Vision-Language Model. Pattern Recognition and Artificial Intelligence, 2025, 38(3): 205-220.
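The abstract's triplet loss operates on CLIP-style embeddings: it pulls an anchor toward a same-class (positive) sample and pushes it away from an other-class (negative) sample by at least a margin. A minimal NumPy sketch of this mechanism is below; the embeddings are random placeholders standing in for CLIP image features, not outputs of the paper's actual model, and the margin value is an illustrative assumption.

```python
import numpy as np

def l2_normalize(x):
    # Normalize feature vectors to unit length, as CLIP does before similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet margin loss on Euclidean distances: the anchor should be
    # closer to the positive (same class) than to the negative (other class)
    # by at least `margin`; violations contribute to the loss.
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy unit-norm embeddings standing in for CLIP image features.
rng = np.random.default_rng(0)
real_a = l2_normalize(rng.normal(size=(4, 8)))                   # anchors ("real")
real_b = l2_normalize(real_a + 0.05 * rng.normal(size=(4, 8)))   # nearby positives
fake = l2_normalize(-real_a + 0.05 * rng.normal(size=(4, 8)))    # distant negatives

loss_easy = triplet_loss(real_a, real_b, fake)  # margin satisfied -> zero loss
loss_hard = triplet_loss(real_a, fake, real_b)  # roles swapped -> large loss
print(loss_easy, loss_hard)
```

In training, minimizing this loss shapes the embedding space so that real and forged faces form separable clusters, which is what makes a nearest-prompt decision effective with few samples.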