The rampant use of forgery techniques poses a significant threat to the security of celebrities' identities. Although current deepfake detection methods are effective on specific public face forgery datasets, their reliability diminishes when applied to open-world data. Moreover, because these methods rely mainly on pixel-level abnormalities in forged faces, they are susceptible to re-compression.
In this study, we present a novel approach to detecting face forgery by leveraging individual speaking patterns in facial expressions and head movements. Our method exploits latent motion patterns and inter-frame variations to effectively differentiate fake videos from real ones. We propose an end-to-end dual-branch detection network, named the spatial-temporal transformer (STT), which aims to safeguard the identity of a person-of-interest (POI) from deepfake manipulation. The STT incorporates a spatial transformer (ST) to model the relationship between facial expressions and head movements, while a temporal transformer (TT) exploits inconsistencies in facial attribute changes over time. Additionally, we introduce a central compression loss to further enhance detection performance.
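To make the dual-branch design concrete, the following is a minimal sketch of how such an architecture could be organized: one transformer branch attends across facial attributes within each frame (spatial), the other attends across frames for each attribute (temporal), and their pooled features are fused for real/fake classification. The class `DualBranchSTT`, the feature shapes, and the center-style `central_compression_loss` regularizer are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DualBranchSTT(nn.Module):
    """Illustrative dual-branch spatial-temporal transformer (hypothetical layout).

    Input: per-frame facial attribute features, e.g. expression and head-pose
    descriptors, shaped (batch, frames, attributes, feat_dim).
    """
    def __init__(self, feat_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                       batch_first=True),
            num_layers=n_layers)
        self.spatial_tf = make_encoder()   # attends across attributes within a frame
        self.temporal_tf = make_encoder()  # attends across frames per attribute
        self.head = nn.Linear(2 * feat_dim, 2)  # real / fake logits

    def forward(self, x):
        b, t, a, d = x.shape
        # Spatial branch: relate expression and head-movement attributes per frame.
        s = self.spatial_tf(x.reshape(b * t, a, d)).mean(dim=1)
        s = s.reshape(b, t, d).mean(dim=1)
        # Temporal branch: capture inter-frame inconsistencies for each attribute.
        m = self.temporal_tf(x.permute(0, 2, 1, 3).reshape(b * a, t, d)).mean(dim=1)
        m = m.reshape(b, a, d).mean(dim=1)
        # Fuse both branches for binary classification.
        return self.head(torch.cat([s, m], dim=-1))

def central_compression_loss(features, center):
    """Hypothetical center-style regularizer: pulls genuine-POI embeddings toward
    a learned class center so real speaking patterns form a compact cluster."""
    return ((features - center) ** 2).sum(dim=-1).mean()

# Usage with random tensors standing in for extracted facial attribute features.
model = DualBranchSTT()
clips = torch.randn(8, 100, 20, 64)  # (batch, frames, attributes, feat_dim)
logits = model(clips)                # (8, 2)
```

Because the branches operate on attribute-level motion features rather than raw pixels, a detector of this form would depend less on compression-sensitive pixel artifacts, which is consistent with the robustness claim below.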
Extensive experiments are conducted to evaluate the effectiveness of the STT, and the results demonstrate its superiority over state-of-the-art (SOTA) methods in detecting forged videos of POIs. Furthermore, our network exhibits resilience to pixel-level re-compression perturbations, making it a robust solution in the face of evolving forgery techniques.