Deepfake Speech Detection Through Emotion Recognition: A Semantic Approach

Conti, Emanuele; Salvi, Davide; Borrelli, Clara; Hosler, Brian; Bestagini, Paolo; Antonacci, Fabio; Sarti, Augusto; Tubaro, Stefano

doi:10.1109/icassp43922.2022.9747186

Cited by 29 publications

(16 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Works on how to include emotional cues in deepfake detection are starting to be Columns display the number of original recordings by the performer (OR), the number of performers of deepfakes on the original recordings of the current performer (PF), and number of fake recordings made on recordings by the performer (FRP). en, the number of photograms with emotions in common with performed [15][16][17], which indicates that emotional cues in video recordings can be used effectively for deepfake detection.…”

Section: Discussionmentioning

confidence: 99%

“…Consequently, the effect of bad neutral photograms in fakes, mainly in initial photograms, may suggest that emotion recognition is prone to work worse than in their corresponding original recordings. Works on how to include emotional cues in deepfake detection are starting to be performed [ 15 – 17 ], which indicates that emotional cues in video recordings can be used effectively for deepfake detection.…”

Section: Discussionmentioning

confidence: 99%

“…Literature on deepfakes and emotions is scarce. To the authors' knowledge, the only works that specifically tackle audio-visual deepfake detection using affective cues are [ 15 , 16 ], and [ 17 ]. Moreover, we could not find specific studies on how well deepfakes express emotions.…”

Section: Introductionmentioning

confidence: 99%

“…To the authors' knowledge, [ 15 , 16 ], and [ 17 ] are the only studies that exclusively address audio-visual deepfake detection using emotive cues. To detect falsifications or manipulations in the input video, the approach provided in [ 15 ] concurrently utilizes the audio (speech) and video (facial) modalities, as well as perceived emotion components collected from both modalities.…”

Section: Introductionmentioning

confidence: 99%

“…They used both modalities to detect similarity (or dissimilarity) between modality signals, and they discovered that perceived emotional information aids in detecting deepfake content. [ 16 , 17 ] offered a technique for detecting deepfakes based on semantic consistency in emotion, which was based on previous emotion identification work that extracted emotions over time from a subject's speech and faces separately. Synthesized voices or faces are then detected by analyzing these emotional signals.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Do Deepfakes Adequately Display Emotions? A Study on Deepfake Facial Emotion Expression

López-Gil

Gil

Garcı́a

2022

Computational Intelligence and Neuroscience

View full text Add to dashboard Cite

Recent technological advancements in Artificial Intelligence make it easy to create deepfakes and hyper-realistic videos, in which images and video clips are processed to create fake videos that appear authentic. Many of them are based on swapping faces without the consent of the person whose appearance and voice are used. As emotions are inherent in human communication, studying how deepfakes transfer emotional expressions from original to fakes is relevant. In this work, we conduct an in-depth study on facial emotional expression in deepfakes using a well-known face swap-based deepfake database. Firstly, we extracted the photograms from their videos. Then, we analyzed the emotional expression in the original and faked versions of video recordings for all performers in the database. Results show that emotional expressions are not adequately transferred between original recordings and the deepfakes created from them. High variability in emotions and performers detected between original and fake recordings indicates that performer emotion expressiveness should be considered for better deepfake generation or detection.

show abstract

Section: Discussionmentioning

confidence: 99%

Section: Discussionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Do Deepfakes Adequately Display Emotions? A Study on Deepfake Facial Emotion Expression

López-Gil

Gil

Garcı́a

2022

Computational Intelligence and Neuroscience

View full text Add to dashboard Cite

show abstract

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

Attorresi

Salvi

Borrelli

et al. 2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection

et al. 2023

Self Cite

View full text Add to dashboard Cite

With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material has become increasingly simple. Current technology enables the creation of videos where both the visual and audio contents are falsified. While the multimedia forensics community has begun to address this threat by developing fake media detectors. However, the vast majority existing forensic techniques only analyze one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and visual tracks). One important reason for the lack of multimodal detectors is a similar lack of research datasets containing multimodal forgeries. Existing datasets typically contain only one falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic audio with no associated video. Currently, datasets are needed that can be used to develop, train, and test these forensic algorithms. In this paper, we propose a new audio-visual deepfake dataset containing multimodal video forgeries. We present a general pipeline for synthesizing deepfake speech content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.

show abstract

Deepfake Speech Detection Through Emotion Recognition: A Semantic Approach

Cited by 29 publications

References 24 publications

Do Deepfakes Adequately Display Emotions? A Study on Deepfake Facial Emotion Expression

Do Deepfakes Adequately Display Emotions? A Study on Deepfake Facial Emotion Expression

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection

Contact Info

Product

Resources

About