“…As touched upon earlier, audio-driven deepfakes can be categorised by whether or not they are generated using an audio-driven structural representation of the face. Numerous approaches of the former kind have been proposed over the years: some [2,7,10,16,19,31,39,56,66,68,74,75,79,91] generate a set of 2D facial landmark co-ordinates from audio, while others [8,15,32,37,52,62,63,69,76,77,83,84,85,87] predict expression parameters from audio to drive a 3D face model. What these approaches have in common is that they use the intermediate structural representation as input to a separate neural rendering model, which is typically trained on an image-to-image translation task to generate the final photorealistic image frame.…”
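
The two-stage design described in the passage can be illustrated with a minimal sketch. It is not taken from any of the cited works: every function name, the four-landmark mouth layout, and the energy-based and blending placeholders are assumptions standing in for the learned networks. It is meant only to show the data flow from audio to a structural representation, and from there to a separate renderer.

```python
from typing import List, Tuple

Landmarks = List[Tuple[float, float]]
Image = List[List[float]]


def audio_to_landmarks(audio_features: List[float]) -> Landmarks:
    """Stage 1 (hypothetical placeholder for a learned audio-to-landmark model).

    Maps one frame of audio features to 2D mouth landmarks in [0, 1] x [0, 1].
    Here the mean absolute feature value simply controls mouth opening.
    """
    energy = sum(abs(x) for x in audio_features) / max(len(audio_features), 1)
    mouth_open = min(energy, 1.0)
    return [
        (0.35, 0.7),                      # left mouth corner
        (0.65, 0.7),                      # right mouth corner
        (0.5, 0.7 - 0.1 * mouth_open),    # upper lip
        (0.5, 0.7 + 0.1 * mouth_open),    # lower lip
    ]


def rasterize(landmarks: Landmarks, size: int = 16) -> Image:
    """Draw the landmarks into a size x size conditioning image."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in landmarks:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        image[row][col] = 1.0
    return image


def render(conditioning: Image, reference: Image) -> Image:
    """Stage 2 (hypothetical placeholder for an image-to-image renderer).

    A real system would use a trained translation network; this stand-in
    just averages the conditioning image with a reference frame.
    """
    return [
        [0.5 * c + 0.5 * r for c, r in zip(c_row, r_row)]
        for c_row, r_row in zip(conditioning, reference)
    ]


def generate_frame(audio_features: List[float], reference: Image,
                   size: int = 16) -> Image:
    """Full pipeline: audio -> structural representation -> rendered frame."""
    landmarks = audio_to_landmarks(audio_features)
    conditioning = rasterize(landmarks, size)
    return render(conditioning, reference)


if __name__ == "__main__":
    blank = [[0.0] * 16 for _ in range(16)]
    frame = generate_frame([1.0] * 8, blank)
    print(len(frame), len(frame[0]))
```

The point of the separation is that the renderer never sees the audio: it is conditioned only on the intermediate representation, so either stage could in principle be swapped out independently. Approaches that drive a 3D face model would replace the landmark list with predicted expression parameters.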