A common scenario in experimentation is the need to synchronize events, such as breaks between visual stimuli, with the video record of participants undertaking a task. In our case, we recently synchronized a protocol of stimulus presentations shown on a laptop display with webcam video of participants' (two-year-old children) facial and eye movements as they were shown trials containing moving dots (a random dot kinematogram, or RDK). The purpose was to assess eye movements in response to these RDK stimuli as part of a potential neurological assessment for children. The video contained audio signals, such as "beeps" and musical interludes, that indicated the start and end of trials, thereby providing a convenient opportunity to align these audio events with the timing of known events in the video record.
The process of alignment can be performed manually, but this is a tedious and time-consuming task when considering, for example, large databases of videos. In this paper, we tested two alternative methods for synchronizing known audio events: 1) a deep-learning-based model, and 2) a standard template-matching algorithm. These methods were used to synchronize the known protocol of stimulus events in videos by processing the audio content of the recording. The deep learning approach utilized simple mel-spectrum audio feature extraction, whilst the template-matching approach used a cross-correlation algorithm to detect an audio template in the time domain. We found that cross-correlation was not effective as a means of beep detection, whereas our machine-learning-based technique was robust, achieving 90% accuracy on the testing dataset, and did not require the same amount of remediation as the correlation approach.
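To illustrate the kind of feature extraction the deep learning approach relies on, the sketch below computes log-scaled mel spectrograms from a recording and slices them into fixed-width windows suitable for a beep/no-beep classifier. This is a minimal sketch, not the paper's implementation: the file name, the librosa parameters (n_fft, hop_length, n_mels), and the 0.5 s window width are illustrative assumptions.

```python
# Minimal sketch: mel-spectrogram feature extraction for frame-wise
# audio event classification. All parameters here are illustrative.
import librosa
import numpy as np

# Load the session audio at its native sample rate, mixed down to mono.
audio, fs = librosa.load("session_audio.wav", sr=None, mono=True)  # hypothetical file

# Log-scaled mel spectrogram: one feature column per analysis frame.
mel = librosa.feature.melspectrogram(
    y=audio, sr=fs, n_fft=2048, hop_length=512, n_mels=64
)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Slice into ~0.5 s windows; each window would be one classifier input.
frames_per_window = int(0.5 * fs / 512)
windows = [
    log_mel[:, i : i + frames_per_window]
    for i in range(0, log_mel.shape[1] - frames_per_window, frames_per_window)
]
print(len(windows), windows[0].shape)  # (n_mels, frames_per_window) per window
```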
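For comparison, a time-domain template-matching baseline of the kind described above can be sketched with normalized cross-correlation against a known beep clip. Again this is a hedged sketch rather than the evaluated implementation: the file names, the 0.6 detection threshold, and the minimum one-second spacing between detections are assumptions for illustration.

```python
# Minimal sketch: detect beep onsets by normalized cross-correlation
# of the recording against a known beep template in the time domain.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, find_peaks

fs, audio = wavfile.read("session_audio.wav")    # hypothetical recording
_, template = wavfile.read("beep_template.wav")  # hypothetical beep clip
audio = audio.astype(np.float64)
template = template.astype(np.float64)
if audio.ndim > 1:                # mix stereo down to mono
    audio = audio.mean(axis=1)

# Raw cross-correlation of the recording with the template.
corr = correlate(audio, template, mode="valid")

# Normalize by local signal energy so overall loudness changes
# do not dominate; yields values in roughly [-1, 1].
window = np.ones(len(template))
local_energy = np.sqrt(correlate(audio ** 2, window, mode="valid"))
ncc = corr / (local_energy * np.linalg.norm(template) + 1e-12)

# Candidate beep onsets: correlation peaks above an illustrative
# threshold, at least one second apart.
peaks, _ = find_peaks(ncc, height=0.6, distance=fs)
onset_times = peaks / fs  # onsets in seconds
print(onset_times)
```

In practice, as reported above, we found this correlation-based detection unreliable for our recordings, which motivated the mel-spectrogram-based learning approach.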