“…At the same time, research is moving towards the processing of multi-modal signals [20], and fusion approaches have been explored in use cases such as speech enhancement [21], emotion recognition [22], tracking of multiple speakers [23], action recognition [24], and scene classification [25]. Because no open, synchronized audio-video dataset for transport applications exists, these techniques have not been tested for violence detection in this environment.…”