“…Given the increased availability of easy-to-deploy audio and video sensors and improvements in computing facilities, recent years have seen significant growth in the number of proposals for multiple-speaker tracking (Multiple-Object Tracking, MOT) in smart spaces that combine audio and video information [17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Qian et al. [17] conducted an extensive review of the state of the art in audiovisual speaker tracking.…”