Action classification datasets include Kinetics, a video dataset for human action classification (Kay et al, 2017); ActivityNet, a video dataset for action classification and temporal localization (Caba Heilbron et al, 2015); and AVA, a dataset of spatio-temporally localized atomic visual actions (Gu et al, 2018). Multi-modal AI datasets include AVA-ActiveSpeaker, an audio-visual dataset for speaker detection (Roth et al, 2019); the VGG lip reading dataset, an audio-visual dataset for speech recognition and separation; MOSI, a multimodal corpus of sentiment intensity (Zadeh et al, 2016, 2017); and OpenFace, a multi-modal face recognition toolkit (Baltrušaitis et al, 2016). The two major advantages of EgoCom are its egocentricity and its inclusion of each participant's synchronized audio and video, which, as we show, simplifies multi-speaker applications.