“…A number of distinct input modalities have been employed to assist egocentric action recognition. These include depth [7,151]; egocentric cues comprising hand [66,129,134] and object regions [66,68,164], head motion [134], and gaze-based saliency maps [129,134]; sensor-based modalities [101,114,139]; and sound [18,74,167]. Typically, these methods require specialized sensors, such as depth cameras, eye trackers, accelerometers, or inertial measurement units, to obtain the additional inputs, whereas sound is captured by the camera's built-in microphone.…”