“…In the early days, most works focused on using RGB (or gray-scale) videos as inputs for HAR [5], owing to their popularity in daily life. Recent years have witnessed the emergence of works using other data modalities for HAR [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], including skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, and WiFi. This is mainly thanks to the development of accurate and affordable sensors (such as Kinect) and to the distinct advantages that different data modalities offer for HAR in various application scenarios.…”