2020
DOI: 10.3390/informatics7030031

Multimodal Hand Gesture Classification for the Human–Car Interaction

Abstract: The recent spread of low-cost and high-quality RGB-D and infrared sensors has supported the development of Natural User Interfaces (NUIs), in which the interaction is carried out without physical devices such as keyboards and mouse. In this paper, we propose a NUI based on dynamic hand gestures, acquired with RGB, depth and infrared sensors. The system is developed for the challenging automotive context, aiming at reducing the driver’s distraction during the driving activity. Specifically, the proposed f…

Cited by 21 publications (16 citation statements). References 45 publications.
“…On the other hand, previous works focused on learned feature extraction methods tend to use deep convolutional neural networks (CNNs) [6,12,26] and 3D-CNNs [10,11,13,27] with a variety of input modalities, such as RGB, depth [6], OF [15], infrared [26], and even surface electromyography signals [28,29] (using armbands to sense the electrical activity of skeletal muscles). Specifically, multi-stream architectures, based on different versions of the same video processed by two or more CNNs in parallel, have been widely employed [9,10,11,12,13,26,27]. The seminal work of Karpathy et al. [30] established the trend of using two-stream architectures for action recognition, originally combining features from low-resolution frames with high-resolution cues from the center of the frame.…”
Section: Related Work
confidence: 99%
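The two-stream idea in the statement above is easy to make concrete. Below is a minimal sketch, assuming PyTorch, of a Karpathy-style context/fovea network: one stream sees the downsampled full frame, the other a high-resolution center crop, and their features are concatenated before classification. The tiny convolutional stack, crop size, and feature widths are illustrative placeholders, not the architecture of any cited paper.

```python
# Minimal two-stream (context + fovea) sketch in the spirit of Karpathy et al.
# All layer sizes here are illustrative placeholders, not the cited design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int, crop: int = 89):
        super().__init__()
        self.crop = crop
        # Two streams with the same structure but separate weights.
        self.context = self._make_stream()  # downsampled full frame
        self.fovea = self._make_stream()    # high-res center crop
        self.classifier = nn.Linear(2 * 128, num_classes)

    @staticmethod
    def _make_stream() -> nn.Module:
        return nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 128), nn.ReLU(),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) full-resolution input
        b, c, h, w = frame.shape
        # Context stream: the whole frame at half resolution.
        ctx = F.interpolate(frame, scale_factor=0.5, mode="bilinear",
                            align_corners=False)
        # Fovea stream: a fixed-size center crop at native resolution.
        top, left = (h - self.crop) // 2, (w - self.crop) // 2
        fov = frame[:, :, top:top + self.crop, left:left + self.crop]
        # Late fusion: concatenate the per-stream features.
        feats = torch.cat([self.context(ctx), self.fovea(fov)], dim=1)
        return self.classifier(feats)
```

In this layout each stream only ever processes a reduced view of the input, which is the original motivation: most of the compute is spent where the resolution (or the subject) actually is.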
“…Recently, Hakim et al. [27] proposed to fuse RGB and depth spatio-temporal features (extracted with 3D-CNNs and LSTM recurrent neural networks) with a Finite State Machine that restricts some gesture flows and limits the recognition classes. D'Eusanio et al. [26], on the other hand, proposed an early-fusion approach of the RGB, depth, and infrared modalities based on a modification of the very deep DenseNet-161 architecture. Concurrently, Kopuklu et al. [13] proposed a hierarchical structure of 3D-CNN architectures to detect and classify continuous hand gestures.…”
Section: Related Work
confidence: 99%
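As a companion to the early-fusion approach attributed to D'Eusanio et al. above, here is a minimal sketch, assuming PyTorch/torchvision: the three modalities are stacked channel-wise and the DenseNet-161 stem is widened to accept the extra channels. The 5-channel layout, the way the first convolution is rebuilt, and the class count are assumptions for illustration, not the authors' code.

```python
# Early fusion of RGB + depth + infrared over a widened DenseNet-161 stem.
# The 5-channel stacking and the rebuilt first conv are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import densenet161

def build_early_fusion_densenet(num_classes: int) -> nn.Module:
    model = densenet161(weights=None)
    # Widen the stem: 3 RGB channels + 1 depth + 1 infrared = 5 input channels.
    old = model.features.conv0
    model.features.conv0 = nn.Conv2d(
        5, old.out_channels, kernel_size=old.kernel_size,
        stride=old.stride, padding=old.padding, bias=False)
    # Replace the classification head with the desired number of gestures.
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

# Usage: stack the spatially aligned modalities along the channel axis.
rgb = torch.randn(2, 3, 224, 224)    # color frames
depth = torch.randn(2, 1, 224, 224)  # depth maps
ir = torch.randn(2, 1, 224, 224)     # infrared frames
logits = build_early_fusion_densenet(num_classes=12)(
    torch.cat([rgb, depth, ir], dim=1))
```

The design choice worth noting is that early fusion pays the backbone cost once for all modalities, whereas the multi-stream (late-fusion) designs cited earlier run one network per modality and fuse features afterwards.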
“…Their increasing popularity has been supported by the spread of inexpensive, but still accurate, active depth sensors and by their ability to operate in dark or low-light conditions, thanks to the presence of an infrared light or laser emitter [10]. For instance, in the automotive scenario [11,12], depth sensors represent an effective solution for running non-invasive, vision-based algorithms such as face verification [13], head pose estimation [14], or gesture recognition [15]. More generally, starting from the first release of the Microsoft Kinect device, depth cameras have enabled new interaction modalities between users and the environment.…”
Section: Introduction
confidence: 99%
“…Scientifically speaking, Human Pose Estimation (HPE) refers to the method of localizing the human body parts (3D pose) or their projection onto a picture plane (2D pose). Video-based HPE has attracted increasing interest in recent years thanks to its wide range of applications, including human-computer interaction [1,2], sports performance analysis [3], and video surveillance [4,5,6]. Although research in this field has advanced, many challenges remain, such as large variations in human body shape, clothing and viewpoint, and the conditions of system acquisition (day and night illumination variations, occlusions, etc.…
Section: Introduction
confidence: 99%
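Since HPE is defined above as localizing 3D body parts or their projection onto the picture plane, a small worked example of that projection may help. This is a minimal sketch assuming a standard pinhole camera model; the intrinsic parameters and joint coordinates are illustrative placeholders.

```python
# Pinhole projection of 3D joints (camera coordinates) onto the picture plane.
# The fx/fy/cx/cy values and the joint positions are illustrative placeholders.
import numpy as np

def project_joints(joints_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-space joints to (N, 2) pixel coordinates."""
    uvw = joints_3d @ K.T            # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

K = np.array([[1000.0, 0.0, 320.0],   # fx,  0, cx
              [0.0, 1000.0, 240.0],   #  0, fy, cy
              [0.0, 0.0, 1.0]])
joints = np.array([[0.1, -0.2, 2.0],  # e.g. a wrist 2 m from the camera
                   [0.0, 0.5, 2.1]])
print(project_joints(joints, K))      # the 2D pose induced by the 3D pose
```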