The Korean Sign Language Dataset for Action Recognition

Yang, Seung‐Man; Jung, Seungjun; Kang, Heekwang; Kim, Changick

doi:10.1007/978-3-030-37731-1_43

Cited by 12 publications

(14 citation statements)

References 15 publications

“…The NPU RGB+D dataset (Yang et al, 2019) is a unique multi-modal dataset that combines RGB (color) and depth information for sports action analysis across various sports, including basketball and football. Data preparation steps encompassed the synchronization of RGB videos with corresponding depth maps, ensuring temporal alignment.…”

Section: Methods (mentioning)

confidence: 99%

Sports competition tactical analysis model of cross-modal transfer learning intelligent robot based on Swin Transformer and CLIP

Jiang, 2023, Front. Neurorobot.

Introduction: This paper presents an innovative Intelligent Robot Sports Competition Tactical Analysis Model that leverages multimodal perception to tackle the pressing challenge of analyzing opponent tactics in sports competitions. The current landscape of sports competition analysis necessitates a comprehensive understanding of opponent strategies. However, traditional methods are often constrained to a single data source or modality, limiting their ability to capture the intricate details of opponent tactics.

Methods: Our system integrates the Swin Transformer and CLIP models, harnessing cross-modal transfer learning to enable a holistic observation and analysis of opponent tactics. The Swin Transformer is employed to acquire knowledge about opponent action postures and behavioral patterns in basketball or football games, while the CLIP model enhances the system's comprehension of opponent tactical information by establishing semantic associations between images and text. To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias issues, thereby enhancing the model's generalization performance on multimodal data.

Results: Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is effectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights. Our method is rigorously tested and validated using Sport UV, Sports-1M, HMDB51, and NPU RGB+D datasets. Experimental results demonstrate the system's impressive performance in terms of prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity. Notably, the system outperforms other models, with a remarkable 8.47% lower prediction error (MAE) on the Kinetics dataset, accompanied by a 72.86-second reduction in training time.

Discussion: The presented system proves to be highly suitable for real-time sports competition assistance and analysis, offering a novel and effective approach for an Intelligent Robot Sports Competition Tactical Analysis Model that maximizes the potential of multimodal perception technology. By harnessing the synergies between the Swin Transformer and CLIP models, we address the limitations of traditional methods and significantly advance the field of sports competition analysis. This innovative model opens up new avenues for comprehensive tactical analysis in sports, benefiting coaches, athletes, and sports enthusiasts alike.

“…Guo et al introduced a transformer model, CNN meets Transformer (CMT), by incorporating self-attention with CNN layers to efficiently extract multiscale features [20]. Shin et al further optimized CMT and reported 89.00% and accuracy for KSL-77 and for KSL-20 respectively [21], [22].…”

Section: Related Work (mentioning)

confidence: 99%

“…KSL is among the most widely used languages globally, and the KSL-77 and KSL 20 datasets are utilized in the study for evaluation [21], [22]. The KSL-77 dataset, which was collected from 20 individuals and includes 1,229 videos, from which 112,564 frames were extracted at a rate of 30 frames per second [22].…”

Section: A KSL Dataset (mentioning)

confidence: 99%

“…However, fixed patch sizes are the most highlighted challenges in these models that are addressed by the CNN meeting Transformer (CMT) model [20]. Shin et al enhanced CMT to improve the performance accuracy of the KSL, and they reported 89.00% accuracy for KSL-77 and 98.00% for KSL-20 [21], which is 10% high accuracy compared to the previous model [22]. However, among the mentioned SLR systems, Various technologies were employed for specific culture-based SLR systems, e.g., KSL [21]-[23], ASL [1], [13], [21], [24]-[28], BdSL [10], [29], [30], and JSL [7].…”

Section: Introduction (mentioning)

confidence: 99%

Hand Gesture Recognition for Multi-Culture Sign Language Using Graph and General Deep Learning Network

Miah, Hasan, Tomioka et al. 2024, IEEE Open J. Comput. Soc.

Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. The absence of a universal sign language (SL) leads to diverse nationalities having various cultural SLs, such as Korean, American, and Japanese sign language. Existing SLR systems perform well for their cultural SL but may struggle with other or multi-cultural sign languages (McSL). To address these challenges, this paper introduces a novel end-to-end SLR system called GmTC, designed to translate McSL into equivalent text for enhanced understanding. Here, we employed a Graph and General deep-learning network as two stream modules to extract effective features. In the first stream, produce a graph-based feature by taking advantage of the superpixel values and the graph convolutional network (GCN), aiming to extract distance-based complex relationship features among the superpixel. In the second stream, we extracted long-range and short-range dependency features using attention-based contextual information that passes through multi-stage, multi-head self-attention (MHSA), and CNN modules. Combining these features generates final features that feed into the classification module. Extensive experiments with four culture SL datasets with high-performance accuracy compared to existing state-of-the-art models in individual domains affirming superiority and generalizability.

“…This also requires effective feature extraction and classification algorithms for successful operation. To address this issue, some researchers employed a vision-based Korean Sign Language word recognition system using ANN [7], CNN [3], [4], Transformer [7], and Graph Convolutional Network (GCN) [8]. However, all existing vision-based KSL systems are designed exclusively for sign word recognition, and no research work has been found for KSL alphabet recognition.…”

Section: Introduction (mentioning)

confidence: 99%

Korean Sign Language Alphabet Recognition Through the Integration of Handcrafted and Deep Learning-Based Two-Stream Feature Extraction Approach

Shin, Miah, Akiba et al. 2024, IEEE Access

Recognizing sign language plays a crucial role in improving communication accessibility for the Deaf and hard-of-hearing communities. In Korea, many individuals facing hearing and speech challenges depend on Korean Sign Language (KSL) as their primary means of communication. Many researchers have been working to develop a sign language recognition system for other sign languages, but little research has been done for KSL alphabet recognition. However, existing KSL recognition systems have faced significant performance limitations due to the ineffectiveness of the features. To address these issues, we introduce an innovative KSL recognition system employing a strategic fusion approach. In this study, we combined joint skeleton-based handcrafted features and pixel-based resnet101 transfer learning features to overcome the limitations of traditional systems. Our proposed system consists of two distinct streams: the first stream extracts essential handcrafted features, placing emphasis on capturing hand orientation information within KSL gestures. In the second stream, concurrently, we employed a deep learning-based resnet101 module stream to capture hierarchical representations of the KSL alphabet sign. By combining essential information from the first stream with the hierarchical features from the second stream, we generate multiple levels of fused features with the goal of forming a comprehensive representation of KSL gestures. Finally, we fed the concatenated feature into the deep learning-based classification module for the classification. We conducted extensive experiments with the newly created KSL alphabet dataset, the existing KSL digit and the existing ArSL and ASL benchmark datasets. Our proposed model undeniably shows that our fusion approach substantially improves high-performance accuracy in both cases, which proves the system's superiority.

INDEX TERMS: Korean sign language (KSL), hand gesture recognition, geometric feature, distance feature, angle feature, ResNet.
