Many sign languages are bona fide natural languages with their own grammars and lexicons, and can therefore benefit from machine translation methods. Because sign language is a visual-spatial language, it can likewise benefit from computer vision methods for encoding its inputs. With the advent of deep learning in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning), and researchers have begun extending these methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality in which meaning is often derived from context. The focus of this article is therefore to examine various deep learning-based methods for encoding sign language as input, and to analyze the efficacy of several machine translation methods over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, and vector representations/embeddings from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches as well as more sophisticated networks using attention, reinforcement learning, and the transformer model. We apply the translation methods to multiple sign languages: German (GSL), American (ASL), and Chinese (CSL). In our analysis, the transformer model combined with ResNet50 input embeddings or pose-based landmark features outperformed all the other sequence-to-sequence models, achieving higher BLEU-2 to BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the less controlled ASL and CSL datasets.
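To make the best-performing combination concrete, the following is a minimal, illustrative sketch (written in PyTorch, which is assumed here; it is not the authors' implementation) of feeding per-frame ResNet50 embeddings into a standard encoder-decoder transformer that emits spoken-language tokens. Hyperparameters such as vocab_size, d_model, and max_len are placeholder assumptions.

```python
# Illustrative sketch only: ResNet50 frame embeddings -> transformer -> text tokens.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SignTranslationSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=512, nhead=8, num_layers=3, max_len=512):
        super().__init__()
        backbone = resnet50(weights=None)           # pretrained weights could be loaded instead
        backbone.fc = nn.Identity()                 # expose the 2048-d pooled embedding per frame
        self.backbone = backbone
        self.proj = nn.Linear(2048, d_model)        # project frame features to the model dimension
        self.pos = nn.Embedding(max_len, d_model)   # learned positional encoding
        self.tok = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frames, tgt_tokens):
        # frames: (B, T, 3, H, W) video clip; tgt_tokens: (B, L) shifted target text tokens
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(B, T, -1)            # (B, T, 2048)
        src = self.proj(feats) + self.pos(torch.arange(T, device=frames.device))
        tgt = self.tok(tgt_tokens) + self.pos(torch.arange(tgt_tokens.size(1), device=frames.device))
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1)).to(frames.device)
        hid = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hid)                        # (B, L, vocab_size) next-token logits

model = SignTranslationSketch()
logits = model(torch.randn(2, 16, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Pose-based landmark features (body and finger joints, facial points) could be substituted for the ResNet50 embeddings by replacing the backbone with a linear projection of the per-frame keypoint vector.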
One common computer vision task is to track an object as it moves from frame to frame within a video sequence. There are myriad applications for this capability, and the underlying technologies to achieve such tracking are well understood. More recently, deep convolutional neural networks have been employed not only to track but also to classify objects as they are tracked from frame to frame. These models can be used in a paradigm known as tracking by detection and can achieve very high tracking accuracy. Their major drawback is the large number of mathematical operations that must be performed for each inference, which reduces the number of frames tracked per second. For edge applications residing on size-, weight-, and power-limited platforms such as unmanned aerial vehicles, high-frame-rate, low-latency, real-time tracking can be an elusive target. To work within the limited power and computational resources of an edge compute device, various optimizations have been applied to trade off tracking speed, accuracy, power, and latency. Previous works on motion-based interpolation with neural networks either do not take into account the latency accrued from camera image capture to tracking result, or they compensate for this latency but are bottlenecked by the motion interpolation operation instead. The algorithm presented in this work retains the speedup of previous motion-based neural network inference approaches and also performs a novel look-back operation that is less cumbersome than competing motion interpolation methods.

INDEX TERMS CNN, classifier, detector, neural network, low latency, tracker, UAV, YOLO, look back, drone, image processing.
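As a rough illustration of the idea (not the paper's implementation), the sketch below compensates for capture-to-result latency by running detection on an older frame and shifting the resulting box forward using a global motion estimate between that frame and the newest one. The detect callback (e.g. a wrapper around a YOLO detector returning an (x, y, w, h) box) and the fixed infer_latency_frames value are hypothetical assumptions.

```python
import numpy as np

def estimate_motion(prev_gray, curr_gray):
    """Global translation (dx, dy) from prev_gray to curr_gray via phase
    correlation; a crude stand-in for optical flow or ego-motion sensing."""
    f = np.fft.fft2(curr_gray) * np.conj(np.fft.fft2(prev_gray))
    corr = np.fft.ifft2(f / (np.abs(f) + 1e-9)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    dx = dx - w if dx > w // 2 else dx   # unwrap shifts that alias as large positives
    dy = dy - h if dy > h // 2 else dy
    return int(dx), int(dy)

def look_back_track(frames, detect, infer_latency_frames=2):
    """Return one box per frame, aligned to the newest frame even though the
    detector's result corresponds to a frame captured infer_latency_frames
    earlier (modeling capture-to-result latency)."""
    boxes = []
    for t in range(len(frames)):
        lag_t = max(0, t - infer_latency_frames)             # frame the detector actually processed
        x, y, bw, bh = detect(frames[lag_t])                  # hypothetical detector callback
        dx, dy = estimate_motion(frames[lag_t], frames[t])    # motion accrued since that capture
        boxes.append((x + dx, y + dy, bw, bh))                # shift the stale box to the current frame
    return boxes
```

In practice the detector and the motion estimate would run concurrently so that the box correction adds little latency of its own, which is the bottleneck the abstract attributes to earlier motion interpolation methods.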