Recognizing hand-object interactions is a significant challenge in computer vision because such interactions vary widely in form. Estimating the 3D position of a hand from a single frame is also difficult, especially when the hand occludes the object from the observer's perspective. In this article, we present a novel approach to recognizing objects and facilitating virtual interactions, using a steering wheel as an illustrative example. We propose a real-time solution for identifying hand-object interactions in extended reality (XR) environments. Our approach relies on data captured by a single RGB camera during a manipulation scenario involving a steering wheel. Our model pipeline consists of three key components: (a) a hand landmark detector based on the MediaPipe cross-platform hand tracking solution; (b) a three-spoke steering wheel model tracker implemented using the faster region-based convolutional neural network (Faster R-CNN) architecture; and (c) a gesture recognition module designed to analyze interactions between the hand and the steering wheel. This approach not only offers a realistic experience of interacting with steering-based mechanisms but also contributes to reducing emissions in the real-world environment. Our experimental results demonstrate natural interaction with physical objects in virtual environments and show that the system is precise and stable.
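To make the role of component (c) concrete, the following minimal sketch shows one way the outputs of the first two components could be combined. It is an illustration under stated assumptions, not the method evaluated in this article. It assumes that the hand detector returns 21 landmarks with coordinates normalized to the image size and fingertips at indices 4, 8, 12, 16, and 20 (the MediaPipe Hands convention), and that the wheel tracker returns an axis-aligned bounding box in pixels. The names `Box`, `rim_distance`, and `is_gripping`, the elliptical rim approximation, and the thresholds `tol` and `min_tips` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from math import hypot
from typing import Sequence, Tuple

# Fingertip indices in the 21-point MediaPipe Hands landmark layout.
FINGERTIP_IDS = (4, 8, 12, 16, 20)


@dataclass
class Box:
    """Axis-aligned wheel bounding box in pixels (hypothetical container)."""
    x1: float
    y1: float
    x2: float
    y2: float


def rim_distance(point: Tuple[float, float], box: Box) -> float:
    """Normalized radial offset of a pixel point from the wheel rim.

    The rim is approximated (an assumption) by the ellipse inscribed in the
    bounding box; the result is 0.0 on the rim and 1.0 at the wheel center.
    """
    rx = (box.x2 - box.x1) / 2.0
    ry = (box.y2 - box.y1) / 2.0
    if rx <= 0 or ry <= 0:
        raise ValueError("bounding box must have positive width and height")
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    radial = hypot((point[0] - cx) / rx, (point[1] - cy) / ry)
    return abs(radial - 1.0)


def is_gripping(
    landmarks: Sequence[Tuple[float, float]],
    box: Box,
    image_size: Tuple[int, int],
    tol: float = 0.25,      # hypothetical tolerance around the rim
    min_tips: int = 3,      # hypothetical number of fingertips required
) -> bool:
    """Return True if enough fingertips lie close to the approximated rim."""
    width, height = image_size
    near = 0
    for idx in FINGERTIP_IDS:
        x_norm, y_norm = landmarks[idx]
        pixel = (x_norm * width, y_norm * height)  # normalized -> pixels
        if rim_distance(pixel, box) <= tol:
            near += 1
    return near >= min_tips
```

In this sketch a frame would be labeled as a grip when at least `min_tips` fingertips fall within the tolerance band around the rim; a real system would likely also use depth cues and temporal smoothing, which are omitted here.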