In virtual teaching scenarios, head-mounted display (HMD) interactions typically rely on traditional controllers and UI widgets, which are poorly suited to teaching scenarios that require hand training. Existing improvements in this area have primarily focused on replacing controllers with gesture recognition. However, gesture recognition alone can be limiting in certain scenarios, such as complex operations or multitasking environments. This study designed and tested an interaction method that combines simple gestures with voice assistance, aiming to offer a more intuitive user experience and to enrich related research. A speech classification model was developed that is activated by a fist-clenching gesture and recognises specific Chinese voice commands to open various UI interfaces, which are then controlled by pointing gestures. Virtual scenarios were constructed in Unity, with hand tracking provided by the HTC OpenXR SDK; hand rendering and gesture recognition were implemented within Unity, and UI interaction was handled by the Unity XR Interaction Toolkit. The interaction method was detailed and exemplified using a teacher training simulation system, including sample code. An empirical test with 20 participants then compared the gesture-plus-voice operation with traditional controller operation, both quantitatively and qualitatively. The results suggest that, although there is no significant difference in task completion time between the two methods, the combined gesture-and-voice method received positive user-experience feedback, indicating a promising direction for such interaction methods. Future work could add more gestures and expand the model training dataset to realise additional interactive functions and meet diverse virtual teaching needs.
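To illustrate the gesture-triggered activation described above, the following is a minimal Unity C# sketch of a fist-clench detector. It assumes Unity's XR Hands package (UnityEngine.XR.Hands) rather than the HTC OpenXR SDK's own hand-tracking API used in the study; the UnityEvent it raises (onFistClenched) is a hypothetical hook where the speech classification model would be started, and the distance threshold is illustrative.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Events;
using UnityEngine.XR.Hands;

// Minimal sketch: detect a fist-clench on the right hand by checking that
// all four fingertips are close to the palm, then fire an event once on the
// rising edge of the gesture (e.g. to start listening for voice commands).
public class FistClenchTrigger : MonoBehaviour
{
    // Hypothetical hook; in the study this would start the speech classifier.
    public UnityEvent onFistClenched;

    // Fingertips closer to the palm than this distance (metres) count as curled.
    [SerializeField] float clenchThreshold = 0.07f;

    XRHandSubsystem handSubsystem;
    bool wasClenched;

    void Update()
    {
        // Lazily acquire the running hand-tracking subsystem.
        if (handSubsystem == null)
        {
            var subsystems = new List<XRHandSubsystem>();
            SubsystemManager.GetSubsystems(subsystems);
            if (subsystems.Count == 0) return;
            handSubsystem = subsystems[0];
        }

        bool clenched = IsFistClenched(handSubsystem.rightHand);

        // Trigger only when the hand transitions from open to clenched.
        if (clenched && !wasClenched)
            onFistClenched.Invoke();
        wasClenched = clenched;
    }

    bool IsFistClenched(XRHand hand)
    {
        if (!hand.isTracked ||
            !hand.GetJoint(XRHandJointID.Palm).TryGetPose(out Pose palm))
            return false;

        XRHandJointID[] tips =
        {
            XRHandJointID.IndexTip, XRHandJointID.MiddleTip,
            XRHandJointID.RingTip, XRHandJointID.LittleTip
        };
        foreach (var tip in tips)
        {
            if (!hand.GetJoint(tip).TryGetPose(out Pose tipPose) ||
                Vector3.Distance(tipPose.position, palm.position) > clenchThreshold)
                return false;
        }
        return true;
    }
}
```

In the study's pipeline, the clench would start the Chinese voice-command classifier, after which pointing gestures handled by the XR Interaction Toolkit drive the opened UI; the threshold-based detector above is only one straightforward way to approximate the activating gesture.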