Many VR-based medical applications have been developed to help patients whose mobility has decreased because of accidents, diseases, or other injuries carry out physical treatment efficiently. VR-based applications are considered an effective aid for individual physical treatment because of their low-cost equipment, their flexibility in time and space, and the reduced need for assistance from a physical therapist. A challenge in developing VR-based physical treatment is understanding body-part movement accurately and quickly. We propose a robust pipeline for understanding hand motion accurately. We retrieved our data from movement sensors such as the HTC Vive and Leap Motion. Given a sequence of palm positions, we represent our data as binary 2D images of the gesture shape. Our dataset consists of 14 kinds of hand gestures recommended by a physiotherapist. Taking as input 33 3D points mapped into a binary image, we trained our proposed density-based CNN. The CNN model is designed around the characteristics of this input: many 'blank block pixels', shapes of 'single-pixel thickness', and a binary image format. With pyramid kernel sizes applied in the feature-extraction part and a classification layer trained with a softmax loss, the model achieved 97.7% accuracy.
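The mapping from a sequence of palm positions to a binary gesture image can be illustrated with a minimal sketch. It is not the authors' implementation: the 32 x 32 image size, the projection onto the x-y plane, the uniform scaling, and the use of Bresenham lines to join consecutive points are all assumptions chosen for illustration, and the names `bresenham` and `rasterize` are hypothetical. The sketch does reproduce the input characteristics named above: a mostly blank binary image containing a stroke one pixel thick.

```python
from typing import Iterator, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def bresenham(x0: int, y0: int, x1: int, y1: int) -> Iterator[Tuple[int, int]]:
    """Yield the integer pixels on the line from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        yield x0, y0
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def rasterize(points: Sequence[Point3D], size: int = 32) -> List[List[int]]:
    """Map a palm trajectory to a size x size binary image.

    Assumptions (not from the paper): the z coordinate is dropped, and
    both axes share one scale so the gesture keeps its aspect ratio.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    # Guard against a zero span when all points coincide.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 1) / span

    pixels = [
        (round((x - min_x) * scale), round((y - min_y) * scale))
        for x, y in zip(xs, ys)
    ]

    image = [[0] * size for _ in range(size)]
    for x, y in pixels:
        image[y][x] = 1
    # Join consecutive samples so the stroke is connected and one pixel thick.
    for (x0, y0), (x1, y1) in zip(pixels, pixels[1:]):
        for x, y in bresenham(x0, y0, x1, y1):
            image[y][x] = 1
    return image
```

Calling `rasterize` on a list of 33 `(x, y, z)` tuples returns a nested list of 0s and 1s that could be fed to an image classifier after conversion to the array type of whichever framework is used.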