Acoustic hand tracking is an emerging technology for the next generation of human-computer interaction (HCI). This approach uses the speakers and microphones embedded in commercial devices to transmit and receive acoustic signals simultaneously; the echo is then processed to obtain the hand's position. However, existing tracking approaches do not support multi-stroke input, so the recovered trajectory cannot be recognized by models trained on simple character-image datasets such as MNIST and EMNIST. In this paper, we propose V-Pen, which estimates the status of the hand from the energy information in the echo. V-Pen then uses Zadoff-Chu (ZC) sequences to obtain the hand's initial position and tracks the hand continuously from the phase change of the echo, yielding a smooth trajectory. When writing characters, V-Pen allows the user to input multiple strokes, removing the redundant inter-stroke trajectory that degrades recognition. Experimental results show that V-Pen achieves an average tracking error of 4.3 mm and 94.8% recognition accuracy over 52 English letters, 10 digits, and 20 Chinese characters.
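For background, ZC sequences are attractive for ranging because they have constant amplitude and ideal periodic autocorrelation, which makes the echo's arrival time stand out sharply. The sketch below (a generic illustration, not the paper's implementation; the function name and the chosen root and length are assumptions) generates a root-u ZC sequence and checks both properties numerically:

```python
import numpy as np

def zadoff_chu(u, N):
    """Root-u Zadoff-Chu sequence of odd length N, with gcd(u, N) = 1."""
    n = np.arange(N)
    return np.exp(-1j * np.pi * u * n * (n + 1) / N)

# Illustrative parameters: length 63 with root 29 (coprime with 63).
z = zadoff_chu(29, 63)

# Constant amplitude: every sample has unit magnitude.
print(np.allclose(np.abs(z), 1.0))

# Ideal periodic autocorrelation: computed via FFT, it is N at lag 0
# and (numerically) zero at every nonzero circular lag.
acf = np.fft.ifft(np.fft.fft(z) * np.conj(np.fft.fft(z)))
print(np.allclose(np.abs(acf[1:]), 0.0, atol=1e-9))
```

Because of this autocorrelation property, cross-correlating the received echo with the transmitted ZC sequence yields a sharp peak whose position gives the initial propagation delay, after which finer-grained phase tracking can take over.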