This paper focuses on achieving effective human-computer interaction using only a webcam, by continuously locating, tracking, and recognizing the hand region. We detect the region of interest (ROI) in the captured image and classify hand gestures for specific tasks. First, background subtraction is performed against the main frame captured by the webcam, preprocessing is applied, and YCrCb skin segmentation is then carried out on the background-subtracted RGB image. The ROI is detected using a Haar cascade classifier for palm detection. Next, the kernelized correlation filters (KCF) tracking algorithm is used to track the ROI while limiting the influence of noise and background, and the median-flow tracking algorithm is used for depth tracking. The ROI is converted to a binary (black-and-white) image and resized to 54 × 54. Gesture recognition is then performed by feeding the preprocessed ROI into a 2D convolutional neural network (CNN). Two predictions are made, one from the skin-segmented frame and one from the dilated frame, and the gesture is recognized from the maximum value of those two predictions. Tracking and recognition continue for as long as the ROI is present in the frames. In validation, the proposed system achieved a recognition rate of 98.44%, which makes it suitable for practical, real-time applications.

Keywords Human-computer interaction • Hand gesture detection • Hand tracking • Convolutional neural network

This article is part of the topical collection "Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications" guest edited by Bhanu Prakash K N and M. Shivakumar.
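
As an illustration of two steps named in the abstract, the sketch below shows a per-pixel YCrCb skin test and the selection of a gesture from the maximum of two prediction vectors. It is a minimal pure-Python sketch, not the authors' implementation. The Cr/Cb threshold ranges are commonly used values assumed here for illustration, the function names `rgb_to_ycrcb`, `is_skin`, and `fuse_predictions` are hypothetical, and reading "maximum value of those two predictions" as "the single highest class score across both vectors" is an assumption.

```python
from typing import Sequence, Tuple

# Assumed skin ranges for the Cr and Cb channels (commonly used values,
# NOT taken from the paper).
CR_RANGE = (133.0, 173.0)
CB_RANGE = (77.0, 127.0)


def rgb_to_ycrcb(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert one 8-bit RGB pixel to YCrCb (BT.601 luma weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return y, cr, cb


def is_skin(r: float, g: float, b: float) -> bool:
    """Return True if the pixel's Cr and Cb fall inside the assumed skin ranges."""
    _, cr, cb = rgb_to_ycrcb(r, g, b)
    return (CR_RANGE[0] <= cr <= CR_RANGE[1]
            and CB_RANGE[0] <= cb <= CB_RANGE[1])


def fuse_predictions(scores_skin: Sequence[float],
                     scores_dilated: Sequence[float]) -> Tuple[int, float]:
    """Pick the class whose score is highest across both prediction vectors.

    scores_skin:    class scores predicted from the skin-segmented frame
    scores_dilated: class scores predicted from the dilated frame
    Returns (class_index, score). On a tie the first vector wins.
    """
    candidates = list(enumerate(scores_skin)) + list(enumerate(scores_dilated))
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical scores for three gesture classes.
    from_skin = [0.10, 0.70, 0.20]
    from_dilated = [0.05, 0.15, 0.80]
    label, score = fuse_predictions(from_skin, from_dilated)
    print(f"gesture class {label} with score {score}")
    print("skin-like pixel:", is_skin(200, 150, 120))
```

In a real pipeline the skin test would be applied to a whole frame with an image library rather than pixel by pixel, and the two score vectors would come from the CNN's outputs for the two preprocessed 54 × 54 inputs.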