Hand gestures are useful in many human-computer interaction applications. Our objective is to track the movement of the hand irrespective of its shape, size, and color. To this end, we propose a motion template guided by optical flow (OFMT), a compact representation that encodes the motion information of a gesture into a single image. Deep networks have recently shown marked improvements over conventional techniques based on hand-crafted features, and using multiple streams with informative input data has been observed to further increase recognition performance. This work therefore proposes a two-stream fusion model for hand gesture recognition. The network consists of two streams: a 3D convolutional neural network (C3D) that takes gesture videos as input and a 2D CNN that takes OFMT images as input. C3D is effective at capturing the spatiotemporal information of a video, whereas OFMT helps to eliminate irrelevant gestures by providing additional motion information. Although each stream can work independently, the two are combined with a fusion scheme to boost recognition results. We demonstrate the effectiveness of the proposed two-stream network on two databases.
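The abstract does not specify the fusion scheme, so the following is only an illustrative sketch of one common option: score-level (late) fusion, in which the class probabilities of the two streams are averaged with a weight. The function names `softmax` and `fuse_scores`, the `weight` parameter, and the use of a weighted average are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fuse_scores(c3d_logits, ofmt_logits, weight=0.5):
    """Hypothetical late fusion of two streams.

    c3d_logits:  (batch, classes) scores from the video (C3D) stream.
    ofmt_logits: (batch, classes) scores from the OFMT (2D CNN) stream.
    weight:      assumed mixing weight for the video stream, in [0, 1].
    Returns the fused class probabilities and the predicted class indices.
    """
    p_video = softmax(np.asarray(c3d_logits, dtype=float))
    p_ofmt = softmax(np.asarray(ofmt_logits, dtype=float))
    # Weighted average of per-class probabilities (an assumption; the
    # source only says the streams are "combined with a fusion scheme").
    fused = weight * p_video + (1.0 - weight) * p_ofmt
    return fused, fused.argmax(axis=-1)
```

With `weight=1.0` the prediction reduces to the video stream alone and with `weight=0.0` to the OFMT stream alone, which mirrors the statement that each stream can work independently. In practice such a weight would be chosen on validation data.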