Hand hygiene is critical for reducing the spread of viruses and diseases. In recent years, it has been recognized worldwide as one of the most effective measures against the COVID-19 outbreak. The World Health Organization (WHO) has suggested a 12-step guideline for hand rubbing. Given the importance of this guideline, several studies have used computer vision to measure compliance with it. However, almost all of them process single images as input; this paper refers to such approaches as baseline models. This study proposes a sequence model that processes sequences of consecutive images as input. The model combines an Inception-ResNet architecture for spatial feature extraction with an LSTM for capturing temporal information. After training on a comprehensive dataset, the model achieved an accuracy of 98.99% on the test set. Compared to the best baseline models, the proposed sequence model is about 1% better in accuracy and about 4% better in confidence, though it is 3 times slower at inference. Furthermore, this study demonstrates that accuracy alone is not necessarily adequate for comparing different models and optimizing their hyperparameters. Accordingly, the Feature-Based Confidence Metric was used to provide a more informative comparison between the proposed sequence model and the best baseline model, and to optimize the sequence model's hyperparameters.
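
The difference between single-image input and sequence input can be illustrated with a minimal sketch. It assumes that consecutive frames are grouped with a fixed-length sliding window; the helper name `make_sequences`, the window length, and the stride are hypothetical choices for illustration and are not taken from the study.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_sequences(frames: Sequence[T], seq_len: int, stride: int = 1) -> List[List[T]]:
    """Group consecutive frames into fixed-length windows (hypothetical helper)."""
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    windows: List[List[T]] = []
    # Slide over the frames; a trailing window that would be incomplete is dropped.
    for start in range(0, len(frames) - seq_len + 1, stride):
        windows.append(list(frames[start:start + seq_len]))
    return windows


# Frame indices stand in for images here (assumed window length of 4).
frames = list(range(6))
sequences = make_sequences(frames, seq_len=4, stride=1)
print(sequences)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
```

In a pipeline of the kind described above, each window would presumably be passed frame by frame through the spatial feature extractor, and the resulting feature sequence would then be fed to the LSTM, whereas a baseline model would classify each frame on its own.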