Maintaining adequate hydration is important for health, and inadequate liquid intake can lead to dehydration. Despite growing work on liquid intake monitoring, detecting drinking under free-living conditions remains an open challenge. This paper proposes an automatic liquid intake monitoring system that uses wrist-worn Inertial Measurement Units (IMUs) to recognize drinking gestures in free-living environments. We build an end-to-end approach for drinking gesture detection by employing a novel multi-stage temporal convolutional network (MS-TCN). Two datasets were collected for this research: one contains 8.9 hours of data from 13 participants in semi-controlled environments, and the other contains 45.2 hours of data from 7 participants in free-living environments. Leave-One-Subject-Out (LOSO) evaluation shows that the method achieves segmental F1-scores of 0.943 and 0.900 on the semi-controlled and free-living datasets, respectively. The results also indicate that our approach outperforms a combined convolutional neural network and long short-term memory model (CNN-LSTM) on our datasets. The dataset used in this paper is available at https://github.com/Pituohai/drinkinggesture-dataset/.

Clinical relevance: This automatic liquid intake monitoring system can detect drinking gestures in daily life. It has the potential to record how often at-risk elderly people or hospital patients drink water.
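
For readers unfamiliar with the segmental F1-score reported above, the following is a minimal sketch of how such a metric is commonly computed in the temporal segmentation literature. It is not taken from the paper. The function names `to_segments` and `segmental_f1`, the IoU threshold of 0.5, and the use of label 0 as a background (non-drinking) class are all assumptions made for illustration. A predicted segment counts as a true positive when its intersection-over-union with a not-yet-matched ground-truth segment of the same label reaches the threshold.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (label, start, end_exclusive)


def to_segments(labels: List[int], background: int = 0) -> List[Segment]:
    """Collapse a frame-wise label sequence into non-background segments."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != background:
                segments.append((labels[start], start, i))
            start = i
    return segments


def segmental_f1(pred: List[int], truth: List[int],
                 iou_threshold: float = 0.5, background: int = 0) -> float:
    """Segment-level F1 with IoU matching (threshold is an assumed value)."""
    pred_segs = to_segments(pred, background)
    true_segs = to_segments(truth, background)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, p_start, p_end in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (t_label, t_start, t_end) in enumerate(true_segs):
            if t_label != label:
                continue
            inter = max(0, min(p_end, t_end) - max(p_start, t_start))
            # Equals the true union whenever the two segments overlap;
            # otherwise inter is 0 and the IoU is 0 regardless.
            union = max(p_end, t_end) - min(p_start, t_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[best_j]:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)  # ground-truth segments never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: two true drinking segments, one of which is detected.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
pred = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
score = segmental_f1(pred, truth)
```

In the toy example, the single predicted segment overlaps the first true segment with an IoU of 0.6, so it counts as one true positive, and the second true segment is a false negative. Unlike frame-wise accuracy, this metric penalizes missed or fragmented gestures regardless of their duration, which suits event counting such as drinking frequency.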