In this paper, we present a method to detect hand-object interaction from an egocentric perspective. In contrast to large-scale, data-driven, discriminator-based methods such as [24], we propose a novel workflow that exploits hand and object cues. Specifically, we train networks that predict hand pose, hand mask, and in-hand object mask, and combine their outputs to jointly predict the hand-object interaction status. We compare our method with the most recent work from Shan et al. [24] on selected images from the EPIC-KITCHENS [4] dataset and achieve 89% accuracy on HOI (hand-object interaction) detection, which is comparable to Shan's (92%). In terms of runtime, on the same machine our method runs at over 30 FPS, which is far more efficient than Shan's (1 ∼ 2 FPS). Furthermore, our approach can segment unscripted activities by extracting frames based on the detected HOI status. We achieve F1 scores of 68.2% and 82.8% on the GTEA [7] and UTGrasp [1] datasets respectively, both of which are comparable to state-of-the-art methods.
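To make the cue-fusion idea concrete, the sketch below shows one plausible structure, assuming a shared backbone with three prediction heads (hand pose heatmaps, hand mask, in-hand object mask) whose outputs are pooled into a binary HOI-status classifier. All module names, layer sizes, and the fusion scheme are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (assumed structure, not the authors' released code): a shared
# backbone with three cue heads whose outputs are fused for HOI-status
# classification.
import torch
import torch.nn as nn


class HOIStatusNet(nn.Module):
    def __init__(self, num_keypoints=21):
        super().__init__()
        # Shared convolutional backbone (illustrative sizes only).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Cue heads: 2D hand keypoint heatmaps, hand mask, in-hand object mask.
        self.pose_head = nn.Conv2d(64, num_keypoints, 1)
        self.hand_mask_head = nn.Conv2d(64, 1, 1)
        self.obj_mask_head = nn.Conv2d(64, 1, 1)
        # Fusion classifier: pools the concatenated cue maps and predicts
        # interaction vs. no interaction.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(num_keypoints + 2, 2),
        )

    def forward(self, x):
        feat = self.backbone(x)
        pose = self.pose_head(feat)
        hand_mask = self.hand_mask_head(feat)
        obj_mask = self.obj_mask_head(feat)
        # Concatenate the three cue maps along the channel dimension and classify.
        cues = torch.cat([pose, hand_mask, obj_mask], dim=1)
        hoi_logits = self.classifier(cues)
        return pose, hand_mask, obj_mask, hoi_logits


# Usage: one egocentric frame -> cue maps plus a binary HOI status prediction.
frame = torch.randn(1, 3, 128, 128)
pose, hand_mask, obj_mask, hoi_logits = HOIStatusNet()(frame)
print(hoi_logits.shape)  # torch.Size([1, 2])
```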