Properly wearing a face mask has become an effective way to limit COVID-19 transmission. In this work, we aim to detect the fine-grained wearing state of face masks: face without mask, face with wrong mask, and face with correct mask. This task poses two main challenges: 1) the absence of practical datasets, and 2) large intra-class distance and small inter-class distance. To address the first challenge, we introduce a new practical dataset covering various conditions, which contains 8635 faces with different wearing states. To address the second, we propose a novel detection framework for mask-wearing conditions, named Context-Attention R-CNN, which reduces the intra-class distance and enlarges the inter-class distance by extracting distinguishing features. Specifically, we first extract multiple context features for region proposals and use an attention module to weight these context features at the channel and spatial levels. We then decouple the classification and localization branches to extract more appropriate features for each of the two tasks. Experiments show that Context-Attention R-CNN achieves 84.1% mAP on our proposed dataset, outperforming Faster R-CNN by 6.8 points. Moreover, Context-Attention R-CNN also outperforms some state-of-the-art single-stage detectors.
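
As an illustration only, the weighting of context features at the channel and spatial levels could be sketched as follows. The abstract does not specify the attention design, so this NumPy sketch substitutes a generic squeeze-and-gate form; every function name, weight matrix, and tensor shape below is a hypothetical choice for the example and not the paper's implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Return one weight per channel for a (C, H, W) feature map.

    Assumed form: global average pooling over H and W, followed by a
    two-layer bottleneck (ReLU, then sigmoid). w1: (C_r, C), w2: (C, C_r).
    """
    squeezed = feat.mean(axis=(1, 2))        # (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # (C_r,)
    return sigmoid(w2 @ hidden)              # (C,)


def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """Return one weight per spatial location for a (C, H, W) feature map.

    Assumed form: mean and max pooling across channels, mixed by two
    scalar weights and passed through a sigmoid.
    """
    avg_map = feat.mean(axis=0)              # (H, W)
    max_map = feat.max(axis=0)               # (H, W)
    return sigmoid(w_avg * avg_map + w_max * max_map)


def context_attention(context_feats, w1, w2):
    """Concatenate context features of one proposal and reweight them."""
    feat = np.concatenate(context_feats, axis=0)           # (C_total, H, W)
    feat = feat * channel_attention(feat, w1, w2)[:, None, None]
    return feat * spatial_attention(feat)[None, :, :]


# Hypothetical example: a 4-channel proposal feature plus a 4-channel
# surrounding-context feature, both pooled to 7x7.
rng = np.random.default_rng(0)
roi_feat = rng.standard_normal((4, 7, 7))
ctx_feat = rng.standard_normal((4, 7, 7))
w1 = rng.standard_normal((2, 8))   # bottleneck: 8 channels -> 2
w2 = rng.standard_normal((8, 2))   # expansion: 2 -> 8 channels
weighted = context_attention([roi_feat, ctx_feat], w1, w2)
```

Under the decoupling described above, the weighted feature would then be passed to separate classification and localization branches; those branches are omitted here because the abstract gives no detail on their structure.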