Reconstructing 3D face shapes and expressions from a single 2D image remains challenging, in part because existing methods lack detailed modeling of human facial movements, such as the correlations between different parts of the face. Facial action units (AUs), a detailed taxonomy of human facial movements based on the observed activation of muscles or muscle groups, can be used to model a wide variety of facial expressions. We present a novel 3D face reconstruction framework, AU feature-based 3D FAce Reconstruction using Transformer (AUFART), which generates a 3D face model responsive to AU activation from a single monocular 2D image, thereby capturing expressions. AUFART leverages AU-specific features together with global facial features, combined via transformers, to achieve accurate 3D reconstruction of facial expressions. We also introduce a loss function that drives learning toward minimal discrepancy in AU activations between the input image and the rendered reconstruction. The proposed framework achieves an average F1 score of 0.39, outperforming state-of-the-art methods.
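
As a sketch of the AU-consistency idea (notation ours, not taken from the paper): given a pretrained AU activation detector $D(\cdot)$, an input image $I$, and a differentiably rendered reconstruction $\hat{I}$, such a loss could take the form

```latex
% Hypothetical sketch of an AU-consistency loss; the symbols D, I, and \hat{I}
% are assumptions for illustration, not the paper's notation.
% D(.)      : pretrained AU activation detector
% I         : input 2D image
% \hat{I}   : differentiably rendered 3D reconstruction
\mathcal{L}_{\mathrm{AU}} = \bigl\lVert D(I) - D(\hat{I}) \bigr\rVert_2^2
```

Minimizing this term encourages the rendered face to reproduce the same AU activations detected in the input, which is one plausible reading of the stated objective.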