“…Finally, our approach relates to other work that proposes ways of incorporating structural information into Transformer-based models. This includes using dependency or tree structure to constrain self-attention patterns (Strubell et al., 2018; Wang et al., 2019), guiding cross-attention (Chen et al., 2018; Astudillo et al., 2020), modelling syntactic distance (Du et al., 2020), using syntactic information to guide the computation flow in the model (Shen et al., 2021), or distilling syntactic knowledge (Kuncoro et al., 2020). Our structured masking in the parsing-as-language-modelling approach is close in spirit to methods that modify the attention mechanism according to syntactic connections (Astudillo et al., 2020); this work, however, primarily aims to study the impact of structural guidance on syntactic generalization.…”