Hierarchical structure is widely thought to be essential to modeling natural language, in particular its syntax (Everaert et al., 2015). Consequently, many researchers have studied the ability of recurrent neural network models to capture context-free languages (e.g., Kalinke and Lehmann, 1998; Gers and Schmidhuber, 2001; Grüning, 2006; Weiss et al., 2018; Sennhauser and Berwick, 2018; Korsky and Berwick, 2019) and linguistic phenomena involving hierarchical structure (e.g., Linzen et al., 2016; Gulordava et al., 2018). Some experimental evidence suggests that transformers may not be as strong as LSTMs at modeling hierarchical structure (Tran et al., 2018), though analysis studies have shown that transformer-based models encode a substantial amount of syntactic knowledge (e.g., Clark et al., 2019; Lin et al., 2019; Tenney et al., 2019).