“…Such algorithms decompose a joint prediction task into a sequence of action predictions, such as predicting the label of the next word in sequence labeling or predicting a shift/reduce action in dependency parsing 3 ; these predictions are tied by features and/or internal state. Algorithms in this family have recently met with success in neural networks (Bengio et al, 2015;Wiseman and Rush, 2016), though date back to models typically based on linear policies (Collins and Roark, 2004;Daumé III and Marcu, 2005;Xu et al, 2007;Daumé III et al, 2009;Ross et al, 2010;Ross and Bagnell, 2014;Doppa et al, 2014;Chang et al, 2015).…”