“…However, while a number of deep learning algorithms have been successfully shown to solve numerous end-stage problems like prediction and classification (Glorot, Bordes & Bengio, 2011; LeCun, Bengio & Hinton, 2015; Abadi et al., 2016), very few attempts have been made to use them for solving the intermediate problems of data pre-processing (Kotsiantis, Kanellopoulos & Pintelas, 2006; García, Luengo & Herrera, 2015), cleaning (Kotsiantis, Kanellopoulos & Pintelas, 2006; García, Luengo & Herrera, 2015), and restoration (Efron, 1994; Lakshminarayan et al., 1996), even though from a machine learning perspective these end-stage and intermediate problems can be very similar. Long Short-Term Memory (LSTM) networks have previously been proposed as a solution to these intermediate problems (Zhou & Huang, 2017; Sucholutsky et al., 2019), but they suffer from major bottlenecks, such as requiring large numbers of sequential operations that cannot be parallelized. Recently, the Transformer (Vaswani et al., 2017), a novel encoder-decoder model that relies heavily on attention mechanisms (Luong, Pham & Manning, 2015), was proposed as a replacement for encoder-decoder models that use LSTM or convolutional layers, and was shown to achieve state-of-the-art translation results with orders of magnitude fewer parameters than existing models.…”
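To make the parallelization contrast concrete, the sketch below shows scaled dot-product attention, the core operation of the Transformer (Vaswani et al., 2017). It is a minimal NumPy illustration, not the cited authors' implementation; the function name and toy dimensions are chosen here for clarity. Because every position attends to every other position through a single matrix product, the whole sequence is processed at once, whereas an LSTM must step through positions sequentially.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (illustrative sketch).

    Q, K: (seq_len, d_k) query and key matrices; V: (seq_len, d_v) values.
    All positions are handled in one matrix product, so the computation
    parallelizes across the sequence, unlike an LSTM's recurrence.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

# Toy self-attention example: 4 positions, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```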