“…The grid search covers each of the combinations of the following hyperparameter values: the number of CNN feature maps/RNN hidden units (the same amount for both) {96, 256}; the number of recurrent layers {1, 2, 3}; and the number of convolutional layers {1, 2, 3 ,4} with the following frequency max pooling arrangements after each convolutional layer {(4), (2, 2), (4, 2), (8,5), (2, 2, 2), (5, 4, 2), (2, 2, 2, 1), (5, 2, 2, 2)}. Here, the numbers denote the number of frequency bands at each max pooling step; e.g., the configuration (5, 4, 2) pools the original 40 bands to one band in three stages: 40 bands → 8 bands → 2 bands → 1 band.…”