“…Different types of acoustic features have been used for this task: spectral features [1,2], posterior features (posterior probability vector for phone or phone-like units) [3,4] as well as bottleneck features [5,6]. The matching likelihood is generally obtained by computing a frame-level similarity matrix between the query and each audio document using the corresponding feature vectors and employing a dynamic time warping (DTW) [4,5] or convolutional neural network (CNN) based matching technique [7]. Several variants of DTW have been used: Segmental DTW [1,3], Slope-constrained DTW [8], Sub-sequence DTW [9], Subspace-regularized DTW [10,11], etc.…”

The research is funded by the Swiss NSF project 'PHASER-QUAD', grant agreement number 200020-169398.
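
The matching step described in the excerpt (a frame-level matrix between query and document, followed by DTW) can be illustrated with a minimal sketch. It is a generic textbook form of sub-sequence DTW, not the specific method of any cited work. Several choices are assumptions made only for illustration: cosine distance as the frame-level measure, the three-way step pattern, normalization by query length, and the function names `cosine_distance`, `distance_matrix`, and `subsequence_dtw`.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two feature vectors (0.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an all-zero frame is treated as orthogonal
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(query, document):
    """Frame-level distance matrix: rows are query frames, columns are document frames."""
    return [[cosine_distance(q, d) for d in document] for q in query]


def subsequence_dtw(dist):
    """Align the whole query to the best-matching region of the document.

    Returns (cost, end_frame): the accumulated distance divided by the query
    length (an illustrative normalization) and the document frame index at
    which the best alignment ends.
    """
    n, m = len(dist), len(dist[0])
    acc = [[math.inf] * m for _ in range(n)]

    # Free start: the first query frame may align with any document frame.
    for j in range(m):
        acc[0][j] = dist[0][j]

    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + dist[i][0]
        for j in range(1, m):
            acc[i][j] = dist[i][j] + min(
                acc[i - 1][j],      # query advances, document frame repeats
                acc[i][j - 1],      # document advances, query frame repeats
                acc[i - 1][j - 1],  # both advance
            )

    # Free end: the last query frame may align with any document frame.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end] / n, end


# Toy example: the two query frames occur verbatim at document frames 1 and 2.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cost, end = subsequence_dtw(distance_matrix(query, document))
print(cost, end)  # 0.0 2
```

A lower cost indicates a better match, so a detection score could be taken as the negated cost. With posterior features, other frame-level measures are possible, and the choice here is not taken from the excerpt.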