“…In addition to feed-forward deep neural networks (DNNs) (Hinton et al, 2012), recurrent and convolutional models have been applied to extract d-vectors at the frame level (Variani et al, 2014; Yella and Stolcke, 2015; Heigold et al, 2016; Cyrta et al, 2017; Wang et al, 2018b). To convert a variable-length segment into a fixed-length vector from frame-level d-vectors, temporal pooling functions have been used, such as the mean and standard deviation (Garcia-Romero et al, 2017; Diez et al, 2019; Wang et al, 2018c), attention mechanisms (Chowdhury et al, 2018; Zhu et al, 2018; Sun et al, 2019; Shi et al, 2020), and their combination (Okabe et al, 2018). Pooling inside the network also enables joint training over entire segments.…”
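
The two pooling families mentioned above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the cited systems: the function names `stats_pool` and `attentive_pool`, the scoring vector `w`, and the use of the population standard deviation (dividing by the number of frames) are all assumptions made for illustration. In a real model the attention scores would come from trained parameters rather than a fixed vector.

```python
import math


def stats_pool(frames):
    """Mean and standard-deviation pooling (illustrative sketch).

    frames: list of T frame-level vectors, each a list of D floats (T >= 1).
    Returns a list of length 2*D: per-dimension mean followed by the
    per-dimension population standard deviation (an assumption; some
    systems may use a different normalisation).
    """
    T = len(frames)
    D = len(frames[0])
    mean = [sum(f[d] for f in frames) / T for d in range(D)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / T for d in range(D)]
    return mean + [math.sqrt(v) for v in var]


def attentive_pool(frames, w):
    """Attention pooling as a softmax-weighted mean (illustrative sketch).

    w is a hypothetical scoring vector of length D; each frame's score is
    its dot product with w, and the softmax of the scores gives the weights.
    Returns a list of length D.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    D = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(D)]


# Segments with different numbers of frames map to vectors of equal length.
short_segment = [[1.0, 2.0], [3.0, 4.0]]
long_segment = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled_short = stats_pool(short_segment)
pooled_long = stats_pool(long_segment)
attended = attentive_pool(short_segment, [0.0, 0.0])
```

Both functions return a vector whose length depends only on the frame dimension, not on the number of frames. With an all-zero `w`, the attention weights are uniform and `attentive_pool` reduces to the plain mean.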