“…for downstream tasks. In speech representation learning (Latif et al., 2020), unsupervised techniques such as autoregressive modeling (Chung, Hsu, Tang and Glass, 2019; Chung and Glass, 2020a,b) and self-supervised modeling (Milde and Biemann, 2018; Tagliasacchi, Gfeller, Quitry and Roblek, 2019; Pascual, Ravanelli, Serrà, Bonafonte and Bengio, 2019) exploit temporal context to extract speech representations. In our prior behavior modeling work, an unsupervised representation learning framework was proposed (Li, Baucom and Georgiou, 2017), which demonstrated the promise of learning behavior representations under the behavior stationarity hypothesis: nearby segments of speech share the same behavioral context.…”
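As a rough illustration of how the stationarity hypothesis can be turned into a training signal, the PyTorch sketch below treats temporally adjacent segments as positive pairs and distant segments as negatives under a triplet loss. The `SegmentEncoder`, its dimensions, and the sampling scheme are illustrative assumptions for this sketch, not the architecture or objective of Li, Baucom and Georgiou (2017).

```python
import torch
import torch.nn as nn

# Hypothetical encoder: maps a segment of frame-level speech features
# (e.g., 40-dim MFCCs) to a fixed-dimensional behavior embedding.
class SegmentEncoder(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)           # h: (1, batch, emb_dim)
        return h.squeeze(0)          # (batch, emb_dim)

encoder = SegmentEncoder()
triplet = nn.TripletMarginLoss(margin=1.0)

# Toy batch standing in for real feature extraction. Under the
# stationarity assumption, a segment adjacent to the anchor serves as a
# positive, and a segment drawn from a distant part of the recording
# (or another session) serves as a negative.
anchor   = torch.randn(8, 100, 40)   # anchor segments
positive = torch.randn(8, 100, 40)   # temporally adjacent segments
negative = torch.randn(8, 100, 40)   # distant segments

# Pull embeddings of nearby segments together, push distant ones apart.
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```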