“…To remove speaker information from the SSL model outputs, some techniques use an information bottleneck such as quantization (Polyak et al., 2021; Lakhotia et al., 2021; Gu et al., 2021). Alternatively, several researchers have proposed training strategies that use information perturbation to eliminate speaker information without quantization (Qian et al., 2022; Choi et al., 2021; 2023; Hussain et al., 2023). Notably, for training synthesizers, NANSY (Choi et al., 2021) and NANSY++ (Choi et al., 2023) heuristically perturb the voice of a given utterance with hand-engineered data augmentations before obtaining the output from the SSL model.…”
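
As a rough illustration of the quantization bottleneck mentioned above, the sketch below maps each continuous feature frame to the index of its nearest codebook vector, so small frame-level variation (which may include speaker-dependent detail) is discarded. The toy 2-D features, the fixed codebook, and the helper names `squared_distance` and `quantize` are assumptions for illustration only and are not taken from the cited works; in practice the codebook might be learned, for example with k-means over SSL features.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each feature frame with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy data: 2-D stand-ins for SSL feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.05, 0.0]]

# Each frame collapses to a discrete unit index.
units = quantize(frames, codebook)
print(units)
```

Two frames that differ only slightly, such as `[0.2, 0.1]` and `[-0.1, 0.05]`, receive the same unit index, which is the sense in which the discrete code acts as a bottleneck.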