“…Speaker diarization 1) Offline diarization: The conventional cascaded approach to speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. Oracle SAD is sometimes used in experiments, but the remaining components are being actively studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. Because the cascaded approach is based on unsupervised clustering, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
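
The cascaded pipeline described above can be illustrated with a minimal, dependency-free sketch. It is not the method of any cited work: `energy_sad` is a toy energy-threshold stand-in for SAD, the embeddings are assumed to come from some external speaker-embedding extractor (not implemented here), overlap handling is omitted, and the clustering step is plain average-linkage agglomerative clustering with cosine distance. The function names, the `threshold` defaults, and the toy inputs are all hypothetical. The sketch is meant to show why the speaker count is flexible at inference: the same clustering routine can either stop at a distance threshold or stop at a requested number of speakers.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def energy_sad(frame_energies, threshold):
    """Toy SAD (assumption): return (start, end) frame ranges, end exclusive,
    where the per-frame energy exceeds the threshold."""
    segments, start = [], None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i  # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments


def cluster_embeddings(embeddings, num_speakers=None, threshold=0.5):
    """Average-linkage agglomerative clustering over segment embeddings.

    If num_speakers is given, merging stops once that many clusters remain.
    Otherwise merging stops when the closest pair of clusters is farther
    apart than `threshold`, so the speaker count is decided by the data.
    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise cosine distance between members of the two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        if num_speakers is not None and len(clusters) <= num_speakers:
            break
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if num_speakers is None and dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy inputs (fabricated for illustration only, not real data).
segments = energy_sad([0.0, 0.9, 0.8, 0.1, 0.7, 0.6], threshold=0.5)
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
auto_labels = cluster_embeddings(toy_embeddings)  # count from threshold
forced_labels = cluster_embeddings(toy_embeddings, num_speakers=1)
```

In this sketch, passing `num_speakers` corresponds to the case where the speaker count is known at inference time, while leaving it as `None` lets the distance threshold determine how many speakers are output.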