The recently proposed End-to-End Neural Speaker Diarization (EEND) framework handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multi-speaker diarization scenarios, they come at the cost of a long training process that requires considerable memory and computational power. In this work, we explore the integration of efficient Transformer variants into the Self-Attentive EEND with Encoder-Decoder-based Attractors (SA-EEND EDA) architecture. Since SA-EEND EDA is built on Transformers, its training cost is driven by the quadratic time and memory complexity of the self-attention mechanism. We verify that using a linear attention mechanism in SA-EEND EDA decreases GPU memory usage by 22%. We conduct experiments to measure how the increased efficiency of the training process translates into the two-speaker diarization error rate (DER) on CALLHOME, quantifying the impact of increasing the batch size, model size, or sequence length on training time and diarization performance. In addition, we propose an architecture combining linear and softmax attention that achieves a 12% training speed-up with a small relative DER degradation of 2%, while using the same GPU memory as the softmax attention baseline.
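To make the complexity contrast concrete, the sketch below compares standard softmax attention, which materializes a T x T score matrix, with a linearized variant that is linear in the sequence length. It is a minimal illustration, not the implementation evaluated in the paper; in particular, the elu(x)+1 feature map (from Katharopoulos et al.'s linear Transformer) and the tensor shapes are assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' code): softmax vs. linearized attention.
# Shapes: q, k, v are (batch B, time T, dim D).
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Quadratic in T: the full T x T score matrix is materialized.
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (B, T, T)
    return torch.matmul(F.softmax(scores, dim=-1), v)          # (B, T, D)

def linear_attention(q, k, v, eps=1e-6):
    # Linear in T: associativity lets us form phi(K)^T V, a D x D matrix,
    # instead of the T x T score matrix. Feature map phi(x) = elu(x) + 1 (assumed).
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1                  # positive features
    kv = torch.einsum('btd,bte->bde', phi_k, v)                # (B, D, D)
    z = 1 / (torch.einsum('btd,bd->bt', phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum('btd,bde,bt->bte', phi_q, kv, z)       # (B, T, D)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 500, 256)
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

The memory saving reported in the abstract comes from avoiding the (B, T, T) score tensor; whether the softmax and linearized outputs differ in quality is exactly what the CALLHOME DER comparison measures.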