ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746806
TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context

Cited by 45 publications (20 citation statements)
References 21 publications
“…Subsequently, Whisper generates the transcription, while WhisperX [1] corrects and aligns the timestamps to mitigate diarization errors due to time shifts. Following this, MarbleNet [10] executes VAD and segmentation to filter out silences, and TitaNet [12] extracts speaker embeddings to identify speakers for each segment. The results are then associated with timestamps produced by WhisperX, attributing speakers to each word based on timestamps, which are further adjusted using punctuation models to compensate for minor time shifts. Figure 3 below is a pictorial representation of the audio-extraction pipeline described in this section.…”
Section: Auditory Memory
confidence: 99%
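The citing work describes a transcription-and-diarization pipeline built from Whisper, WhisperX, MarbleNet, and TitaNet. The sketch below illustrates how such a pipeline could be wired together with the openai-whisper and NVIDIA NeMo toolkits; it is a minimal sketch, not the cited authors' implementation. The checkpoint names ("titanet_large") and the input path are assumptions, and the WhisperX alignment, MarbleNet VAD, and punctuation-adjustment steps are only indicated in comments.

```python
# Sketch of a Whisper + TitaNet diarized-transcription pipeline (assumed checkpoints/paths).
import os
import tempfile

import soundfile as sf
import whisper
import nemo.collections.asr as nemo_asr

AUDIO = "meeting.wav"  # hypothetical input file

# 1) Transcription with Whisper. (WhisperX would refine/align these timestamps,
#    and MarbleNet VAD would normally drop silent regions first.)
asr = whisper.load_model("base")
result = asr.transcribe(AUDIO)  # result["segments"]: list of {start, end, text}

# 2) Speaker embeddings with TitaNet for each transcribed segment.
titanet = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

audio, sr = sf.read(AUDIO)
segment_embeddings = []
for seg in result["segments"]:
    start, end = int(seg["start"] * sr), int(seg["end"] * sr)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        sf.write(tmp.name, audio[start:end], sr)
        emb = titanet.get_embedding(tmp.name)  # 192-dim speaker embedding
    os.unlink(tmp.name)
    segment_embeddings.append((seg, emb))

# 3) Cluster the segment embeddings to assign a speaker label per segment,
#    then attribute each word to the speaker of the segment containing its
#    timestamp (the punctuation-based adjustment is omitted here).
```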
“…For each speaker, four samples were considered as references, and only one sample was used for testing. We then used the TitaNet [45] pre-trained model, which was trained on 5 datasets (VoxCeleb [24] and VoxCeleb2 [25], the NIST SRE portion of datasets from 2004-2008 (LDC2009E100), Switchboard-Cellular1 and Switchboard-Cellular2 [46], Fisher [47], and Librispeech [48]). We fed $x^{a_t}_{1:T}$ as a test and $x^{a_0}_{1:T}, \ldots, x^{a_3}_{1:T}$ as references to the TitaNet [45] model.…”
Section: Speaker Recognition
confidence: 99%
“…Then, using the TitaNet [45] pre-trained model, which was trained on 5 datasets (VoxCeleb [24] and VoxCeleb2 [25], the NIST SRE portion of datasets from 2004-2008 (LDC2009E100), Switchboard-Cellular1 and Switchboard-Cellular2 [46], Fisher [47], and Librispeech [48]), we fed $x^{a_t}_{1:T}$ as a test and $x^{a_0}_{1:T}, \ldots, x^{a_3}_{1:T}$ as references to the TitaNet [45] model. The model produces a 192-dimensional embedding for each audio sample, for example $v^{a_t}_{1:T}$.…”
Section: Speaker Recognition
confidence: 99%
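The snippet above describes speaker recognition by comparing one test utterance against four reference utterances per speaker using 192-dimensional TitaNet embeddings. Below is a minimal sketch of that comparison, assuming the NeMo "titanet_large" checkpoint, averaged enrollment embeddings, and cosine similarity as the scoring rule (the cited text does not state which similarity measure or enrollment strategy it uses); all file paths are placeholders.

```python
# Sketch: score a test utterance against per-speaker reference embeddings
# with TitaNet (assumed checkpoint "titanet_large", assumed metric: cosine).
import torch
import torch.nn.functional as F
import nemo.collections.asr as nemo_asr

titanet = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

def embed(path: str) -> torch.Tensor:
    """Return the 192-dim speaker embedding for one audio file."""
    return titanet.get_embedding(path).squeeze()

# Four reference samples per speaker (hypothetical paths).
references = {
    "speaker_A": [f"refs/A_{i}.wav" for i in range(4)],
    "speaker_B": [f"refs/B_{i}.wav" for i in range(4)],
}

# Average the reference embeddings into one enrollment vector per speaker.
enrolled = {
    spk: torch.stack([embed(p) for p in paths]).mean(dim=0)
    for spk, paths in references.items()
}

# Identify the speaker of the test sample by highest cosine similarity.
test_emb = embed("test.wav")
scores = {spk: F.cosine_similarity(test_emb, ref, dim=0).item()
          for spk, ref in enrolled.items()}
predicted = max(scores, key=scores.get)
print(scores, "->", predicted)
```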
“…Next, the transcription was diarised using the available pretrained models [41,42] with the lowest diarisation error rate on our dataset (approx. 1.7%), so that participants in the session could be recognised.…”
Section: Transcription and Diarisation
confidence: 99%
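The diarisation error rate quoted above (about 1.7%) is the standard metric comparing hypothesised speaker segments against a reference annotation. As an illustration only, the sketch below computes DER with the pyannote.metrics package; the segment boundaries and speaker labels are invented for the example and do not come from the cited work.

```python
# Sketch: computing diarization error rate (DER) with pyannote.metrics.
# Segments and labels below are illustrative only.
from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground-truth speaker turns.
reference = Annotation()
reference[Segment(0.0, 10.0)] = "participant_1"
reference[Segment(10.0, 18.0)] = "participant_2"

# Hypothesis produced by the diarisation pipeline.
hypothesis = Annotation()
hypothesis[Segment(0.0, 9.5)] = "spk_0"
hypothesis[Segment(9.5, 18.0)] = "spk_1"

metric = DiarizationErrorRate()
der = metric(reference, hypothesis)  # fraction of audio time mis-attributed
print(f"DER: {der:.1%}")
```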