Abstract-Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.
Abstract-Broadcast News (BN) transcription has been a challenging research area for many years. In the last couple of years the availability of large amounts of roughly transcribed acoustic training data and advanced model training techniques has offered the opportunity to greatly reduce the error rate on this task. This paper describes the design and performance of BN transcription systems which make use of these developments. First the effects of using lightly-supervised training data and advanced acoustic modelling techniques are discussed. The design of a real-time broadcast news recognition system is then detailed using these new models. As system combination has been found to yield large gains in performance, a range of frameworks that allow multiple recognition outputs to be combined are next described. These include the use of multiple types of acoustic models and multiple segmentations. As a contrast a system developed by multiple sites allowing cross-site combination, the "SuperEARS" system, is also described. The various models and recognition configurations are evaluated using several recent BN development and evaluation test sets. These new BN transcription systems can give gains of over 25% relative to the CU-HTK 2003 BN system.
This paper describes recent advances in the CU-HTK Broadcast News English (BN-E) transcription system and its performance in the DARPA/NIST Rich Transcription 2003 Speech-to-Text (RT-03) evaluation. Heteroscedastic linear discriminant analysis (HLDA) and discriminative training, which were previously developed in the context of the recognition of conversational telephone speech, have been successfully applied to the BN-E task for the first time. A number of new features have also been added. These include gender-dependent (GD) discriminative training; and modified discriminative training using lattice re-generation and combination. On the 2003 evaluation set the system gave an overall word error rate of 10.7% in less than 10 times real time (10×RT).
Automatic speaker segmentation and clustering methods have improved considerably over the last few years in the Broadcast News domain. However, these generally still produce locally consistent relative labels (such as spkr1, spkr2) rather than true speaker identities (such as Bill Clinton, Ted Koppel). This paper presents a system which attempts to find these true identities from the text transcription of the audio using lexical pattern matching, and shows the effect on performance when using state-of-the-art speaker clustering and speech-to-text transcription systems instead of manual references.
A cluster-voting scheme is described which takes the output from two speaker diarisation systems and produces a new output which aims to have a lower speaker diarisation error rate (DER) than either input. The scheme works in two stages, firstly producing a set of possible outputs which minimise a distance metric based on the DER and secondly voting between these alternatives to give the final output. Decisions where the inputs agree are always passed to the output and those where the inputs differ are re-evaluated in the final voting stage. Results are presented on the 6-show RT-03 Broadcast News evaluation data, showing the DER can be reduced by 1.64% and 2.56% absolute using this method when combining the best two Cambridge University and the best two MIT Lincoln Laboratory diarisation systems respectively.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.