Speech applications dealing with conversations require not only recognizing the spoken words but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems: an automatic speech recognition (ASR) system and a speaker diarization (SD) system. The two systems are trained independently with different objective functions. Often, SD systems operate directly on the acoustics and are not constrained to respect word boundaries; this deficiency is overcome in an ad hoc manner. Motivated by recent advances in sequence-to-sequence learning, we propose a novel approach that tackles the two tasks jointly with a single recurrent neural network transducer. Our approach utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which use only acoustic cues. We evaluated the performance of our approach on a large corpus of medical conversations between physicians and patients. Compared to a competitive conventional baseline, our approach improves the word-level diarization error rate from 15.8% to 2.2%.
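To make the joint formulation concrete, here is a minimal Python sketch of how a two-speaker conversation can be serialized into a single RNN-T target sequence with interleaved speaker-role tokens, together with a simplified word-level diarization error rate over pre-aligned words. The token names and the alignment assumption are illustrative choices, not taken from the paper.

```python
# Hypothetical speaker-role tokens emitted before each speaker turn; the
# actual token inventory depends on the training setup.
ROLE_TOKENS = {"doctor": "<spk:dr>", "patient": "<spk:pt>"}

def serialize(turns):
    """Turn a list of (speaker, words) turns into a single RNN-T target
    sequence with speaker-role tokens interleaved into the transcript."""
    target = []
    for speaker, words in turns:
        target.append(ROLE_TOKENS[speaker])
        target.extend(words)
    return target

def wder(ref_words, hyp_words):
    """Simplified word-level diarization error rate: the fraction of aligned
    words whose hypothesized speaker differs from the reference speaker.
    ref_words/hyp_words are lists of (word, speaker) pairs, assumed aligned."""
    wrong = sum(r_spk != h_spk
                for (_, r_spk), (_, h_spk) in zip(ref_words, hyp_words))
    return wrong / len(ref_words)

turns = [("doctor", ["how", "are", "you"]), ("patient", ["fine", "thanks"])]
print(serialize(turns))
# ['<spk:dr>', 'how', 'are', 'you', '<spk:pt>', 'fine', 'thanks']
```

Because the speaker tokens are ordinary symbols in the output vocabulary, the transducer can condition them on the recognized words as well as the audio, which is how linguistic cues enter the diarization decision.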
In this paper, we present a scalable and exact solution for probabilistic linear discriminant analysis (PLDA). PLDA is a probabilistic model that has been shown to provide state-of-the-art performance for both face and speaker recognition. However, it has one major drawback: at training time, estimating the latent variables requires the inversion and storage of a matrix whose size grows quadratically with the number of samples per identity (class). To date, two approaches have been taken to deal with this problem: 1) use an exact solution that calculates this large matrix and is obviously not scalable with the number of samples, or 2) derive a variational approximation to the problem. We present a scalable derivation that is theoretically equivalent to the previous non-scalable solution and thus obviates the need for a variational approximation. Experimentally, we demonstrate the efficacy of our approach in two ways. First, on Labeled Faces in the Wild (LFW), we illustrate the equivalence of our scalable implementation with previously published work. Second, on the large Multi-PIE database, we illustrate the gain in performance when using more training samples per identity (class), which is made possible by the proposed scalable formulation of PLDA.
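The scaling issue, and the sense in which a scalable derivation can be exact, can be illustrated with a toy PLDA simplified to keep only the between-class subspace (the within-class subspace is dropped for brevity; the naming below is ours, not the paper's). The naive posterior over the identity factor shared by n samples builds an nd-by-nd covariance, while the scalable form accumulates only per-sample statistics; both yield identical results:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 4                      # feature dim, latent dim, samples per identity

mu = rng.normal(size=d)
F = rng.normal(size=(d, q))            # between-class subspace
Sigma = np.diag(rng.uniform(0.5, 1.5, size=d))   # noise covariance
X = rng.normal(size=(n, d))            # n observations of one identity

# Naive formulation: stack all n samples into one nd-dimensional Gaussian.
F_big = np.kron(np.ones((n, 1)), F)              # (nd, q)
Sigma_big = np.kron(np.eye(n), Sigma)            # (nd, nd): grows quadratically in n
resid = (X - mu).reshape(-1)
A_naive = np.eye(q) + F_big.T @ np.linalg.solve(Sigma_big, F_big)
mean_naive = np.linalg.solve(A_naive, F_big.T @ np.linalg.solve(Sigma_big, resid))

# Scalable formulation: only fixed-size, per-sample statistics are accumulated.
FtSi = F.T @ np.linalg.inv(Sigma)                # (q, d), independent of n
A_fast = np.eye(q) + n * (FtSi @ F)
mean_fast = np.linalg.solve(A_fast, FtSi @ (X - mu).sum(axis=0))

assert np.allclose(A_naive, A_fast)
assert np.allclose(mean_naive, mean_fast)
```

The full PLDA model adds a per-sample within-class latent, which is what makes the joint latent vector, and hence the matrix to invert, grow with the number of samples; the toy above only conveys why an algebraically equivalent reformulation can avoid materializing the large matrix.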
This paper examines session variability modelling for face authentication using Gaussian mixture models. Session variability modelling aims to explicitly model and suppress detrimental within-class (inter-session) variation. We examine two techniques for doing this: inter-session variability modelling (ISV) and joint factor analysis (JFA), both initially developed for speaker authentication. We present a self-contained description of these two techniques and demonstrate that they can be successfully applied to face authentication. In particular, we show that using ISV leads to significant error rate reductions of, on average, 26% on the challenging and publicly available databases SCface, BANCA, MOBIO, and Multi-PIE. Finally, we show that a limitation of both ISV and JFA for face authentication is that the session variability model captures and suppresses a significant portion of between-class variation.
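In the GMM mean-supervector formulation standard in the speaker-recognition literature from which ISV and JFA originate, the two models decompose the supervector of client $i$ in session $j$ as follows (the notation is the conventional one, not copied from this paper):

```latex
\text{ISV:} \quad \mu_{ij} = m + U x_{ij} + D z_i
\qquad\qquad
\text{JFA:} \quad \mu_{ij} = m + U x_{ij} + V y_i + D z_i
```

Here $m$ is the universal background model (UBM) mean supervector, $U$ spans the session-variability subspace with per-session factor $x_{ij}$, $V$ spans the between-class (identity) subspace with identity factor $y_i$, and $D$ is a diagonal matrix scaling the residual client offset $z_i$. Authentication suppresses session variability by estimating and discarding the $U x_{ij}$ term; the limitation noted above arises when that term also absorbs variation that actually distinguishes between clients.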
One important type of biometric authentication is face recognition, a research area of high popularity with a wide spectrum of approaches that have been proposed in the last few decades. The majority of existing approaches are conceived for or evaluated on constrained still images. However, more recently, research interest has shifted towards unconstrained "in-the-wild" still images and videos. To some extent, current state-of-the-art systems are able to cope with variability due to pose, illumination, expression, and size, which represent the challenges in unconstrained face recognition. To date, only a few attempts have addressed the problem of face recognition in mobile environments, where high degradation is present during both data acquisition and transmission. This book chapter deals with face recognition in mobile and other challenging environments, where both still images and video sequences are examined. We provide an experimental study of one commercial off-the-shelf and four recent open-source face recognition algorithms, including color-based linear discriminant analysis, local Gabor binary pattern histogram sequences, Gabor grid graphs, and inter-session variability modeling. Experiments are performed on several freely available challenging still image and video face databases, including one mobile database, always following the evaluation protocols that are attached to the databases. Finally, we supply an easily extensible open-source toolbox to re-run all the experiments, which includes the modeling techniques, the evaluation protocols, and the metrics used in the experiments, and provides a detailed description of how to regenerate the results.
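Of the techniques named above, local Gabor binary pattern histogram sequences (LGBPHS) is compact enough to sketch. The following is a minimal, illustrative Python implementation using scikit-image; the filter bank size, block grid, and histogram-intersection scoring are common choices for this family of methods, not the chapter's exact configuration:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import gabor

def lgbphs(image, frequencies=(0.1, 0.2, 0.3), n_thetas=4, grid=(4, 4)):
    """Illustrative LGBPHS: Gabor-filter the image at several frequencies and
    orientations, LBP-code each magnitude response, and concatenate per-block
    histograms into one feature vector."""
    hists = []
    h, w = image.shape
    for f in frequencies:
        for t in range(n_thetas):
            real, imag = gabor(image, frequency=f, theta=t * np.pi / n_thetas)
            mag = np.hypot(real, imag)             # Gabor magnitude response
            # 59-bin non-rotation-invariant uniform LBP codes (values 0..58)
            codes = local_binary_pattern(mag, P=8, R=1, method="nri_uniform")
            for by in range(grid[0]):              # block-wise histograms
                for bx in range(grid[1]):
                    block = codes[by * h // grid[0]:(by + 1) * h // grid[0],
                                  bx * w // grid[1]:(bx + 1) * w // grid[1]]
                    hists.append(np.histogram(block, bins=59, range=(0, 59))[0])
    return np.concatenate(hists)

def histogram_intersection(a, b):
    """Similarity between two LGBPHS feature vectors (larger is more similar)."""
    return np.minimum(a, b).sum()

img = np.random.rand(64, 64)                       # stand-in for a face crop
print(lgbphs(img).shape)                           # (3 * 4 * 16 * 59,) = (11328,)
```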
This paper examines the issue of face, speaker and bi-modal authentication in mobile environments when there is significant condition mismatch. We introduce this mismatch by enrolling client models on high-quality biometric samples obtained on a laptop computer and authenticating them on lower-quality biometric samples acquired with a mobile phone. To perform these experiments, we develop three novel authentication protocols for the large publicly available MOBIO database. We evaluate state-of-the-art face, speaker and bi-modal authentication techniques and show that inter-session variability modelling using Gaussian mixture models provides a consistently robust system for face, speaker and bi-modal authentication. It is also shown that multi-algorithm fusion provides a consistent performance improvement for face, speaker and bi-modal authentication. Using this bi-modal multi-algorithm system, we derive a state-of-the-art authentication system that obtains a half total error rate of 6.3% and 1.9% for Female and Male trials, respectively.
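As a concrete illustration of the reported metric and of score-level fusion, here is a short Python sketch. The fusion weights and scores are hypothetical placeholders (in practice the weights would be trained, for example by logistic regression on a development set), and the decision threshold is typically fixed on the development set before the half total error rate is computed on the evaluation set:

```python
import numpy as np

def hter(genuine, impostor, threshold):
    """Half total error rate: the mean of the false acceptance rate (FAR,
    impostor scores at or above the threshold) and the false rejection rate
    (FRR, genuine scores below the threshold)."""
    far = np.mean(np.asarray(impostor) >= threshold)
    frr = np.mean(np.asarray(genuine) < threshold)
    return 0.5 * (far + frr)

# Hypothetical per-trial scores from two face and two speaker algorithms.
w = np.array([0.8, 0.6, 1.2, 0.9])    # fusion weights (placeholder values)
b = -0.5                              # bias (placeholder)

def fuse(scores):
    """Linear score-level fusion of multiple algorithms for one trial."""
    return np.asarray(scores) @ w + b

genuine = [fuse([2.1, 1.8, 2.5, 2.0]), fuse([1.9, 1.5, 2.2, 1.7])]
impostor = [fuse([0.2, 0.4, 0.1, 0.3]), fuse([0.5, 0.3, 0.6, 0.2])]
print(hter(genuine, impostor, threshold=1.0))      # 0.0 for these toy scores
```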