This thesis describes research conducted on speaker recognition in real conditions such as meeting rooms, telephone-quality speech, and radio and TV broadcast news. The main objective is the automatic detection and classification of speakers in a smart-room scenario. Acoustic speaker recognition is the use of a machine to identify an individual from a spoken utterance. It aims at processing acoustic signals to convert them into symbolic descriptions corresponding to the identities of the speakers. Over the last several years, speaker recognition in real situations has attracted substantial research attention, becoming one of the spoken language technologies that add quality, or enrichment, to recording transcriptions. In real conditions, and particularly in the human activity that takes place in meeting rooms or classrooms, the problem exhibits increased complexity compared to other domains and is especially challenging due to the spontaneity of speech, reverberation effects, the presence of overlapped speech, room setup and channel variability, and a rich assortment of acoustic events produced either by humans or by objects they handle. Therefore, determining both the identity of the speakers and their position in time may help detect and describe that human activity and provide machine context awareness. We first seek to improve traditional modeling approaches for speaker identification and verification, based on Gaussian Mixture Models, through multi-decision and multi-channel processing strategies in the smart-room scenario. We put emphasis on studying speaker and channel variability techniques such as Maximum a Posteriori adaptation, Nuisance Attribute Projection, Joint Factor Analysis, and score normalization, aiming to find strategies and techniques to deal with such variability.
Moreover, we describe a novel speaker verification algorithm that makes use of features adapted from automatic speech recognition. A second line of research concerns speaker detection in a continuous audio stream, where both the number of speakers and their identities are unknown a priori. We developed and adapted some of the previous speaker recognition techniques to a baseline speaker diarization system based upon Hidden Markov Models and Agglomerative Hierarchical Clustering (AHC). We evaluate the use of TDOA feature dynamics and other features to improve clustering initialization in the AHC and the detection and handling of speaker overlaps; we assess the impact of, and synergies with, technologies such as Speech Activity Detection and Acoustic Event Detection integrated with the diarization system; and we propose and compare new methods such as spectral clustering. Moreover, the adaptation of the diarization system to the broadcast news domain and to the speaker tracking task is also addressed. Finally, fusion and combination with video and image modalities is highlighted throughout this thesis, in both the speaker identification and tracking approaches. Techniques such as matching weighting and particle filtering are proposed to combine scores and likelihoods from different modalities. The results demonstrate that these information sources can also play an important role in automatic person recognition, adding complementary knowledge to traditional acoustic spectrum-based recognition systems and thus improving their accuracy. This thesis work was performed in the framework of several international and national projects, among them the CHIL EU project and the Catalan-funded project Tecnoparla, and through participation in technology evaluations such as CLEAR, NIST Rich Transcription (RT), NIST Speaker Recognition Evaluation (SRE), and the Spanish speaker tracking evaluation Albayzin.
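The abstract centers on GMM-based speaker modeling with Maximum a Posteriori adaptation. As a rough, self-contained sketch (not the thesis implementation; the function names, synthetic data, and relevance factor of 16 are illustrative assumptions), the classic GMM-UBM recipe MAP-adapts only the universal background model (UBM) means to a speaker's enrollment data and scores test frames by a log-likelihood ratio against the UBM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance=16.0):
    """MAP-adapt UBM component means to a speaker's enrollment frames."""
    post = ubm.predict_proba(features)           # (T, K) responsibilities
    n_k = post.sum(axis=0)                       # zeroth-order (soft counts)
    f_k = post.T @ features                      # first-order statistics (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]   # per-component adaptation weight
    return alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1 - alpha) * ubm.means_

def llr_score(ubm, speaker_means, features):
    """Log-likelihood ratio: speaker-adapted model vs. the UBM."""
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    # Standard GMM-UBM practice: reuse UBM weights/covariances, swap in adapted means
    spk.weights_ = ubm.weights_
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    spk.means_ = speaker_means
    return spk.score(features) - ubm.score(features)

rng = np.random.default_rng(0)
ubm_data = rng.normal(0, 1, size=(2000, 4))    # background population frames
target = rng.normal(0.8, 1, size=(300, 4))     # enrollment/test frames, target speaker
impostor = rng.normal(-0.8, 1, size=(300, 4))  # impostor frames

ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(ubm_data)
spk_means = map_adapt_means(ubm, target)
print(llr_score(ubm, spk_means, target) > llr_score(ubm, spk_means, impostor))
```

In a real system the frames would be MFCC vectors rather than synthetic Gaussians, and the relevance factor controls how strongly adapted means are pulled toward the enrollment statistics versus the UBM prior.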
This thesis presents research on speaker diarization for meeting rooms. It covers the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where usually more than one microphone is available. The main research and system implementation was done while visiting the International Computer Science Institute (ICSI, Berkeley, California) for a period of two years.

Speaker diarization is a well-studied topic in the broadcast news domain.
Most of the proposed systems involve some sort of hierarchical clustering of the data, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single input channel, which does not allow direct application to the meetings domain. Although some efforts had been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their use with data different from what they were adapted to.

The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up single-channel speaker diarization system from ICSI as a starting point, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then applies a training-free speech/non-speech detector to this signal and processes the resulting speech segments with an improved version of the single-channel speaker diarization system. This system has been modified to use speaker location information (when available), and several algorithms have been adapted or newly created so that the system adjusts its behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on development data.

The resulting system is flexible with respect to any meeting-room layout in terms of the number of microphones and their placement. It is training-free, making it easy to adapt to different sorts of data and domains of application.
Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the NIST RT05s and RT06s Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were carried out to test the different proposed algorithms, proving their suitability to the task.
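The bottom-up clustering with a stopping criterion described above can be illustrated with a minimal sketch. This is not the ICSI system: it models each cluster as a single full-covariance Gaussian and uses a BIC-style merge score (a common choice in the diarization literature); the function names, penalty weight, and toy data are illustrative assumptions.

```python
import numpy as np

def gauss_logdet(X):
    """Log-determinant of a cluster's full covariance (floored for stability)."""
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(Xa, Xb, lam=1.0):
    """BIC merge score for two single-Gaussian clusters.
    Negative value => one Gaussian explains both, so merging is favored."""
    n_a, n_b = len(Xa), len(Xb)
    n, d = n_a + n_b, Xa.shape[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * gauss_logdet(np.vstack([Xa, Xb]))
                  - n_a * gauss_logdet(Xa) - n_b * gauss_logdet(Xb)) - penalty

def bottom_up_cluster(segments):
    """Iteratively merge the best pair until no merge lowers the BIC score."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j])
                if best is None or score < best:
                    best, pair = score, (i, j)
        if best >= 0:          # stopping criterion: no beneficial merge remains
            break
        i, j = pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

rng = np.random.default_rng(1)
# Six initial segments drawn from two underlying "speakers"
segs = [rng.normal(0, 1, (80, 2)) for _ in range(3)] + \
       [rng.normal(5, 1, (80, 2)) for _ in range(3)]
print(len(bottom_up_cluster(segs)))   # 2 speakers recovered
```

In a real diarization system the clusters would be HMM states over MFCC streams and the merge decision would use richer models, but the greedy merge-until-stop loop is the same shape.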
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied to the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech. Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can corrupt single-speaker models and thus lead to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps: first, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features previously suggested in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site.
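The two-step feature selection described above (an mRMR ranking followed by a hill-climbing wrapper) can be sketched as a greedy forward pass over the ranked list, keeping each feature only if it improves a validation score. This is an assumed illustration, not the thesis code: the ranking is hard-coded, and a nearest-class-centroid accuracy stands in for a real classifier.

```python
import numpy as np

def wrapper_hill_climb(X, y, ranked, eval_fn):
    """Greedy forward selection over an mRMR-style ranked feature list:
    extend the subset only when the evaluation score improves."""
    chosen, best = [], -np.inf
    for f in ranked:
        score = eval_fn(X[:, chosen + [f]], y)
        if score > best:
            best, chosen = score, chosen + [f]
    return chosen, best

def centroid_accuracy(Xs, y):
    """Cheap stand-in for a real classifier: nearest-class-centroid accuracy."""
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)
    return (pred == (y == 1)).mean()

# Toy data: only features 0 and 1 carry class information
rng = np.random.default_rng(3)
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 5))
X[:, 0] += 2.0 * y            # informative feature
X[:, 1] -= 1.5 * y            # informative feature
ranked = [0, 1, 2, 3, 4]      # pretend this came from an mRMR ranking

subset, acc = wrapper_hill_climb(X, y, ranked, centroid_accuracy)
print(subset, round(acc, 2))
```

A real wrapper would score each candidate subset by cross-validated overlap-detection performance rather than training-set accuracy.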
In speaker diarization, a second speaker label is assigned to segments in which speaker overlap is detected, and such segments are also excluded from model training. The proposed overlap labeling technique is integrated into the Viterbi decoding stage of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and overlap labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT'09 meeting recordings as well. The addition of beamforming and a TDOA feature stream to the baseline diarization system, aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings, as well as between various settings of the overlap detection operating point. However, high performance variability across recordings is also typical of the baseline diarization system without any overlap handling.
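The spatial cross-correlation-based parameters and TDOA features mentioned above are typically derived from GCC-PHAT statistics between microphone pairs: the lag of the correlation peak gives the time difference of arrival, and the peak height is a per-pair confidence-like value usable as a feature. A minimal sketch (assumed, not the thesis code; names and the simulated delay are illustrative):

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """GCC-PHAT between two microphone channels.
    Returns the estimated delay in seconds and the normalized peak value."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -m..+m
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs, np.max(np.abs(cc))

fs = 16000
rng = np.random.default_rng(2)
src = rng.normal(size=fs)        # 1 s of noise-like source signal
delay = 25                       # simulated 25-sample propagation difference
mic1 = src
mic2 = np.roll(src, delay)
tau, peak = gcc_phat(mic2, mic1, fs=fs)
print(round(tau * fs))           # recovers the 25-sample delay
```

Per-frame TDOA values from several pairs, stacked over time, form the kind of spatial feature stream that diarization systems combine with spectral features.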