Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.312

A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI

Abstract: Automatic Speech Recognition (ASR) systems are increasingly powerful and more accurate, but also more numerous with several options existing currently as a service (e.g. Google, IBM, and Microsoft). Currently the most stringent standards for such systems are set within the context of their use in, and for, Conversational AI technology. These systems are expected to operate incrementally in real-time, be responsive, stable, and robust to the pervasive yet peculiar characteristics of conversational speech such a…

Cited by 14 publications (14 citation statements)
References 25 publications

“…However, reducing the overlap ratio by increasing the value of β also makes the duration of silence longer; thus, it is difficult to make the simulated mixture closer to a natural conversation by simply adjusting the value of β. Moreover, concatenating a zero vector in (6) to align the lengths of the long recordings causes unnaturalness of the conversation; various speakers speak at the beginning of the mixtures while only some of them speak at the end. In contrast, the proposed simulation protocol can generate more natural conversations by ordering each speaker's utterances to follow the statistics calculated from real conversational data.…”
Section: Conventional Simulation Method (mentioning)
confidence: 99%
“…Speaker diarization is the task of identifying speech segments and their speakers from audio or video recordings; in other words, a task to identify "who spoke when" [1]. It is widely utilized in a variety of applications such as meeting transcription [2,3], conversational interaction analysis [4], content-based audio indexing [5], and conversational AI [6]. It also helps improve the accuracy of automatic speech recognition (ASR) in multi-speaker conversations [7].…”
Section: Introduction (mentioning)
confidence: 99%
“…Related to this, spontaneous human language production is often disfluent - it contains many fillers (um, er, uh etc), pauses, re-starts, and repairs. We therefore work on incremental language processing in understanding and generation, and the handling of disfluencies [57], often using the framework of Dynamic Syntax [2,19].…”
Section: Incremental Processing (mentioning)
confidence: 99%
“…This paper will largely focus on the first two types of system and the research problems and directions that they generate, as they are most closely related to research in Multi-Agent Systems. In general we model each human conversation partner as an agent which has goals, plans, and preferences, and which can send signals (usually Natural Language speech or text) to other agents to convey its goals and request information, coordinate actions and plans etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…Since response time and online processing are the crucial factors in real-life settings, the demand for end-to-end speaker diarization system integrated into ASR pipeline is growing. The performance of incremental (online) ASR and speaker diarization of the commercial ASR services are evaluated and compared in [228]. It is expected that the real-time and low latency aspect of speaker diarization will be more emphasized in the speaker diarization systems in the future since the performance of online diarization and online ASR still have much room for improvement.…”
Section: Conversational AI (mentioning)
confidence: 99%