1995
DOI: 10.1073/pnas.92.22.10040
Toward the ultimate synthesis/recognition system.

Abstract: This paper predicts speech synthesis, speech recognition, and speaker recognition technology for the year 2001, and it describes the most important research problems to be solved in order to arrive at these ultimate synthesis and recognition systems. The problems for speech synthesis include natural and intelligible voice production, prosody control based on meaning, capability of controlling synthesized voice quality and choosing individual speaking style, multilingual and multidialectal synthesis, choice of …

Cited by 5 publications (2 citation statements)
References 18 publications
“…As computerized technology becomes an ever greater fixture at home and at work, our future interactions with it will need to become even more sophisticated (Wendemuth and Biundo, 2011; Honold et al., 2014). Some time ago, it was recommended that artificial speech synthesis technology should not only have the ability to control prosody based on meaning, but also the capability to control individual speaking style (another form of prosody), to choose application-oriented speaking styles, and to add emotion (Furui, 1995). Yet, as we have seen, there remains much work to be done (Burkhardt and Stegmann, 2009).…”

Section: What About the Future? (confidence: 99%)
“…Additional work on the social skills and responsivity with which HCI-AI are programmed will likely further increase the empathy and acceptance level of interactions (Leite et al., 2013). From the human-interface point of view, it has long been recognized that HCI-AI should be able to automatically acquire new knowledge about the thinking process of individual users, automatically correct user errors, and understand user intentions by accepting rough instructions and inferring details (Furui, 1995). Ultimately, the hope for the future is that HCI-AI could extract the prosodic cues from a user’s speech, capitalize on that information to inform predictive models of likely emotions (Litman and Forbes-Riley, 2006), and amend their own displays and actions accordingly.…”

Section: What About the Future? (confidence: 99%)