Continuous speech was treated as if produced by a finite-state machine making a transition every centisecond. The observable output from state transitions was considered to be a power spectrum—a probabilistic function of the target state of each transition. Using this model, observed sequences of power spectra from real speech were decoded as sequences of acoustic states by means of the Viterbi trellis algorithm. The finite-state machine used as a representation of the speech source was composed of machines representing words, combined according to a “language model.” When trained to the voice of a particular speaker, the decoder recognized seven-digit telephone numbers correctly 96% of the time, with a better than 99% per-digit accuracy. Results for other tests of the system, including syllable and phoneme recognition, will also be given.
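The decoding step described above can be illustrated with a minimal sketch of Viterbi decoding over a small finite-state machine. This is not the system's implementation: the two states, the discrete symbols "lo" and "hi" (standing in for power spectra, which in the real system are continuous), and all probabilities below are hypothetical values chosen only for illustration. The output probability is attached to the target state of each transition, as in the model described.

```python
import math

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Return the most likely state sequence and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s
    # after consuming observations[0..t].
    best = [{s: log_init[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            # The output depends on the target state s of the transition.
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def to_logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical toy model (assumed values, not taken from the paper).
STATES = ["A", "B"]
LOG_INIT = {"A": math.log(0.9), "B": math.log(0.1)}
LOG_TRANS = to_logs({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.1, "B": 0.9}})
LOG_EMIT = to_logs({"A": {"lo": 0.8, "hi": 0.2}, "B": {"lo": 0.1, "hi": 0.9}})

if __name__ == "__main__":
    path, logp = viterbi(["lo", "lo", "hi", "hi"], STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)
    print(path, math.exp(logp))
```

Working in log-probabilities avoids numerical underflow on long observation sequences, which matters when one transition is made every centisecond. In a word-level decoder, the state set would be the composite machine built from the word machines and the language model, and the decoded state path would be mapped back to the words it traverses.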