This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contextual and lexical probabilities. These approximations make use of more contextual information than standard statistical systems, and the resulting model increases the accuracy of the tagger to state-of-the-art levels. New methods of smoothing the estimated probabilities are also introduced to address the sparse data problem.
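The abstract gives no formulas, so the following is only a sketch of how a second-order approximation is commonly written, shown next to the usual first-order model. The notation ($w_i$ for the $i$-th word, $t_i$ for its tag) and the exact conditioning are assumptions for illustration, not taken from the paper.

```latex
% First-order HMM tagger: each tag depends on the previous tag,
% and each word depends only on its own tag.
\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

% Second-order approximation (assumed form): the contextual probability
% conditions on the two previous tags, and the lexical probability
% conditions on the current and the previous tag.
\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_{i-1}, t_i)
```

The larger conditioning contexts mean that many tag and word combinations are rarely or never seen in training, which is why smoothing matters for the sparse data problem mentioned above.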
Proper names have several properties that create problems for speech recognition systems: the number of names is large and ever changing, names can be borrowed directly from other languages and may not conform to usual pronunciation rules, and the variety of pronunciations for names can be high. Because the set of proper names is so dynamic and machines are notoriously poor at phoneme recognition, a promising approach to designing a name recognition system is to incorporate statistical aspects of proper names (e.g., frequency, familiarity). Unfortunately, there exists relatively little data on the distribution of names. Ratings of familiarity and pronounceability were obtained for a randomly chosen sample of 199 surnames (from 80 987 entries in the Purdue phonebook) and 199 nouns (from Kucera–Francis). The ratings for nouns versus names are substantially different: nouns were rated as more familiar and easier to pronounce than surnames. Frequency and familiarity were more closely related in the proper name pool than the word pool, although the correlations were modest. Ratings of familiarity and pronounceability were highly related for both groups. The value of using frequency and the ratings of familiarity and pronounceability for predicting variations in actual pronunciations of words and names will be discussed.
Ratings of familiarity and pronounceability were obtained from a random sample of 199 surnames (selected from over 80,000 entries in the Purdue University phone book) and 199 nouns (from the Kucera-Francis, 1967, word database). The distributions of ratings for nouns versus names are substantially different: Nouns were rated as more familiar and easier to pronounce than surnames. Frequency and familiarity were more closely related in the proper name pool than the word pool, although both correlations were modest. Ratings of familiarity and pronounceability were highly related for both groups. A production experiment showed that rated pronounceability was highly related to the time taken to produce a name. These data confirm the common belief that there are differences in the statistical and distributional properties of words as compared to proper names. The value of using frequency and the ratings of familiarity and pronounceability for predicting variations in actual pronunciations of words and names is discussed.

Recently, there has been an explosion of research regarding how people store and access proper names. The interest has been enough to generate a special issue of the journal Memory (Cohen & Burke, 1993) and at least one entire book on the subject (Valentine, Brennen, & Bredart, 1996). Much of the research has concentrated on how names and faces are related and on whether names and other words are represented separately in the lexicon. The studies reported here are meant to provide a corpus of ratings of familiarity and pronounceability of surnames for use in experiments in human lexical access of names and for predicting variability in pronunciations of names for eventual use in applications such as computer speech recognizers.
This paper examines the feasibility of using statistical methods to train a part-of-speech predictor for unknown words. By using statistical methods, without incorporating hand-crafted linguistic information, the predictor could be used with any language for which there is a large tagged training corpus. Encouraging results have been obtained by testing the predictor on unknown words from the Brown corpus. The relative value of information sources such as affixes and context is discussed. This part-of-speech predictor will be used in a part-of-speech tagger to handle out-of-lexicon words.
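As a rough illustration of how affix statistics alone can drive such a predictor, the sketch below counts which tags co-occur with word-final suffixes and backs off from longer to shorter suffixes. It is not the paper's method: the function names, the suffix-length limit, and the tiny toy corpus are all hypothetical, and contextual information is left out entirely.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_words, max_suffix_len=3):
    """Count how often each word-final suffix co-occurs with each tag."""
    suffix_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for word, tag in tagged_words:
        tag_counts[tag] += 1
        w = word.lower()
        for n in range(1, min(max_suffix_len, len(w)) + 1):
            suffix_tag_counts[w[-n:]][tag] += 1
    return suffix_tag_counts, tag_counts


def predict_tag(word, suffix_tag_counts, tag_counts, max_suffix_len=3):
    """Use the longest suffix seen in training; otherwise fall back
    to the most frequent tag overall."""
    w = word.lower()
    for n in range(min(max_suffix_len, len(w)), 0, -1):
        counts = suffix_tag_counts.get(w[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return tag_counts.most_common(1)[0][0]


# Hypothetical toy data standing in for a large tagged corpus.
TOY_CORPUS = [
    ("running", "VBG"), ("jumping", "VBG"), ("singing", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("happiness", "NN"), ("kindness", "NN"), ("table", "NN"),
]

suffix_model, tag_totals = train_suffix_model(TOY_CORPUS)
print(predict_tag("walking", suffix_model, tag_totals))
print(predict_tag("sadly", suffix_model, tag_totals))
```

Because nothing here is language-specific, the same counting could in principle be run on any tagged corpus, which is the portability argument the abstract makes. A fuller predictor would presumably combine these suffix counts with the tags of neighbouring words.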