Michiya Yamamoto scite author profile

Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N+1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun. The second half of the paper uses these frequencies to find “interesting” substrings. Lexicographers have been interested in n-grams with high mutual information (MI) where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption) whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents). The combination of both MI and RIDF is better than either by itself in a Japanese word extraction task.

show abstract

Paraoxonase gene polymorphism in Japanese subjects with coronary heart disease

Suehiro

Nakauchi

Yamamoto

et al. 1996

International Journal of Cardiology

124

View full text Add to dashboard Cite

Significance of angiotensin I-converting enzyme and angiotensin II type 1 receptor gene polymorphisms as risk factors for coronary heart disease

et al. 1996

View full text Add to dashboard Cite

Japanese Dictation Toolkit. 1997 version.

Kawahara

Lee

Kobayashi

et al. 1999

J. Acoust. Soc. Jpn. (E), J Acoust Soc Jpn E

View full text Add to dashboard Cite

The Japanese Dictation Toolkit has been designed and developed as a baseline platform for Japanese LVCSR (Large Vocabulary Continuous Speech Recognition). The platform consists of a standard recognition engine, Japanese phone models and Japanese statistical language models. We set up a variety of Japanese phone HMMs from a contextindependent monophone to a triphone model of thousands of states. They are trained with ASJ (The Acoustical Society of Japan) databases. A lexicon and word N-gram (2-gram and 3-gram) models are constructed with a corpus of Mainichi newspaper. The recognition engine JULIUS is developed for evaluation of both acoustic and language models. As an integrated system of these modules, we have implemented a baseline 5,000-word dictation system and evaluated various components. The software repository is available to the public. +1

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.