Universal Dependencies (UD) is a framework for the morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages, while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
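As a minimal sketch of how UD's grammatical relations expose predicate–argument structure, here is a small Python representation of a UD-annotated sentence. The Token class and the example sentence are our own illustration, not from the article, though the relation labels (det, nsubj, obj, root) and the feature notation are genuine UD conventions; the features shown are a subset of a full CoNLL-U annotation.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One word of a UD-annotated sentence (a subset of CoNLL-U fields)."""
    id: int
    form: str
    upos: str      # universal part-of-speech tag
    feats: str     # morphological features, e.g. "Number=Sing"
    head: int      # id of the syntactic head, 0 for the root
    deprel: str    # universal grammatical relation to the head

# "The dog chased a cat." annotated with universal relations:
sent = [
    Token(1, "The", "DET", "Definite=Def|PronType=Art", 2, "det"),
    Token(2, "dog", "NOUN", "Number=Sing", 3, "nsubj"),
    Token(3, "chased", "VERB", "Tense=Past", 0, "root"),
    Token(4, "a", "DET", "Definite=Ind|PronType=Art", 5, "det"),
    Token(5, "cat", "NOUN", "Number=Sing", 3, "obj"),
]

# The grammatical relations make the predicate-argument structure explicit:
predicate = next(t for t in sent if t.deprel == "root")
args = {t.deprel: t.form
        for t in sent
        if t.head == predicate.id and t.deprel in ("nsubj", "obj", "iobj")}
print(predicate.form, args)   # chased {'nsubj': 'dog', 'obj': 'cat'}
```

Because the relation labels (nsubj, obj) are the same across languages, the same extraction logic works on treebanks for typologically different languages even when word order and morphology differ.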
We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.
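The paper trains over phoneme lattices with WFSTs, which is beyond a short example, but the core idea of jointly sampling a segmentation and a unigram lexicon can be sketched on a single phoneme string. The following Python is our own much-simplified illustration (a Dirichlet-process unigram model with pointwise boundary sampling, in the spirit of Gibbs-sampled word segmentation); ALPHA, P_STOP, and all function names are assumptions, not the paper's.

```python
import math
import random
from collections import Counter

# Hyperparameters (our own illustrative choices, not from the paper).
ALPHA = 1.0    # Dirichlet process concentration
P_STOP = 0.5   # geometric word-length prior inside the base distribution

def base_prob(word, n_symbols):
    """Base distribution G0: geometric length prior, uniform phoneme choice."""
    return P_STOP * ((1 - P_STOP) ** (len(word) - 1)) * (1.0 / n_symbols) ** len(word)

def crp_prob(word, counts, total, n_symbols):
    """Chinese-restaurant-process predictive probability of `word`."""
    return (counts[word] + ALPHA * base_prob(word, n_symbols)) / (total + ALPHA)

def gibbs_segment(phonemes, n_iters=500, seed=0):
    """Jointly infer word boundaries and a unigram lexicon/LM by Gibbs sampling."""
    rng = random.Random(seed)
    n_sym = len(set(phonemes))
    # bounds[i] is True if a word boundary follows phoneme i
    bounds = [rng.random() < 0.5 for _ in range(len(phonemes) - 1)]

    def words():
        out, start = [], 0
        for i, b in enumerate(bounds):
            if b:
                out.append(phonemes[start:i + 1])
                start = i + 1
        out.append(phonemes[start:])
        return out

    counts = Counter(words())
    total = sum(counts.values())
    for _ in range(n_iters):
        for i in range(len(bounds)):
            # Maximal span of phonemes whose segmentation boundary i affects.
            left = i
            while left > 0 and not bounds[left - 1]:
                left -= 1
            right = i + 1
            while right < len(bounds) and not bounds[right]:
                right += 1
            whole = phonemes[left:right + 1]
            w1, w2 = phonemes[left:i + 1], phonemes[i + 1:right + 1]
            # Remove the words currently covering this span from the counts.
            if bounds[i]:
                counts[w1] -= 1; counts[w2] -= 1; total -= 2
            else:
                counts[whole] -= 1; total -= 1
            # Posterior odds of splitting vs. merging at boundary i.
            p_merge = crp_prob(whole, counts, total, n_sym)
            p_first = crp_prob(w1, counts, total, n_sym)
            bonus = 1 if w1 == w2 else 0   # w1 is already "seated" when w2 is drawn
            p_second = (counts[w2] + bonus + ALPHA * base_prob(w2, n_sym)) / (total + 1 + ALPHA)
            p_split = p_first * p_second
            bounds[i] = rng.random() < p_split / (p_split + p_merge)
            # Re-add the sampled configuration.
            if bounds[i]:
                counts[w1] += 1; counts[w2] += 1; total += 2
            else:
                counts[whole] += 1; total += 1
    return words()

print(gibbs_segment("badcatbaddogbadcat"))
```

The real method replaces the single input string with a phoneme lattice, so segmentation hypotheses and recognition hypotheses are sampled jointly; the WFST machinery handles that composition.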
We address corpus-building situations where complete annotation of the whole corpus is time-consuming and unrealistic, so annotation is done only on the crucial parts of sentences, or contains unresolved label ambiguities. We propose a parameter estimation method for Conditional Random Fields (CRFs) which enables us to use such incomplete annotations. We show promising results of our method as applied to two types of NLP tasks: a domain adaptation task for Japanese word segmentation using partial annotations, and a part-of-speech tagging task using ambiguous tags in the Penn Treebank corpus.
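One established way to train a CRF with such incomplete annotations, which we believe underlies methods of this kind, is to maximize the marginal likelihood of all label sequences consistent with the partial annotation: the objective is log Z over the constrained lattice minus log Z over the full lattice. A minimal sketch of that objective follows; the feature weights and gradient computation are omitted, and all names are our own.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_z(emit, trans, allowed):
    """Forward algorithm over a lattice restricted to the `allowed` label sets.
    emit[t][y] and trans[yp][y] are log-potentials of a linear-chain CRF."""
    alpha = {y: emit[0][y] for y in allowed[0]}
    for t in range(1, len(emit)):
        alpha = {
            y: emit[t][y] + logsumexp([a + trans[yp][y] for yp, a in alpha.items()])
            for y in allowed[t]
        }
    return logsumexp(list(alpha.values()))

def marginal_log_likelihood(emit, trans, labels, partial):
    """log p(annotated part | x) = log Z(consistent paths) - log Z(all paths).
    `partial` maps a position to the set of labels permitted there; positions
    left out are unconstrained (unannotated), and a multi-label set at a
    position expresses an unresolved ambiguity."""
    T = len(emit)
    consistent = [partial.get(t, set(labels)) for t in range(T)]
    full = [set(labels)] * T
    return log_z(emit, trans, consistent) - log_z(emit, trans, full)

# Toy usage: 3 positions, 2 labels, only position 1 annotated (as "B").
labels = {"B", "I"}
emit = [{"B": 0.5, "I": 0.1}, {"B": 1.0, "I": 0.2}, {"B": 0.0, "I": 0.3}]
trans = {"B": {"B": 0.1, "I": 0.4}, "I": {"B": 0.2, "I": 0.6}}
print(marginal_log_likelihood(emit, trans, labels, {1: {"B"}}))
```

Fully annotated data is the special case where every position's allowed set is a single label, recovering the usual CRF log-likelihood.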
In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data and for a large n has never been carried out because of the memory limitations of computers and the shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for computing n-grams of large text data for arbitrarily large n and successfully calculated, within a relatively short time, n-grams of some Japanese text data containing between two and thirty million characters.
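As we understand it, the key device is to sort pointers to the suffixes of the text so that identical n-grams become adjacent and can be counted in one linear scan, with memory proportional to the text length rather than to the number of distinct n-grams. A small Python sketch of that idea, with our own names; the published algorithm's details may differ (notably, sorting full suffixes once serves every n at the same time).

```python
def ngram_counts(text, n):
    """Count all character n-grams of `text` via a sorted suffix array.

    Sorting suffix start positions puts equal n-grams next to each other,
    so one linear scan over the sorted order yields every count. The array
    holds len(text) integers, regardless of how many distinct n-grams exist.
    """
    sa = sorted(range(len(text) - n + 1), key=lambda i: text[i:i + n])
    counts, start = [], 0
    for k in range(1, len(sa) + 1):
        if k == len(sa) or text[sa[k]:sa[k] + n] != text[sa[start]:sa[start] + n]:
            counts.append((text[sa[start]:sa[start] + n], k - start))
            start = k
    return counts

# Example: all 2-grams of a short string, with frequencies.
print(ngram_counts("abracadabra", 2))
```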
Unknown words are inevitable at any step of analysis in natural language processing. We propose a method to extract words from a corpus and estimate the probability that each word belongs to given parts of speech (POSs), using a distributional analysis. Our experiments have shown that this method is effective for inferring the POS of unknown words.
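The abstract does not spell out the distributional analysis, so the following Python sketch is only one plausible instantiation: compare the unknown word's distribution of following words against pooled context profiles of known words for each POS, and normalize the similarities into probabilities. The function names, the following-word context, and the cosine measure are all our own assumptions.

```python
import math
from collections import Counter, defaultdict

def context_profile(corpus, target):
    """Count the words that immediately follow `target` in the corpus."""
    prof = Counter()
    for sent in corpus:
        for i in range(len(sent) - 1):
            if sent[i] == target:
                prof[sent[i + 1]] += 1
    return prof

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pos_probabilities(corpus, lexicon, unknown):
    """Estimate P(POS | unknown word) by comparing context profiles.
    `lexicon` maps known words to their POS tags."""
    pooled = defaultdict(Counter)
    for word, pos in lexicon.items():
        pooled[pos].update(context_profile(corpus, word))
    u = context_profile(corpus, unknown)
    scores = {pos: cosine(u, prof) for pos, prof in pooled.items()}
    z = sum(scores.values()) or 1.0
    return {pos: s / z for pos, s in scores.items()}

corpus = [["the", "dog", "runs"], ["the", "cat", "runs"], ["the", "blorf", "runs"]]
lexicon = {"dog": "NOUN", "cat": "NOUN", "runs": "VERB", "the": "DET"}
print(pos_probabilities(corpus, lexicon, "blorf"))  # "blorf" scores as NOUN-like
```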