Universal Dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages, while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
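To make the abstract's core idea concrete, here is a minimal sketch of UD-style annotation for the sentence "She reads books": each word carries a part-of-speech class and a grammatical relation to its head, and the subject/object relations encode the predicate–argument structure. The tuple layout below is an illustrative simplification of the CoNLL-U format, not the full UD specification.

```python
# UD-style annotation for "She reads books."
# Each token: (form, UPOS tag, head index, dependency relation); head 0 = root.
sentence = [
    ("She",   "PRON", 2, "nsubj"),  # grammatical subject of the predicate
    ("reads", "VERB", 0, "root"),   # the predicate itself
    ("books", "NOUN", 2, "obj"),    # direct object of the predicate
]

# The predicate-argument structure is recoverable from the relations alone:
predicate = next(tok for tok in sentence if tok[3] == "root")
arguments = [tok for tok in sentence if tok[3] in ("nsubj", "obj")]
```

Because the relation labels (`nsubj`, `obj`) rather than word order carry the argument structure, the same scheme transfers to languages with freer word order.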
This work focuses on data mining applied to the clinical documentation domain. Diagnostic terms (DTs) are used as keywords to retrieve valuable information from electronic health records. Currently, they are encoded manually by experts following the International Classification of Diseases (ICD). The goal of this work is to explore how text mining can aid DT encoding. From the machine learning (ML) perspective, this is a high-dimensional classification task, as it comprises thousands of codes. This work delves into a robust representation of the instances to improve ML results. The proposed system is able to find the right ICD code among more than 1500 possible ICD codes with 92% precision for the main disease (primary class) and 88% for the main disease together with the nonessential modifiers (fully specified class). The methodology employed is simple and portable. According to the experts from public hospitals, the system is particularly useful for documentation and pharmacosurveillance services. In fact, they reported an accuracy of 91.2% on a small randomly extracted test set. Hence, together with this paper, we make the software publicly available in order to help the clinical and research community.
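To illustrate the kind of task the abstract describes (mapping a free-text diagnostic term to one of many ICD codes), here is a deliberately minimal sketch using token-overlap (Jaccard) similarity against code descriptions. This is not the paper's method or representation; the tiny codebook below uses a few real ICD-10 codes purely as illustrative data, and a production system would rank thousands of codes with a learned classifier.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())

def assign_icd(diagnostic_term, codebook):
    """Return the ICD code whose description best matches the DT
    by Jaccard similarity over token sets (a baseline, not the paper's ML model)."""
    dt_tokens = tokenize(diagnostic_term)
    def jaccard(description):
        desc_tokens = tokenize(description)
        return len(dt_tokens & desc_tokens) / len(dt_tokens | desc_tokens)
    return max(codebook, key=lambda code: jaccard(codebook[code]))

# Hypothetical mini-codebook (real ICD-10 codes, illustrative descriptions).
codebook = {
    "J18.9": "pneumonia unspecified organism",
    "I10":   "essential primary hypertension",
    "E11.9": "type 2 diabetes mellitus without complications",
}
```

A usage example: `assign_icd("unspecified pneumonia", codebook)` returns `"J18.9"`. The hard part the paper addresses is doing this reliably over 1500+ codes, where surface overlap alone is far too noisy.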
This paper presents experiments performed on lexical knowledge acquisition in the form of verbal argumental information. The system obtains the data from raw corpora after the application of a partial parser and statistical filters. We used two different statistical filters to acquire the argumental information: Mutual Information and Fisher's Exact test. Due to the characteristics of agglutinative languages like Basque, the usual classification of arguments in terms of their syntactic category (such as NP or PP) is not suitable. For that reason, the arguments are classified into 48 different kinds of case markers, which makes the system fine-grained compared to equivalent systems that have been developed for other languages. This work addresses the problem of distinguishing arguments from adjuncts, this being one of the most significant sources of noise in subcategorization frame acquisition.
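The two statistical filters named in the abstract can be sketched as follows. Pointwise mutual information scores how strongly a verb and a case marker co-occur relative to chance, and a one-sided Fisher's exact test gives a p-value for that association from a 2x2 contingency table of corpus counts. This is a generic sketch of the two statistics under standard definitions, not the paper's exact thresholds or implementation.

```python
from math import comb, log2

def pmi(joint, count_verb, count_case, total):
    """Pointwise mutual information of a (verb, case-marker) pair:
    log2( P(v, c) / (P(v) * P(c)) ) estimated from corpus counts."""
    return log2((joint * total) / (count_verb * count_case))

def fisher_right_tail(a, b, c, d):
    """One-sided (right-tail) Fisher's exact test p-value for the table
        [[a, b], [c, d]]
    where a = count(verb, case), b = count(verb, other cases),
    c = count(other verbs, case), d = count(other verbs, other cases).
    Sums the hypergeometric probability of tables at least as extreme as a."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1)) / denom
```

A pair would be kept as a candidate argument (rather than an adjunct) when the PMI is high and the Fisher p-value falls below some threshold; the counts and cutoffs here are the tunable parts.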