Universal dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
We propose the "grammatical profile" as a means of probing the aspectual behavior of verbs. A grammatical profile is the relative frequency distribution of the inflected forms of a word in a corpus. The grammatical profiles of Russian verbs provide data on two crucial issues: a) the overall relationship between perfective and imperfective verbs and b) the identification of verbs that characterize various intersections of aspect, tense and mood (TAM) with lexical classes. There is a long-standing debate over whether Russian aspectual "pairs" are formed only via suffixation (the Isačenko hypothesis) or whether they are formed via both suffixation and prefixation (the traditional view). We test the Isačenko hypothesis using data on the corpus frequency of inflected forms of verbs. We find that the behavior of perfective and imperfective verbs is the same regardless of whether the aspectual relationship is marked by prefixes or suffixes; our finding thus supports the traditional view. Introspective descriptions of Russian aspect have often connected the use of particular inflectional forms with certain uses of aspect; for example, the use of imperative forms with the imperfective aspect to produce expressions that are very polite. Grammatical profiles make it possible to identify verbs that behave as outliers, presenting unusually large proportions of usage in parts of the paradigm. This analysis both gives substance to and extends previous introspective descriptions by identifying the verbs most involved in certain TAM-category interactions. On a methodological level, this study contributes to current discussions on the use of inflected forms vs. lemmas in corpus studies. Newman (2008) finds valuable information at the level of the inflectional form, and Gries (forthcoming) argues that inflectional forms do not necessarily provide a better basis for analysis than lemmas. We agree with them that the appropriate level of granularity is determined by both the language and the linguistic phenomenon under analysis.
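The definition of a grammatical profile given above (the relative frequency distribution of a word's inflected forms) can be illustrated with a minimal sketch. The function name `grammatical_profile`, the tag labels, and the toy data are assumptions made for illustration only; they are not the authors' tooling or corpus counts.

```python
from collections import Counter


def grammatical_profile(observations, lemma):
    """Return the relative frequency of each grammatical tag for one lemma.

    `observations` is an iterable of (lemma, tag) pairs, e.g. taken from a
    morphologically annotated corpus (hypothetical input format).
    """
    counts = Counter(tag for lem, tag in observations if lem == lemma)
    total = sum(counts.values())
    if total == 0:
        return {}  # lemma not attested in the sample
    return {tag: n / total for tag, n in counts.items()}


# Toy (lemma, tag) pairs invented for illustration; not real corpus data.
toy_corpus = [
    ("pisat'", "past"), ("pisat'", "past"), ("pisat'", "nonpast"),
    ("pisat'", "infinitive"), ("napisat'", "past"), ("napisat'", "imperative"),
]

profile = grammatical_profile(toy_corpus, "pisat'")
print(profile)
```

Because each profile sums to 1, profiles of verbs with very different overall frequencies can be compared directly, which is what makes outlier verbs in a given part of the paradigm visible.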
A new kind of frequency dictionary is a valuable reference for researchers and students of Russian. It shows the grammatical profiles of nouns, adjectives, and verbs, namely the distribution of grammatical forms in the inflectional paradigm. The dictionary is based on data from the Russian National Corpus (RNC) and covers a core vocabulary (5,000 most frequently used lexemes). Russian is a morphologically rich language: its noun paradigms harbor two dozen case and number forms, while verb paradigms include up to 160 grammatical forms. The dictionary departs from traditional frequency lexicography in several ways: 1) word forms are arranged in paradigms, so their frequencies can be compared and ranked; 2) the dictionary is focused on the grammatical profiles of individual lexemes, rather than on the overall distribution of grammatical features (e.g., the fact that Future forms are used less frequently than Past forms); 3) the grammatical profiles of lexical units can be compared against the mean scores of their lexicosemantic class; 4) in each part of speech or semantic class, lexemes with certain biases in the grammatical profile can be easily detected (e.g. verbs used mostly in the Imperative, Past neutral, or nouns often used in the plural); and, 5) the distribution of homonymous word forms and grammatical variants can be followed over time and within certain genres and registers. The dictionary will be a source for research in the field of Russian grammar, paradigm structure, form acquisition, grammatical semantics, as well as variation of grammatical forms. The main challenge for this initiative is the intra-paradigm and inter-paradigm homonymy of word forms in the corpus data. Manual disambiguation is accurate but covers approximately five million words in the RNC, so the data may be sparse and possibly unreliable. Automatic disambiguation yields slightly worse results. However, a larger corpus shows more reliable data for rare word forms. 
A user can switch between a ‘basic’ version, which is based on a smaller collection of manually disambiguated texts, and an ‘expanded’ version, which is based on the main corpus, a newspaper corpus, a corpus of poetry, and the spoken corpus (320 million words in total). The article addresses some general issues, such as establishing the common basis of comparison, a level of granularity for the grammatical profile, and units of measurement. We suggest certain solutions related to the selection of data, corpus data processing, and maintaining the online version of the frequency dictionary.
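Comparing a lexeme's profile against the mean of its class, and detecting lexemes biased toward a particular form, might be sketched as follows. The abstract does not specify the dictionary's actual bias measure, so the simple difference from the class mean and the threshold of 0.2 are assumptions, as are the function names and the toy profiles.

```python
def mean_profile(profiles):
    """Average a list of profiles (dicts mapping form -> relative frequency)."""
    forms = set().union(*profiles)
    return {f: sum(p.get(f, 0.0) for p in profiles) / len(profiles) for f in forms}


def biased_forms(profile, class_mean, threshold=0.2):
    """Return forms whose share exceeds the class mean by more than `threshold`.

    The threshold is an arbitrary illustrative value, not one from the source.
    """
    excess = {f: profile.get(f, 0.0) - m for f, m in class_mean.items()}
    return {f: d for f, d in excess.items() if d > threshold}


# Toy profiles for three verbs of one hypothetical semantic class.
class_profiles = [
    {"imperative": 0.6, "past": 0.4},
    {"imperative": 0.1, "past": 0.9},
    {"imperative": 0.2, "past": 0.8},
]

class_mean = mean_profile(class_profiles)
flagged = biased_forms(class_profiles[0], class_mean)
print(flagged)
```

In this sketch the class mean includes the lexeme being tested; excluding it, or using a statistical test instead of a fixed threshold, would be reasonable alternatives.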