Multiword expressions (MWEs) are known as a "pain in the neck" for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one's heart or to turn off, have rarely been modelled. This is notably due to their syntactic variability, which hinders treating them as "words with spaces". We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.
Keywords: corpus design, Bulgarian National Corpus, computational linguistics

The Bulgarian National Corpus: Theory and Practice. Svetla Koeva et al. Journal of Language Modelling Vol 0, No 1 (2012), pp. 65-110

multilingual (and in particular parallel) corpora that cover a wide variety of styles, thematic domains, and genres.

This paper contributes to the discussion on the perspectives of corpus development in three ways: (i) by reconsidering several key traditional principles underlying corpus design, (ii) by proposing an approach in corpus design based on the revision of those fundamentals in light of recent advances in NLP technologies, (iii) by illustrating how the proposed model is applied in the Bulgarian National Corpus (BulNC).

The study is placed in the context of well-known corpora, both mono- and multilingual (Section 2), with an outline of their general features. The concepts of corpus size, balance, and representativeness are discussed in Section 3. In the same section we present our concept of corpora, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies for web crawling and language processing. Section 4 presents the process of compiling, structuring, documenting, and annotating the BulNC, followed by a brief evaluation of the quality of the corpus and an outline of some current applications.

2 Overview of contemporary monolingual and multilingual corpora

1 http://www.natcorp.ox.ac.uk
2 http://www.natcorp.ox.ac.uk/corpus/creating.xml

only written, but also spoken language, respectively 90% and 10% of the samples. It is POS-tagged, lemmatised, and supplied with detailed metatextual information. The corpus (text and annotated data) can be searched both online, through various search tools, and offline, using XAIRA.

2. The Corpus of Contemporary American English (COCA) is a 450+ million-word corpus currently in progress with an increase rate of 20 million words per year. The texts are evenly divided between 5 categories: spoken language, fiction, popular magazines, newspapers, and academic writing (Davies, 2010), each category currently containing 90 to 95 million tokens (as of June 2012). The corpus provides a web search interface (shared with the Google Books corpora) that allows searches for regular expressions and specifications for POS, lemma, collocations, frequency and distribution of synonyms. The queries may be refined in terms of genre or time period.

3. The Slovak National Corpus (SNK) contains more than 719 million tokens. The texts are divided into several categories with the following distribution: journalism (73%), literary texts (14%), professional texts (12%), and other (1%). A subcorpus of 1.2 million tokens, manually annotated with morphological tags, has also been compiled. The SNK and its subcorpora can be searched with a CQL (Corpus Query Language) compatible query syntax (Christ and Schulze, 1994) through a web interface or via the Bonito client, cf. the Czech National Corpus.

4. The Croatian Nat...
Regarding the description of the resources and their alignment, e.g. the number of synsets with assigned frames, we provide the data reported in the latter (more recent) paper.
Our work focuses on the conceptual description of verbs by employing two main resources – the lexical semantic network WordNet and the conceptual frames from FrameNet. We implement a method for inheritance-based mapping between the two resources by transferring the frame assignments from a hypernym to its hyponyms. We find that the method performs best for directly related pairs of synsets, but that assignment quality deteriorates at a distance of two or more steps. The mapping is then used for enhancing each of the resources by expanding it with new entries and by contributing to the resources’ relational structure.
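The idea of transferring a frame assignment from a hypernym to its hyponyms can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inherit_frames`, the `max_depth` cut-off, and the toy synset and frame labels are all assumptions made for illustration, and the hierarchy is a plain dictionary rather than real WordNet or FrameNet data.

```python
from collections import deque


def inherit_frames(hyponyms, frames, max_depth=1):
    """Propagate frame labels from hypernyms down to unlabelled hyponyms.

    hyponyms  -- dict: synset id -> list of its direct hyponym ids (toy data)
    frames    -- dict: synset id -> frame label assigned directly
    max_depth -- hypothetical cut-off on how many steps a frame may be inherited

    Returns a dict: synset id -> (frame label, inheritance distance).
    """
    # Direct assignments have distance 0 and are never overwritten.
    assigned = {synset: (frame, 0) for synset, frame in frames.items()}
    queue = deque(frames)  # breadth-first, so the nearest labelled ancestor wins
    while queue:
        parent = queue.popleft()
        frame, distance = assigned[parent]
        if distance >= max_depth:
            continue
        for child in hyponyms.get(parent, []):
            if child not in assigned:
                assigned[child] = (frame, distance + 1)
                queue.append(child)
    return assigned


# Toy hierarchy with made-up identifiers (not actual WordNet synset ids).
toy_hyponyms = {"move.v": ["walk.v", "run.v"], "walk.v": ["stroll.v"]}
toy_frames = {"move.v": "Motion", "run.v": "Self_motion"}

one_step = inherit_frames(toy_hyponyms, toy_frames, max_depth=1)
two_steps = inherit_frames(toy_hyponyms, toy_frames, max_depth=2)
```

Recording the inheritance distance alongside each frame makes it possible to separate direct hypernym-hyponym transfers from assignments made two or more steps away, which the abstract reports as less reliable.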