The Bulgarian National Corpus: Theory and Practice
Svetla Koeva et al.
Journal of Language Modelling Vol 0, No 1 (2012), pp. 65-110

Keywords: corpus design, Bulgarian National Corpus, computational linguistics

[...] multilingual (and in particular parallel) corpora that cover a wide variety of styles, thematic domains, and genres.

This paper contributes to the discussion on the perspectives of corpus development in three ways: (i) by reconsidering several key traditional principles underlying corpus design, (ii) by proposing an approach to corpus design based on the revision of those fundamentals in light of recent advances in NLP technologies, and (iii) by illustrating how the proposed model is applied in the Bulgarian National Corpus (BulNC).

The study is placed in the context of well-known corpora, both mono- and multilingual (Section 2), with an outline of their general features. The concepts of corpus size, balance, and representativeness are discussed in Section 3. In the same section we present our concept of corpora, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies for web crawling and language processing. Section 4 presents the process of compiling, structuring, documenting, and annotating the BulNC, followed by a brief evaluation of the quality of the corpus and an outline of some current applications.

2 Overview of contemporary monolingual and multilingual corpora

[...] only written, but also spoken language, respectively 90% and 10% of the samples. It is POS-tagged, lemmatised, and supplied with detailed metatextual information. The corpus (text and annotated data) can be searched both online, through various search tools, and offline using XAIRA.

Footnote 1: http://www.natcorp.ox.ac.uk
Footnote 2: http://www.natcorp.ox.ac.uk/corpus/creating.xml

2. The Corpus of Contemporary American English (COCA) is a 450+ million-word corpus currently in progress, with an increase rate of 20 million words per year. The texts are evenly divided between 5 categories: spoken language, fiction, popular magazines, newspapers, and academic writing (Davies, 2010), each category currently containing 90 to 95 million tokens (as of June 2012). The corpus provides a web search interface (shared with the Google Books corpora) that allows searches for regular expressions and specifications for POS, lemma, collocations, and the frequency and distribution of synonyms. The queries may be refined in terms of genre or time period.

3. The Slovak National Corpus (SNK) contains more than 719 million tokens. The texts are divided into several categories with the following distribution: journalism (73%), literary texts (14%), professional texts (12%), and other (1%). A subcorpus of 1.2 million tokens, manually annotated with morphological tags, has also been compiled. The SNK and its subcorpora can be searched with a CQL (Corpus Query Language) compatible query syntax (Christ and Schulze, 1994) through a web interface or via the Bonito client, cf. the Czech National Corpus.

4. The Croatian Nat...
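To give a feel for what the reported SNK distribution means in absolute terms, the following is a minimal sketch that turns the percentage shares into approximate token counts. It assumes the total is exactly 719 million tokens, whereas the text only says "more than 719 million", so the results are rough lower bounds; the function and variable names are illustrative, not part of any corpus tool.

```python
# Assumption: the SNK total is taken as exactly 719 million tokens,
# although the source reports "more than 719 million".
SNK_TOTAL_TOKENS = 719_000_000

# Category shares in whole percent, as reported for the SNK.
SNK_SHARES_PERCENT = {
    "journalism": 73,
    "literary texts": 14,
    "professional texts": 12,
    "other": 1,
}


def approximate_token_counts(total, shares_percent):
    """Return an estimated token count per category (integer arithmetic)."""
    return {
        category: total * percent // 100
        for category, percent in shares_percent.items()
    }


snk_counts = approximate_token_counts(SNK_TOTAL_TOKENS, SNK_SHARES_PERCENT)
for category, count in snk_counts.items():
    print(f"{category}: about {count / 1_000_000:.2f} million tokens")
```

Under this assumption, journalism alone would account for roughly 525 million tokens, which is more than the whole of COCA as described above.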
Regarding the description of the resources and their alignment, e.g. the number of synsets with assigned frames, we provide the data reported in the latter (more recent) paper.
Saint Petersburg State University (a), V. V. Vinogradov Russian Language Institute of the Russian Academy of Sciences (b), Institute for Bulgarian Language "Prof. Lyubomir Andreychin", Bulgarian Academy of Sciences (c),
The study focuses on the semantic and conceptual description of stative verbs. We analyze stative verbs represented in WordNet and the corresponding frames in FrameNet after the alignment between the two resources. After presenting a classification of stative verbs into thematic classes, we outline the components of the conceptual description based on FrameNet frames, the relations between them, and the frame elements that describe the frames. We attempt to build a hierarchical structure of frames for each thematic class and a shallow hierarchy of frame elements, with a view to their representation and specialization from a more general (parent) frame to more specific (child) frames related to the general one by means of relations such as inheritance, weak inheritance, or perspectivization.
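The kind of structure described here, frames linked to more general frames and frame elements specialized from parent to child, could be represented roughly as follows. This is only a sketch under stated assumptions: the class, its fields, and the frame and element names (GeneralState, SpecificState, Entity, Experiencer) are hypothetical placeholders, not actual FrameNet entries or the authors' data model, and each frame is given a single parent for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Relation labels named in the abstract.
RELATIONS = ("inheritance", "weak inheritance", "perspectivization")


@dataclass
class Frame:
    name: str
    # Maps each frame element to the parent-frame element it specializes
    # (None when the element is introduced by this frame itself).
    elements: Dict[str, Optional[str]] = field(default_factory=dict)
    # The more general frame and the relation that links to it, if any.
    parent: Optional[Tuple[str, "Frame"]] = None

    def link_to(self, relation: str, parent: "Frame") -> None:
        """Attach this frame to a more general frame via a named relation."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.parent = (relation, parent)

    def lineage(self) -> List[str]:
        """Frame names from this frame up to the most general one."""
        names, current = [self.name], self
        while current.parent is not None:
            current = current.parent[1]
            names.append(current.name)
        return names


# Hypothetical placeholder frames, not taken from FrameNet.
general = Frame("GeneralState", {"Entity": None})
specific = Frame("SpecificState", {"Experiencer": "Entity"})
specific.link_to("inheritance", general)

print(specific.lineage())
```

Storing the specialized parent element alongside each child element keeps the element hierarchy shallow, as the abstract describes, while the frame-to-frame link carries the relation type.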