Svetlozara Leseva scite author profile

Svetlozara Leseva

4 Publications

11 Citation Statements Received

38 Citation Statements Given


Affiliations

Institute for Bulgarian Language, Bulgarian Academy of Sciences

Publications


The Bulgarian National Corpus: Theory and Practice in Corpus Design

Koeva, Stoyanova, Leseva et al. 2012, JLM


Keywords: corpus design, Bulgarian National Corpus, computational linguistics

Journal of Language Modelling Vol 0, No 1 (2012), pp. 65-110

... multilingual (and in particular parallel) corpora that cover a wide variety of styles, thematic domains, and genres.

This paper contributes to the discussion on the perspectives of corpus development in three ways: (i) by reconsidering several key traditional principles underlying corpus design, (ii) by proposing an approach in corpus design based on the revision of those fundamentals in light of recent advances in NLP technologies, (iii) by illustrating how the proposed model is applied in the Bulgarian National Corpus (BulNC).

The study is placed in the context of well-known corpora, both mono- and multilingual (Section 2), with an outline of their general features. The concepts of corpus size, balance, and representativeness are discussed in Section 3. In the same section we present our concept of corpora, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies for web crawling and language processing. Section 4 presents the process of compiling, structuring, documenting, and annotating the BulNC, followed by a brief evaluation of the quality of the corpus and an outline of some current applications.

2 Overview of contemporary monolingual and multilingual corpora

Footnotes: 1 http://www.natcorp.ox.ac.uk; 2 http://www.natcorp.ox.ac.uk/corpus/creating.xml

... only written, but also spoken language, respectively 90% and 10% of the samples. It is POS-tagged, lemmatised, and supplied with detailed metatextual information. The corpus (text and annotated data) can be searched both online, through various search tools, and offline, using XAIRA.

2. The Corpus of Contemporary American English (COCA) is a 450+ million-word corpus currently in progress with an increase rate of 20 million words per year. The texts are evenly divided between 5 categories: spoken language, fiction, popular magazines, newspapers, and academic writing (Davies, 2010), each category currently containing 90 to 95 million tokens (as of June 2012). The corpus provides a web search interface (shared with the Google Books corpora) that allows searches for regular expressions and specifications for POS, lemma, collocations, frequency and distribution of synonyms. The queries may be refined in terms of genre or time period.

3. The Slovak National Corpus (SNK) contains more than 719 million tokens. The texts are divided into several categories with the following distribution: journalism (73%), literary texts (14%), professional texts (12%), and other (1%). A subcorpus of 1.2 million tokens, manually annotated with morphological tags, has also been compiled. The SNK and its subcorpora can be searched with a CQL (Corpus Query Language) compatible query syntax (Christ and Schulze, 1994) through a web interface or via the Bonito client, cf. the Czech National Corpus.

4. The Croatian Nat...


Towards a Semantic Network Enriched With a Variety of Semantic Relations

Koeva, Leseva, Stoyanova et al. 2020


Subject of Opinion and Subject of Evaluation Dative in Colloquial Bulgarian (as compared with Russian)

Ivanova, Kustova, Leseva 2022


Stative Verbs: Conceptual Structure, Hierarchy, Systemic Relations

Leseva, Stoyanova 2022


The study focuses on the semantic and conceptual description of stative verbs. We analyze stative verbs represented in WordNet and the corresponding frames in FrameNet after the alignment between the two resources. After presenting a classification of stative verbs into thematic classes, we outline the components of the conceptual description based on FrameNet frames, the relations between them, and the frame elements that describe the frames. We attempt to build a hierarchical structure of frames for each thematic class and a shallow hierarchy of frame elements, with a view to their representation and specialization from a more general (parent) frame to more specific (child) frames related to the general one by means of relations such as inheritance, weak inheritance, or perspectivization.

