Ekaterina Tarpomanova scite author profile

Ekaterina Tarpomanova

4 Publications

9 Citation Statements Received

23 Citation Statements Given


Affiliations

Institute for Bulgarian Language, Bulgarian Academy of Sciences

Publications


The Bulgarian National Corpus: Theory and Practice in Corpus Design

Koeva, Stoyanova, Leseva, et al. 2012, JLM


Keywords: corpus design, Bulgarian National Corpus, computational linguistics

Journal of Language Modelling Vol 0, No 1 (2012), pp. 65-110

... multilingual (and in particular parallel) corpora that cover a wide variety of styles, thematic domains, and genres.

This paper contributes to the discussion on the perspectives of corpus development in three ways: (i) by reconsidering several key traditional principles underlying corpus design, (ii) by proposing an approach in corpus design based on the revision of those fundamentals in light of recent advances in NLP technologies, (iii) by illustrating how the proposed model is applied in the Bulgarian National Corpus (BulNC).

The study is placed in the context of well-known corpora, both mono- and multilingual (Section 2), with an outline of their general features. The concepts of corpus size, balance, and representativeness are discussed in Section 3. In the same section we present our concept of corpora, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies for web crawling and language processing. Section 4 presents the process of compiling, structuring, documenting, and annotating the BulNC, followed by a brief evaluation of the quality of the corpus and an outline of some current applications.

2 Overview of contemporary monolingual and multilingual corpora

Footnotes: 1 http://www.natcorp.ox.ac.uk; 2 http://www.natcorp.ox.ac.uk/corpus/creating.xml

... only written, but also spoken language, respectively 90% and 10% of the samples. It is POS-tagged, lemmatised, and supplied with detailed metatextual information. The corpus (text and annotated data) can be searched both online, through various search tools, and offline using XAIRA.

2. The Corpus of Contemporary American English (COCA) is a 450+ million-word corpus currently in progress with an increase rate of 20 million words per year. The texts are evenly divided between 5 categories: spoken language, fiction, popular magazines, newspapers, and academic writing (Davies, 2010), each category currently containing 90 to 95 million tokens (as of June 2012). The corpus provides a web search interface (shared with the Google Books corpora) that allows searches for regular expressions and specifications for POS, lemma, collocations, frequency and distribution of synonyms. The queries may be refined in terms of genre or time period.

3. The Slovak National Corpus (SNK) contains more than 719 million tokens. The texts are divided into several categories with the following distribution: journalism (73%), literary texts (14%), professional texts (12%), and other (1%). A subcorpus of 1.2 million tokens, manually annotated with morphological tags, has also been compiled. The SNK and its subcorpora can be searched with a CQL (Corpus Query Language) compatible query syntax (Christ and Schulze, 1994) through a web interface or via the Bonito client, cf. the Czech National Corpus.

4. The Croatian Nat...


Lexical expression of sadness: From the Homeric language to the modern Bulgarian

Mihaylova, Tarpomanova, Mircheva 2019, PIBL


Dativus Ethicus in the Balkan Languages

Tarpomanova


Found in translation: encoding the source of fear by cases and prepositions in the Balkan languages

Tarpomanova, Mihaylova 2021


