Development of corpora within the CLaRK system

Simov, Kiril; Simov, Alexander; Kouylekov, Milen; Ivanova, Krasimira; Grigorov, Ilko; Ganev, Hristo

doi:10.3115/1067737.1067795

Cited by 8 publications

(5 citation statements)

References 0 publications

“…Initially, CLaDA-BG-Dict was designed and implemented to support the verification and extension of BTB-WN. The motivation for this was that the existing version of BTB-WN was initiated in an XML format within the CLaRK System (Simov et al, 2001). The XML format used during the creation of earlier versions of BTB-WN was not a standard one.…”

Section: System Specifics and Functionalities (mentioning)

confidence: 99%

The CLaDA-BG Dictionary Creation System: Specifics and Perspectives

Angelov, Simov, Osenova et al. 2023

Linköping Electronic Conference Proceedings

The paper reports on the current status of a system for creating dictionaries within the CLaDA-BG infrastructure. The system is called CLaDA-BG-Dict. At the heart of the system lies the lexical thesaurus BTB-Wordnet, around which all other language resources for Bulgarian are organized. These are various types of dictionaries (morphological, explanatory, terminological, etc.), ontologies (such as DBpedia), and corpora (in-house and external). The specific features and functionalities of the system are discussed with respect to language resource integrity. The rationale behind the construction of such a system is also given, together with an outline of its utility for a number of NLP tasks and for various types of users.

“…Lemmatization and word sense disambiguation are performed by manually crafted rules, while part-of-speech tagging and morphological tagging are performed by tools based on support vector machines (SVMs). Different parts of the pipeline are developed as part of different systems, including the CLaRK system [20], Gtagger [6], and MaltParser [14].…”

Section: Related Work (mentioning)

confidence: 99%

An improved Bulgarian natural language processing pipeline

Berbatova, Ivanov 2023

Ann. Sofia Univ. Fac. Math. Informat.

In this paper, we present a language pipeline for processing Bulgarian language data. The pipeline consists of the following steps: tokenization, sentence splitting, part-of-speech tagging, dependency parsing, named entity recognition, lemmatization, and word sense disambiguation. The first two components are based on rules and lists of words specific to the Bulgarian language, while the rest of the components use machine learning algorithms trained on universal dependency data and pretrained word vectors. The pipeline is implemented in the Python library spaCy (https://spacy.io/) and achieves significant results on all the included subtasks. The pipeline is open source and is available on GitHub (https://github.com/melaniab/spacy-pipeline-bg/) for use by researchers and developers for a variety of natural language processing and text analysis tasks.

“…The corpus contains not only simple, but also complex sentences [55]. In [56] is described how the main functionalities of the CLaRK system for corpora development are exploited in the BulTreeBank project. The latter is an XML based software for corpora development first introduced a little earlier in 2001 [54].…”

Section: Text Corpora (mentioning)

confidence: 99%

Text Analytics in Bulgarian: An Overview and Future Directions

Hristova 2021

Cybernetics and Information Technologies

Text analytics is becoming an integral part of modern business and economic research and analysis. However, the extent to which its application is possible and accessible varies for different languages. The main goal of this paper is to outline fundamental research on text analytics applied on data in Bulgarian. A review of key research articles in two main directions is provided – development of language resources for Bulgarian and experimenting with Bulgarian text data in practical applications. By summarizing the results of a large literature review, we draw conclusions about the degree of development of the field, the availability of language resources for the Bulgarian language and the extent to which text analytics has been applied in practical problems. Future directions for research are outlined. To the best of the author’s knowledge, this is the first study providing a comprehensive overview of progress in the field of text analytics in Bulgarian.
