Parsing Romanian Specialized Dictionaries Structured in Nests

Mărănduc, Cătălina; Mititelu, Cătălin; Simionescu, Radu

doi:10.1145/3078081.3078088

Cited by 1 publication

(2 citation statements)

References 1 publication


“…Using the program DEPAR (Dictionary Parser) (Mȃrȃnduc et al, 2017) we extracted a list of 5,000 stable unanalyzable Multi Word Expressions from a dictionary (Mȃrȃnduc, 2010), and also 98,000 lexical or spelling variants extracted from the Thesaurus Dictionary (http://edtlr.info.uaic.ro/). The variants are generally old or regional, consequently their introduction in the POS-tagger lexicon is useful for the processing both of the Old and of the regional variants of the Daco-Romanian dialect (the name used by the dialectologists for the language spoken in Romania), but they are not useful for the South Danube dialects, that have special dictionaries.…”

Section: The Lexicon For the Old Ro Pos-tagger (mentioning)

confidence: 99%

Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language

Bobicev, Mărănduc, Perez

2017

Proceedings of the Workshop on Language Technology for Digital Humanities in Central and (South-)Eastern Europe

Self Cite

Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those existing or in progress only deal with the Contemporary Romanian standard. However, the necessity to study the dynamics of natural languages gave rise to balanced corpora, containing non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian to build a big balanced corpus. We want to preserve in annotated form as many early stages of language as possible. We have already built a corpus in Old Romanian. We also intend to include the South-Danube dialects, remote to the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as Aromanian, Meglenoromanian and Istroromanian dialects, and calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, as well as the mutual understanding between the speakers, is the correct criterion for the classification of idioms as different languages, or as dialects, or as regional variants close to the standard.

“…We can extract lemmas from the dictionaries using the program DEPAR (Dictionary Parser) (Mȃrȃnduc et al, 2017), but the inflexion must be manually introduced in the POS-tagger lexicon. For this purpose, we will have to associate specialists in the South-Danube dialects in our project.…”

Section: Building Pos-taggers For Processing the South Danube Dialects (mentioning)

confidence: 99%