Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects 2014
DOI: 10.3115/v1/w14-5309

Part-of-Speech Tag Disambiguation by Cross-Linguistic Majority Vote

Abstract: In this paper, we present an approach to developing resources for a low-resource language, taking advantage of the fact that it is closely related to languages with more resources. In particular, we test our approach on Macedonian, which lacks tools for natural language processing as well as data in order to build such tools. We improve the Macedonian training set for supervised part-of-speech tagging by transferring available manual annotations from a number of similar languages. Our approach is based on multi…

Cited by 7 publications (13 citation statements)
References 7 publications

“…/ Suppose that we choose to wear ourselves out faster." confirm the assumption by Aepli et al [36] that the word classes usually match in translation. Additionally, in most of these cases, POS tag can be easily determined by the nearest previous or following word: брзо потреперување\N, брзо помина\V, брзо беше\V, ќе решиме\V побрзо.…”
Section: Ambiguities In the Corpus; citation type: supporting
confidence: 87%
“…Recently, another independent attempt to automatically POS tag the same corpus was done [36]. Tagging was based on the POS tagged versions of Bulgarian, Czech, Slovene, English and Serbian language, which were pair-aligned to Macedonian translation.…”
Section: Fig 1 the First Tool For Manual Annotation Of Orwell's Corpus; citation type: mentioning
confidence: 99%
“…Follow-up work has focused on the inclusion of several source languages (Fossum and Abney, 2005), more accurate projection algorithms (Das and Petrov, 2011; Duong et al, 2013), the integration of external lexicon sources (Li et al, 2012; Täckström et al, 2013), the extension from part-of-speech tagging to full morphological tagging (Buys and Botha, 2016), and the investigation of truly low-resource settings by resorting to Bible translations (Agić et al, 2015). A related approach (Aepli et al, 2014) uses majority voting to disambiguate tags proposed by several source languages. However, these projection approaches are not adapted to our setting as no parallel corpora, not even the Bible, are electronically available for Rusyn.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…There are essentially two ways of combining taggers: using the five source language taggers and choosing the majority vote, or using a single tagger trained on merged data from the five source corpora. Aepli et al (2014) develop a tagger for Macedonian by transferring morphosyntactic annotations from multiple source languages by word alignment, choosing one annotation by majority vote, and training a new tagger on the annotated corpus. We follow a similar method.…”
Section: Single-language Taggers; citation type: mentioning
confidence: 99%
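
The voting step described in the statement above can be illustrated with a minimal sketch. It assumes that word alignment has already produced, for each Macedonian token, one projected tag per source language (or nothing where no alignment exists). The function names, the two-letter language codes, the tag labels, and the tie-breaking rule (first source language in the given order wins) are all hypothetical choices for illustration; the cited paper may resolve ties and missing alignments differently.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_vote(projected: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the tag proposed by the most source languages.

    `projected` maps a source-language code to the tag transferred through
    word alignment, or None when the token has no aligned counterpart.
    Tie-breaking is an assumption of this sketch: among equally frequent
    tags, the one proposed by the earliest language in the dict wins.
    """
    votes = Counter(tag for tag in projected.values() if tag is not None)
    if not votes:
        return None  # no source language proposed a tag for this token
    best = max(votes.values())
    for tag in projected.values():
        if tag is not None and votes[tag] == best:
            return tag
    return None  # unreachable, kept for type checkers


def tag_sentence(
    tokens: List[str], projections: List[Dict[str, Optional[str]]]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each target-language token with its majority-voted tag."""
    return [(tok, majority_vote(proj)) for tok, proj in zip(tokens, projections)]


if __name__ == "__main__":
    # Hypothetical projections for a single token from five source languages.
    example = {"bg": "V", "cs": "V", "sl": "N", "en": "V", "sr": None}
    print(majority_vote(example))
```

The resulting token and tag pairs would then serve as training data for a new tagger, as the statement describes; that training step is outside the scope of this sketch.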
“…The Bulgarian literary language differs from other Slavic languages by the almost complete loss of grammatical case; the creation of definite article of nouns (appearing in the form of a suffix, added to the stem); analytical comparative and superlative (by word-particles); and a complex tense system where the infinitive is completely lost (Aepli, von Waldenfels, & Samardzic, 2014; Kushniarevich et al, 2015; Raykov, 2005).…”
Section: Introduction; citation type: mentioning
confidence: 99%