Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.377

CaMEL: Case Marker Extraction without Labels

Abstract: We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarit…

Cited by 2 publications (2 citation statements)
References 20 publications

“…The authors focus on an English-Norwegian lemmatized parallel corpus; in contrast, we investigate 1,335 languages, most of which are low-resource and for many of which lemmatization is not available. In addition, this paper is related to recent work that uses PBC to investigate the typology of tense (Asgari and Schütze, 2017), train massive multilingual embeddings (Dufter et al., 2018), extract multilingual named entities (Severini et al., 2022), find case markers in a multilingual setting (Weissweiler et al., 2022) and learn language embeddings containing typological features (Östling and Kurfalı, 2023).…”
Section: Related Work (mentioning)
confidence: 99%
“…We use the New World edition for each language, if available, and the edition with the largest number of verses otherwise. Different from previous work (Asgari and Schütze, 2017; Dufter et al., 2018; Weissweiler et al., 2022) which only used verses that are available in all languages, we use all parallel verses between English and any other target languages. This means the number of parallel verses between English and other languages can be different.…”
Section: A Details Of Data A1 Parallel Bible Corpus (mentioning)
confidence: 99%
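
The verse-selection contrast in the statement above can be illustrated with a minimal sketch. It assumes a toy corpus represented as a mapping from language code to a set of verse IDs. The function names, the language codes `lang_a` and `lang_b`, and the verse IDs are all hypothetical and come from neither paper. The first function keeps only verses present in every language, which is the selection the statement attributes to previous work. The second keeps, per language, every verse shared with the English edition.

```python
from typing import Dict, Set


def verses_shared_by_all(corpus: Dict[str, Set[str]]) -> Set[str]:
    """Verse IDs present in every language of the corpus."""
    editions = list(corpus.values())
    return set.intersection(*editions) if editions else set()


def verses_parallel_to_pivot(
    corpus: Dict[str, Set[str]], pivot: str = "eng"
) -> Dict[str, Set[str]]:
    """For each non-pivot language, the verse IDs it shares with the pivot."""
    pivot_verses = corpus[pivot]
    return {
        lang: verses & pivot_verses
        for lang, verses in corpus.items()
        if lang != pivot
    }


# Toy data (hypothetical): three editions with different verse coverage.
toy_corpus = {
    "eng": {"v1", "v2", "v3", "v4"},
    "lang_a": {"v1", "v2", "v3"},
    "lang_b": {"v1", "v4"},
}

shared = verses_shared_by_all(toy_corpus)
pairwise = verses_parallel_to_pivot(toy_corpus)
```

In this toy example the all-language intersection keeps a single verse, while the pairwise selection keeps three verses for `lang_a` and two for `lang_b`, so the number of parallel verses differs by language, as the statement notes.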