2019
DOI: 10.1017/s1351324919000354

Finding next of kin: Cross-lingual embedding spaces for related languages

Abstract: Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments …
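
A minimal sketch of the weighted Levenshtein distance the abstract refers to: substitutions between characters that correspond regularly across the related languages are made cheaper than arbitrary edits. The sub_cost table below is a hypothetical illustration; the paper learns its weights from data rather than setting them by hand.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance with per-character-pair substitution costs.

    sub_cost maps (char_a, char_b) -> cost in [0, 1]; pairs not listed
    fall back to a cost of 1.0, which recovers plain Levenshtein.
    """
    sub_cost = sub_cost or {}
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, len(b) + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            dp[i][j] = min(dp[i - 1][j] + indel_cost,   # deletion
                           dp[i][j - 1] + indel_cost,   # insertion
                           dp[i - 1][j - 1] + sub)      # substitution
    return dp[len(a)][len(b)]

# A cheap і/е correspondence makes the Ukrainian/Russian cognates look close.
print(weighted_levenshtein("хліб", "хлеб", sub_cost={("і", "е"): 0.1}))  # 0.1
```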

Cited by 8 publications (8 citation statements): 0 supporting, 8 mentioning, 0 contrasting.
References 24 publications. Citing publications span 2019-2023.

Citation statements, ordered by relevance:
“…It will also lead to integrating training of the transformer with the semantic classification task on a deeper level, which can be accomplished by customizing its pre-training (weight-initialization) algorithm to include word semantic information available from existing taxonomies, which we are planning to undertake in future, along with experimenting with cross-lingual knowledge transfer (e.g. [22]), when a model uses English data to predict semantic relations in other, less resourced, languages.…”
Section: Discussion (citation type: mentioning; confidence: 99%)

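The cross-lingual transfer the authors plan amounts to fitting a classifier on English examples represented in a shared embedding space and applying it unchanged to a related language. A minimal sketch of that idea with scikit-learn; the shared_emb lookup and the toy hypernymy pairs are hypothetical stand-ins for real cross-lingual embeddings and taxonomy-derived training data, not anything from the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical shared cross-lingual embedding lookup (word -> vector).
rng = np.random.default_rng(0)
shared_emb = {w: rng.normal(size=50) for w in
              ["dog", "animal", "car", "vehicle", "пес", "тварина"]}

def pair_features(w1, w2):
    # Represent a word pair by concatenating the two word embeddings.
    return np.concatenate([shared_emb[w1], shared_emb[w2]])

# Train on English pairs labelled for a semantic relation (hypernymy here).
X_en = np.stack([pair_features("dog", "animal"), pair_features("car", "vehicle"),
                 pair_features("animal", "dog"), pair_features("vehicle", "car")])
y_en = [1, 1, 0, 0]
clf = LogisticRegression().fit(X_en, y_en)

# Zero-shot prediction for a Ukrainian pair living in the same space.
print(clf.predict([pair_features("пес", "тварина")]))
```
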
“…The pretrained Rusyn embeddings still need to be projected into the Panslav5 embedding space. To this end, we use a simpler procedure than Sharoff (2018) and take the intersection of word forms of the two files as the seed lexicon (i.e., we assume that words spelled the same should occur in the same area of the vector space). Almost half of the forms present in the Rusyn Fasttext embeddings are covered by this lexicon.…”
Section: Pretrained Rusyn Embeddings (citation type: mentioning; confidence: 99%)

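The projection step described above can be reproduced in a few lines: the seed lexicon is simply the set of word forms present in both embedding files, and a linear map fitted on those shared forms carries the remaining Rusyn vectors into the Panslav5 space. The sketch below uses gensim and an orthogonal Procrustes solution; the file names are placeholders, and the orthogonality constraint is a common choice rather than necessarily the exact procedure of the cited work.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder paths for the two pretrained embedding files.
src = KeyedVectors.load_word2vec_format("rusyn.vec")     # vectors to be projected
tgt = KeyedVectors.load_word2vec_format("panslav5.vec")  # shared target space

# Seed lexicon: word forms spelled identically in both vocabularies.
seed = [w for w in src.index_to_key if w in tgt.key_to_index]

X = np.stack([src[w] for w in seed])  # source-side seed vectors
Y = np.stack([tgt[w] for w in seed])  # target-side seed vectors

# Orthogonal Procrustes: W = argmin ||XW - Y||_F over orthogonal W.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Every Rusyn vector can now be mapped into the shared space.
projected = src.vectors @ W
```
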
“…Most approaches start with a seed dictionary containing word pairs of both languages, but some variants merely rely on identical tokens such as punctuation signs, numerals, or named entities (Artetxe, Labaka and Agirre 2017). More relevant to our setting is the work by Sharoff (2018) on related languages: he starts by automatically extracting a seed dictionary from Wikipedia page titles and uses these entries to determine edit distance weights for the language pair in question, assuming that most word pairs are cognates. Weighted Levenshtein distance is then included as a factor in the word embedding projection algorithm.…”
Section: Previous Work (citation type: mentioning; confidence: 99%)

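The weight-estimation step attributed to Sharoff (2018) can be approximated as follows: align each seed-dictionary pair with ordinary string matching, count which character substitutions the alignments produce, and turn frequent (i.e. regular) correspondences into cheap edits. The counting scheme and the frequency-to-cost mapping below are deliberate simplifications, and the Ukrainian-Russian pairs are invented examples; the resulting table can be fed straight into a weighted Levenshtein distance such as the one sketched after the abstract.

```python
from collections import Counter
import difflib

def substitution_counts(pairs):
    """Count character substitutions suggested by aligning cognate pairs."""
    counts = Counter()
    for a, b in pairs:
        matcher = difflib.SequenceMatcher(None, a, b)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only same-length replacements give clean one-to-one substitutions.
            if op == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(a[i1:i2], b[j1:j2]))
    return counts

def costs_from_counts(counts):
    """Map frequent correspondences to low substitution costs."""
    total = sum(counts.values())
    return {pair: 1.0 - n / total for pair, n in counts.items()}

# Invented seed dictionary of Ukrainian-Russian cognate pairs.
seed_pairs = [("хліб", "хлеб"), ("літо", "лето"), ("сіль", "соль")]
sub_cost = costs_from_counts(substitution_counts(seed_pairs))
print(sub_cost)  # ('і', 'е') ends up cheaper than the rarer ('і', 'о')
```
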
Two more citation statements are not shown here.