2019
DOI: 10.1007/s10579-019-09480-6

NorthEuraLex: a wide-coverage lexical database of Northern Eurasia

Abstract: This article describes the first release version of a new lexicostatistical database of Northern Eurasia, which includes Europe as the most well-researched linguistic area. Unlike in other areas of the world, where databases are restricted to covering a small number of concepts as far as possible based on often sparse documentation, good lexical resources providing wide coverage of the lexicon are available even for many smaller languages in our target area. This makes it possible to attain near-completeness f…

Cited by 42 publications (39 citation statements)
References: 11 publications
“…We obtained word-forms for 1,010 concepts in 41 languages using the NorthEuraLex (NEL) dataset [53]. NEL is compiled from dictionaries and other linguistic resources available for individual languages in Northern Eurasia. Translation pairs can be derived from NEL because it provides word forms for the same set of concepts in multiple languages.…”
mentioning
confidence: 99%
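
The last point of the statement above, that translation pairs follow from having word forms for a shared concept set, can be illustrated with a minimal sketch. The toy rows, the `(language, concept, form)` layout, and the function name `translation_pairs` are assumptions made for illustration; they are not the NorthEuraLex file format or the citing authors' code.

```python
from collections import defaultdict
from itertools import product

# Toy rows in a hypothetical (language, concept, word form) layout.
ROWS = [
    ("fin", "WATER", "vesi"),
    ("deu", "WATER", "Wasser"),
    ("fin", "FIRE", "tuli"),
    ("deu", "FIRE", "Feuer"),
    ("deu", "FIRE", "Brand"),
]


def translation_pairs(rows, source, target):
    """Pair each source-language form with each target-language form of the same concept."""
    # concept -> language -> list of word forms
    forms = defaultdict(lambda: defaultdict(list))
    for language, concept, form in rows:
        forms[concept][language].append(form)

    pairs = []
    for concept in sorted(forms):
        by_language = forms[concept]
        # A concept with no form in either language contributes no pairs.
        pairs.extend(product(by_language.get(source, []), by_language.get(target, [])))
    return pairs


pairs = translation_pairs(ROWS, "fin", "deu")
for source_form, target_form in pairs:
    print(source_form, target_form)
```

A concept with several forms in one language yields one pair per combination, which may or may not be what a given study wants; filtering to a single form per concept would be a separate design decision.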
“…It can further be used to modify the original transcription by replacing tokenized units with new values [note 3]. How an orthography profile can be applied is illustrated in more detail in Figure 3.…”
Section: Workflow 321 From Raw Data To Tokenized Data
mentioning
confidence: 99%
“…
2. The permanent link of the Code Ocean Capsule is: https://codeocean.com/capsule/8178287/tree/v2.
3. Orthography profiles proceed in a greedy fashion, converting grapheme sequences in the reverse order of their length, thus starting from the longest grapheme sequence.
4. Linguistic terms which are further explained in our glossary, submitted as part of the supplementary information, are marked in bold font the first time they are introduced.
…”
Section: Notes
mentioning
confidence: 99%
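
The greedy behaviour described in the note on orthography profiles can be approximated with a short sketch. It assumes a left-to-right scan that tries the longest grapheme first at each position, which is one common reading of "starting from the longest grapheme sequence"; the `PROFILE` mapping, the `tokenize` function, and the `<?>` placeholder are hypothetical and are not taken from the cited workflow or from any specific library.

```python
# Hypothetical profile: grapheme sequence -> replacement value.
PROFILE = {
    "sch": "ʃ",
    "ch": "x",
    "s": "s",
    "c": "k",
    "h": "h",
    "a": "a",
    "u": "u",
    "l": "l",
    "e": "ə",
}


def tokenize(word, profile, unknown="<?>"):
    """Segment `word` greedily, preferring the longest matching grapheme at each position."""
    # Longest graphemes first, so "sch" is tried before "s"; empty keys are skipped.
    graphemes = sorted((g for g in profile if g), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for grapheme in graphemes:
            if word.startswith(grapheme, i):
                tokens.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # No grapheme matched: emit a placeholder and advance one character.
            tokens.append(unknown)
            i += 1
    return tokens


print(tokenize("schule", PROFILE))
```

Because longer sequences win, "sch" is converted as one unit instead of being split into "s", "c", "h". Characters missing from the profile are flagged, not silently dropped, which makes gaps in a profile easy to spot.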
“…Reproducibility is ensured by fully automatizing the process from the level of word forms to the final combined score. We test our method on the NorthEuraLex database [note 9] (Dellert et al., 2017) which consists of IPA representations of the words for 1016 concepts in 107 languages [note 10] covering 21 language families. The IPA representations are derived in a semi-automated fashion from standard orthographies, and are therefore neither fully phonemic nor fully phonetic representations with some inaccuracies.…”
Section: Introduction
mentioning
confidence: 99%