2016
DOI: 10.1162/tacl_a_00113

The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages

Abstract: We release Galactic Dependencies 1.0, a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This new resource aims to provide training and development data for NLP methods that aim to adapt to unfamiliar languages.

Cited by 51 publications (65 citation statements) | References 19 publications

“…The world presumably does not offer enough natural languages (particularly with machine-readable corpora) to train a good classifier to detect, say, Object-Verb-Subject (OVS) languages, especially given the class imbalance problem that OVS languages are empirically rare, and the non-IID problem that the available OVS languages may be evolutionarily related. We mitigate this issue by training on the Galactic Dependencies treebanks (Wang and Eisner, 2016), a collection of more than 50,000 human-like synthetic languages. The treebank of each synthetic language is generated by stochastically permuting the subtrees in a given real treebank to match the word order of other real languages.…”
Section: Approach
confidence: 99%
“…GD: Galactic Dependencies version 1.0 (Wang and Eisner, 2016): a collection of projective dependency treebanks for 53,428 synthetic languages, using the same format as UD. The treebank of each synthetic language is generated from the UD treebank of some real language by stochastically permuting the dependents of all nouns and/or verbs to match the dependent orders of other real UD languages.…”
Section: Data
confidence: 99%
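
The quoted description is enough to sketch the core permutation step. The toy Python sketch below is illustrative only, not the authors' released code: it resamples, for each dependent of a NOUN or VERB head, which side of the head it lands on, using a hypothetical per-relation order model (`ORDER_MODEL`) standing in for the richer ordering distributions that Galactic Dependencies fits on real UD treebanks.

```python
# Toy sketch of the Galactic Dependencies permutation step (illustrative,
# not the released code). Each token is (id, form, upos, head_id, deprel);
# head_id 0 marks the root. ORDER_MODEL is a hypothetical stand-in for the
# ordering distributions that GD estimates from real UD treebanks: here it
# just gives P(dependent precedes its head) per dependency relation.
import random

TREE = [
    (1, "the",    "DET",  2, "det"),
    (2, "dog",    "NOUN", 3, "nsubj"),
    (3, "chased", "VERB", 0, "root"),
    (4, "a",      "DET",  5, "det"),
    (5, "cat",    "NOUN", 3, "obj"),
]

# A verb-final "donor" language might prefer dependents before the head.
ORDER_MODEL = {"nsubj": 0.95, "obj": 0.90, "det": 0.95}

def permute(tree, order_model, rng=random):
    children = {}
    for tok in tree:
        children.setdefault(tok[3], []).append(tok)

    def linearize(head):
        deps = sorted(children.get(head[0], []), key=lambda t: t[0])
        before, after = [], []
        for d in deps:
            if head[2] in ("NOUN", "VERB"):
                # Resample this dependent's side from the donor model.
                goes_before = rng.random() < order_model.get(d[4], 0.5)
            else:
                # Dependents of other heads keep their original side.
                goes_before = d[0] < head[0]
            (before if goes_before else after).append(d)
        out = []
        for d in before:
            out.extend(linearize(d))
        out.append(head)
        for d in after:
            out.extend(linearize(d))
        return out

    root = children[0][0]  # the unique token whose head is 0
    return [tok[1] for tok in linearize(root)]

random.seed(0)
print(" ".join(permute(TREE, ORDER_MODEL)))  # e.g. "the dog a cat chased"
```

Because each dependent's side is resampled independently here, this is a simplification: the released treebanks instead sample full dependent orderings from models fit to the donor language, separately for noun and verb heads, so the synthetic tree keeps the real tree's structure and words while mimicking another language's word order.
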
“…Alternatively, the lack of target annotated data can be alleviated by synthesizing new examples, thus boosting the variety and amount of the source data. For instance, the Galactic Dependencies Treebanks stem from real trees whose nodes have been permuted probabilistically according to the word-order rules for nouns and verbs in other languages (Wang and Eisner 2016). Synthetic trees improve the performance of model transfer for parsing when the source is chosen in a supervised way (performance on target development data) and in an unsupervised way (coverage of target PoS sequences).…”
Section: Data Selection, Synthesis and Preprocessing
confidence: 99%
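
The unsupervised criterion mentioned in this quote (coverage of target PoS sequences) is easy to illustrate. The sketch below is an assumption-laden toy, not the cited papers' exact procedure: it scores each candidate source treebank by the fraction of the target corpus's PoS trigram tokens that also occur in the source, then picks the best-covering candidate. The names `pos_ngrams` and `coverage` and the toy corpora are all hypothetical.

```python
# Hypothetical illustration of unsupervised source selection by PoS-sequence
# coverage (not the cited papers' exact method). Corpora are lists of
# sentences, each a list of PoS tags; no syntactic annotation is needed.
from collections import Counter

def pos_ngrams(sentences, n=3):
    counts = Counter()
    for tags in sentences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

def coverage(source, target, n=3):
    # Fraction of target PoS n-gram tokens also attested in the source.
    src = pos_ngrams(source, n)
    tgt = pos_ngrams(target, n)
    covered = sum(c for gram, c in tgt.items() if gram in src)
    return covered / max(1, sum(tgt.values()))

# Toy target: a mostly verb-final language.
target = [["NOUN", "NOUN", "VERB"], ["NOUN", "NOUN", "VERB"],
          ["NOUN", "VERB", "NOUN"]]
candidates = {  # synthetic "galactic" treebanks as candidate sources
    "galactic_svo": [["NOUN", "VERB", "NOUN"]] * 5,
    "galactic_sov": [["NOUN", "NOUN", "VERB"]] * 5,
}
best = max(candidates, key=lambda name: coverage(candidates[name], target))
print(best)  # -> galactic_sov, whose word order covers the target best
```

In practice the candidate pool would be the tens of thousands of synthetic Galactic Dependencies treebanks, and the selected one would be used as training data for a parser applied to the target language.
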
“…Cross-lingual transfer can eliminate the need for the expensive and time-consuming task of treebank annotation for low-resource languages. Approaches include annotation projection using parallel data sets (Hwa et al., 2005; Ganchev et al., 2009), direct model transfer through learning of a delexicalized model from other treebanks (Zeman and Resnik, 2008; Täckström et al., 2013), treebank translation (Tiedemann et al., 2014), using synthetic treebanks (Tiedemann and Agić, 2016; Wang and Eisner, 2016), using cross-lingual word representations (Täckström et al., 2012; Guo et al., 2016; Rasooli and Collins, 2017), and using cross-lingual dictionaries (Durrett et al., 2012).…”
Section: Introduction
confidence: 99%