Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.339
Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages

Abstract: The lack of annotated data is a major obstacle to building reliable NLP systems for most of the world's languages, but this problem can be alleviated by automatic data generation. In this paper, we present a new data augmentation method for artificially creating new dependency-annotated sentences. The main idea is to swap subtrees between annotated sentences while enforcing strong constraints on those trees to ensure maximal grammaticality of the new sentences. We also propose a method to perform low-resource expe…
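The swapping operation the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` representation with ordered left/right dependents and the compatibility test (same dependency relation and same UPOS tag on the subtree root) are simplifying assumptions, and the paper enforces stronger constraints to preserve grammaticality.

```python
# Illustrative sketch of subtree swapping for dependency-annotated data
# augmentation. The compatibility test (same deprel and UPOS on the
# subtree root) is an assumption; the paper's constraints are stricter.
import copy
from dataclasses import dataclass, field

@dataclass
class Node:
    form: str
    upos: str     # universal POS tag of this token
    deprel: str   # dependency relation to the head
    left: list = field(default_factory=list)   # dependents before the head
    right: list = field(default_factory=list)  # dependents after the head

def linearize(node):
    """Recover the word order from the ordered dependency tree."""
    words = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.form)
    for child in node.right:
        words.extend(linearize(child))
    return words

def subtrees(node):
    """Yield every subtree root, including the node itself."""
    yield node
    for child in node.left + node.right:
        yield from subtrees(child)

def graft(node, target, replacement):
    """Replace the child `target` with `replacement` anywhere in the tree."""
    node.left = [replacement if c is target else c for c in node.left]
    node.right = [replacement if c is target else c for c in node.right]
    for child in node.left + node.right:
        if child is not replacement:
            graft(child, target, replacement)

def swap_augment(tree_a, tree_b):
    """Yield copies of tree_a in which one subtree is replaced by a
    compatible subtree from tree_b (one swap per generated tree)."""
    for sub_b in subtrees(tree_b):
        new_tree = copy.deepcopy(tree_a)
        for sub_a in list(subtrees(new_tree)):
            if sub_a is new_tree:
                continue  # never replace the whole sentence
            if (sub_a.deprel, sub_a.upos) == (sub_b.deprel, sub_b.upos):
                graft(new_tree, sub_a, copy.deepcopy(sub_b))
                yield new_tree
                break

# Example: swap material between "the cat sleeps" and "a dog barks"
cat = Node("cat", "NOUN", "nsubj", left=[Node("the", "DET", "det")])
tree_a = Node("sleeps", "VERB", "root", left=[cat])
dog = Node("dog", "NOUN", "nsubj", left=[Node("a", "DET", "det")])
tree_b = Node("barks", "VERB", "root", left=[dog])

augmented = [" ".join(linearize(t)) for t in swap_augment(tree_a, tree_b)]
# yields "a dog sleeps" (nsubj swap) and "a cat sleeps" (det swap)
```

Because the swap grafts a whole subtree, the dependency annotation of the generated sentence comes for free, which is what makes this attractive for low-resource treebank augmentation.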

Cited by 15 publications (14 citation statements)
References 9 publications
“…To go beyond the token level and add more diversity to the augmented sentences, data augmentation can also be performed on sentence parts. Operations that (depending on the task) do not change the label include manipulation of parts of the dependency tree (Şahin and Steedman, 2018; Vania et al., 2019; Dehouck and Gómez-Rodríguez, 2020), simplification of sentences by removal of sentence parts (Şahin and Steedman, 2018) and inversion of the subject-object relation (Min et al., 2020). For whole sentences, paraphrasing through backtranslation can be used.…”
Section: Data Augmentation (citation type: mentioning; confidence: 99%)
“…Augmented data For the experiment using augmented data we use a subset of the smallest treebanks, namely Kazakh, Kurmanji, and Upper Sorbian. We then generate data using the subtree swapping data augmentation technique of Dehouck and Gómez-Rodríguez (2020). We generate 10, 25, and 50 trees for each and then split them 80/20.…”
Section: Low Resource Data (citation type: mentioning; confidence: 99%)
“…During this second swap, we do not allow the previously swapped subtree to be altered again so as to avoid redundancy. For a more detailed description of this process see Dehouck and Gómez-Rodríguez (2020). We create all possible trees generated from the three original trees given the constraints described above, repeat this for each triplet of trees, and finally take a sample from this set of augmented data.…”
Section: Subtree Swapping (citation type: mentioning; confidence: 99%)
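The enumerate-then-sample loop described in the excerpt above can be sketched as follows. This is a hypothetical outline only: `single_swaps` is a stand-in for the constrained swap operation (trees are plain strings here, so only the loop structure is real), and the triplet iteration and final sampling follow the excerpt's description rather than the authors' actual code.

```python
# Sketch of the triplet-based generation loop from the excerpt above.
# `single_swaps` is a hypothetical placeholder for the constrained
# subtree-swap operation; real dependency trees would replace the strings.
import itertools
import random

def single_swaps(tree_a, tree_b):
    """Stand-in: yield variants of tree_a with material taken from tree_b."""
    yield f"{tree_a}<-{tree_b}"

def augment_from_triplets(trees, sample_size, seed=0):
    """Enumerate every ordered swap within each triplet of original trees,
    then draw a sample from the pooled augmented trees."""
    pool = []
    for triplet in itertools.combinations(trees, 3):
        for a, b in itertools.permutations(triplet, 2):
            pool.extend(single_swaps(a, b))
    return random.Random(seed).sample(pool, min(sample_size, len(pool)))
```

Pooling all variants before sampling matches the excerpt's "create all possible trees … and finally take a sample" description; capping with `min(...)` guards against requesting more trees than the constraints allow.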
“…Data augmentation has been found effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), syntactic and semantic parsing (Jia and Liang, 2016; Shi et al., 2020; Dehouck and Gómez-Rodríguez, 2020), semantic role labeling (Fürstenau and Lapata, 2009), and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating examples based on existing ones, and can make the learned models more robust against noise (Xie et al., 2020).…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…SUB2 also falls into this category. The idea of same-label substructure substitution has been used to improve performance on structured prediction tasks such as semantic parsing (Jia and Liang, 2016), constituency parsing (Shi et al., 2020), dependency parsing (Dehouck and Gómez-Rodríguez, 2020), named entity recognition (Dai and Adel, 2020), meaning representation-based text generation (Kedzie and McKeown, 2020), and compositional generalization (Andreas, 2020). To the best of our knowledge, however, SUB2 has not been systematically studied as a general data augmentation method for NLP tasks.…”
Section: Introduction (citation type: mentioning; confidence: 99%)