Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.339
Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages

Abstract: The lack of annotated data is a major obstacle to building reliable NLP systems for most of the world's languages, but this problem can be alleviated by automatic data generation. In this paper, we present a new data augmentation method for artificially creating new dependency-annotated sentences. The main idea is to swap subtrees between annotated sentences while enforcing strong constraints on those trees to ensure maximal grammaticality of the new sentences. We also propose a method to perform low-resource expe…
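The swapping operation the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` representation with ordered left/right dependents and the compatibility test (same dependency relation and same UPOS tag on the subtree root) are simplifying assumptions, and the paper enforces stronger constraints to preserve grammaticality.

```python
# Illustrative sketch of subtree swapping for dependency-annotated data
# augmentation. The compatibility test (same deprel and UPOS on the
# subtree root) is an assumption; the paper's constraints are stricter.
import copy
from dataclasses import dataclass, field

@dataclass
class Node:
    form: str
    upos: str     # universal POS tag of this token
    deprel: str   # dependency relation to the head
    left: list = field(default_factory=list)   # dependents before the head
    right: list = field(default_factory=list)  # dependents after the head

def linearize(node):
    """Recover the word order from the ordered dependency tree."""
    words = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.form)
    for child in node.right:
        words.extend(linearize(child))
    return words

def subtrees(node):
    """Yield every subtree root, including the node itself."""
    yield node
    for child in node.left + node.right:
        yield from subtrees(child)

def graft(node, target, replacement):
    """Replace the child `target` with `replacement` anywhere in the tree."""
    node.left = [replacement if c is target else c for c in node.left]
    node.right = [replacement if c is target else c for c in node.right]
    for child in node.left + node.right:
        if child is not replacement:
            graft(child, target, replacement)

def swap_augment(tree_a, tree_b):
    """Yield copies of tree_a in which one subtree is replaced by a
    compatible subtree from tree_b (one swap per generated tree)."""
    for sub_b in subtrees(tree_b):
        new_tree = copy.deepcopy(tree_a)
        for sub_a in list(subtrees(new_tree)):
            if sub_a is new_tree:
                continue  # never replace the whole sentence
            if (sub_a.deprel, sub_a.upos) == (sub_b.deprel, sub_b.upos):
                graft(new_tree, sub_a, copy.deepcopy(sub_b))
                yield new_tree
                break

# Example: swap material between "the cat sleeps" and "a dog barks"
cat = Node("cat", "NOUN", "nsubj", left=[Node("the", "DET", "det")])
tree_a = Node("sleeps", "VERB", "root", left=[cat])
dog = Node("dog", "NOUN", "nsubj", left=[Node("a", "DET", "det")])
tree_b = Node("barks", "VERB", "root", left=[dog])

augmented = [" ".join(linearize(t)) for t in swap_augment(tree_a, tree_b)]
# yields "a dog sleeps" (nsubj swap) and "a cat sleeps" (det swap)
```

Because the swap grafts a whole subtree, the dependency annotation of the generated sentence comes for free, which is what makes this attractive for low-resource treebank augmentation.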

Cited by 15 publications (14 citation statements)
References 9 publications
“…To go beyond the token level and add more diversity to the augmented sentences, data augmentation can also be performed on sentence parts. Operations that (depending on the task) do not change the label include manipulation of parts of the dependency tree (Şahin and Steedman, 2018; Vania et al., 2019; Dehouck and Gómez-Rodríguez, 2020), simplification of sentences by removal of sentence parts (Şahin and Steedman, 2018) and inversion of the subject-object relation (Min et al., 2020). For whole sentences, paraphrasing through backtranslation can be used.…”
Section: Data Augmentation (citation type: mentioning; confidence: 99%)
“…Augmented data For the experiment using augmented data we use a subset of the smallest treebanks, namely Kazakh, Kurmanji, and Upper Sorbian. We then generate data using the subtree swapping data augmentation technique of Dehouck and Gómez-Rodríguez (2020). We generate 10, 25, and 50 trees for each and then split them 80/20.…”
Section: Low Resource Data (citation type: mentioning; confidence: 99%)
“…During this second swap, we do not allow the previously swapped subtree to be altered again so as to avoid redundancy. For a more detailed description of this process see Dehouck and Gómez-Rodríguez (2020). We create all possible trees generated from the three original trees given the constraints described above, repeat this for each triplet of trees, and finally take a sample from this set of augmented data.…”
Section: Subtree Swapping (citation type: mentioning; confidence: 99%)
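The enumerate-then-sample loop described in the excerpt above can be sketched as follows. This is a hypothetical outline only: `single_swaps` is a stand-in for the constrained swap operation (trees are plain strings here, so only the loop structure is real), and the triplet iteration and final sampling follow the excerpt's description rather than the authors' actual code.

```python
# Sketch of the triplet-based generation loop from the excerpt above.
# `single_swaps` is a hypothetical placeholder for the constrained
# subtree-swap operation; real dependency trees would replace the strings.
import itertools
import random

def single_swaps(tree_a, tree_b):
    """Stand-in: yield variants of tree_a with material taken from tree_b."""
    yield f"{tree_a}<-{tree_b}"

def augment_from_triplets(trees, sample_size, seed=0):
    """Enumerate every ordered swap within each triplet of original trees,
    then draw a sample from the pooled augmented trees."""
    pool = []
    for triplet in itertools.combinations(trees, 3):
        for a, b in itertools.permutations(triplet, 2):
            pool.extend(single_swaps(a, b))
    return random.Random(seed).sample(pool, min(sample_size, len(pool)))
```

Pooling all variants before sampling matches the excerpt's "create all possible trees … and finally take a sample" description; capping with `min(...)` guards against requesting more trees than the constraints allow.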
“…Data augmentation has been found effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), syntactic and semantic parsing (Jia and Liang, 2016; Shi et al., 2020; Dehouck and Gómez-Rodríguez, 2020), semantic role labeling (Fürstenau and Lapata, 2009), and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating examples based on existing ones, and can make the learned models more robust against noise (Xie et al., 2020).…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…SUB2 also falls into this category. The idea of same-label substructure substitution has been used to improve performance on structured prediction tasks such as semantic parsing (Jia and Liang, 2016), constituency parsing (Shi et al., 2020), dependency parsing (Dehouck and Gómez-Rodríguez, 2020), named entity recognition (Dai and Adel, 2020), meaning representation-based text generation (Kedzie and McKeown, 2020), and compositional generalization (Andreas, 2020). To the best of our knowledge, however, SUB2 has not been systematically studied as a general data augmentation method for NLP tasks.…”
Section: Introduction (citation type: mentioning; confidence: 99%)