SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup

Zhang, Rongzhi; Yu, Yue; Zhang, Chao

doi:10.18653/v1/2020.emnlp-main.691

Cited by 46 publications

(20 citation statements)

References 39 publications

“…Despite their success in text classification and sequence-to-sequence tasks, they are seldom used for sequence tagging tasks. Zhang, Yu, and Zhang (2020) use the mixup technique (Zhang et al 2018) in scope of active learning, where they augment the queries at each iteration and later classify whether the augmented query is plausible, since the resulting queries might be noisy, and report improvements for the Named Entity Recognition (NER) and event detection task. However building a robust discriminator for more challenging tasks as dependency parsing (DP) and semantic role labeling (SRL) is a challenge on its own.…”

Section: Related Work (mentioning)

confidence: 99%
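
For reference, the mixup technique (Zhang et al 2018) named in the statement above builds a synthetic training example as a convex combination of two existing ones. The following is a sketch in the usual notation; the symbols $x_i$, $y_i$, $\lambda$ and $\alpha$ are notation introduced here for illustration and do not come from the citing text. For sequence labeling, $x$ would presumably be a sequence of token embeddings and $y$ the matching per-token one-hot labels, which is an assumption about the setup rather than a description of the SeqMix implementation.

```latex
% Mixup of two labeled examples (x_i, y_i) and (x_j, y_j)
\begin{align}
  \lambda   &\sim \mathrm{Beta}(\alpha, \alpha), \qquad \alpha > 0, \\
  \tilde{x} &= \lambda\, x_i + (1 - \lambda)\, x_j, \\
  \tilde{y} &= \lambda\, y_i + (1 - \lambda)\, y_j .
\end{align}
```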

“…On a "phrase-level attack", the authors first choose two subtrees and then maximize the error rate on the target subtree by modifying the tokens in the source subtree. Even though the adversarial example generation techniques (Zheng et al 2020; Han et al 2020) could be used to augment data in theory, the requirements such as a separate seq2seq generator, a BERT based scorer (Zhang et al 2020), reference parsers that are of certain quality, external POS taggers and high quality pretrained BERT (Devlin et al 2019) models, make them challenging to apply on low-resource languages. Besides, most of the aforementioned adversarial attacks are optimized to trigger an undesired change in the output with minimal modifications, while data augmentation is only concerned about increasing the generalization capacity of the model.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore sophisticated techniques that make use of such models are also left out in this paper.

(Jindal et al 2020a) | Emb/Hidden | classification | not suitable
SPEECHMIX (Jindal et al 2020b) | Emb/Hidden | Speech/Audio | not suitable
MIXTEXT | Emb/Hidden | classification | not suitable
SWITCHOUT (Wang et al 2018) | Input | machine translation | not suitable
SIGNEDGRAPH (Chen, Ji, and Evans 2020) | Input | paraphrase | not suitable
DAGA (Ding et al 2020) | Input+Label | sequence tagging | not suitable
SEQMIX (Zhang, Yu, and Zhang 2020) | Input+Label | active sequence labeling | not suitable
GECA (Andreas 2020) | Input | Agnostic | not suitable…”

Section: Augmentation Techniques (mentioning)

confidence: 99%

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Şahin 2022

Computational Linguistics

Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a load of textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies which perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion) and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families using various models including the architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and the model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).

“…[27] constructs a new token sequence by randomly selecting a token from two different sequences at each index. [28] suggests applying mixup in feature space, after an intermediary layer of a pretrained LM. New input data is then generated by reversing the synthetic feature to find the most similar token in the vocabulary.…”

Section: Data Interpolation For Regularization (mentioning)

confidence: 99%
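
The two interpolation ideas summarized in the statement above can be illustrated with a small sketch. It is not code from either cited work: the function names, the equal selection probability, the Euclidean distance used for "most similar", and the toy vocabulary are all assumptions made for illustration.

```python
import random

import numpy as np


def mix_tokens(seq_a, seq_b, rng):
    """Build a new sequence by picking, at each index, the token from
    seq_a or seq_b. Equal probability is an assumption of this sketch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [a if rng.random() < 0.5 else b for a, b in zip(seq_a, seq_b)]


def mix_embeddings(emb_a, emb_b, lam):
    """Linearly interpolate two feature vectors with mixing weight lam."""
    return lam * emb_a + (1.0 - lam) * emb_b


def nearest_token(vec, embedding_matrix, vocab):
    """Map a synthetic feature vector back to the vocabulary token whose
    embedding is closest to it (Euclidean distance, an assumption here)."""
    dists = np.linalg.norm(embedding_matrix - vec, axis=1)
    return vocab[int(np.argmin(dists))]


# Toy, hypothetical vocabulary with 2-dimensional embeddings.
vocab = ["paris", "london", "runs"]
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

rng = random.Random(0)
mixed_sequence = mix_tokens(["paris", "runs"], ["london", "runs"], rng)

mixed_vector = mix_embeddings(embeddings[1], embeddings[2], lam=0.9)
recovered = nearest_token(mixed_vector, embeddings, vocab)
```

With a mixing weight close to 1 the synthetic vector stays near the first embedding, so the recovered token is the first input token; weights near 0.5 are where the nearest-token lookup can land on an unrelated word, which is presumably why a plausibility filter is discussed in the statements above.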

“…While these two works involve interpolating text inputs, our method differs significantly in that we do not directly generate augmented training samples; instead, we utilize mixup as a regularizing layer during the training process. Our method also does not require reversing word-embeddings or discriminative filtering using GPT-2 introduced in [28].…”

Section: Data Interpolation For Regularization (mentioning)

confidence: 99%

Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation

Park, Shin, Paik et al. 2021

Preprint

Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, at times the pretrained knowledge can hinder error detection performance. For instance, the presence of speech disfluencies might confuse the post-processing system into tagging disfluent but accurate transcriptions as ASR errors. Such confusion occurs because both error detection and disfluency detection tasks attempt to identify tokens at statistically unlikely positions. This paper proposes a scheme to improve existing LM-based ASR error detection systems, both in terms of detection scores and resilience to such distracting auxiliary tasks. Our approach adopts the popular mixup method in text feature space and can be utilized with any black-box ASR output. To demonstrate the effectiveness of our method, we conduct post-processing experiments with both traditional and end-to-end ASR systems (for both English and Korean) with 5 different speech corpora. We find that our method both improves ASR error detection F1 scores and reduces the number of correctly transcribed disfluencies wrongly detected as ASR errors. Finally, we suggest methods to utilize resulting LMs directly in semi-supervised ASR training.

MKGB: A Medical Knowledge Graph Construction Framework Based on Data Lake and Active Learning

Ren, Hou, Sheng et al. 2021

Lecture Notes in Computer Science
