Markus Bayer scite author profile

Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy for existing works, this survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and give state-of-the-art references expounding which methods are highly promising by relating them to each other. Finally, research perspectives that may constitute a building block for future work are provided.


Rapid relevance classification of social media posts in disasters and emergencies: A system and evaluation featuring active, incremental and online learning

Kaufhold, Bayer, Reuter. 2020. Information Processing & Management


Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

Bayer, Kaufhold, Buchhold, et al. 2022. Int. J. Mach. Learn. & Cyber.


In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.


CySecAlert: An Alert Generation System for Cyber Security Events Using Open Source Intelligence Data

Riebe, Wirth, Bayer, et al. 2021


OVANA: An Approach to Analyze and Improve the Information Quality of Vulnerability Databases

Kühn, Bayer, Wendelborn, et al. 2021




Markus Bayer

A Survey on Data Augmentation for Text Classification
