Semi-Supervised Data Programming with Subset Selection

Maheshwari, Ayush; Chatterjee, Oishik; Killamsetty, KrishnaTeja; Ramakrishnan, Ganesh; Iyer, Rishabh

doi:10.48550/arxiv.2008.09887

Cited by 2 publications

(2 citation statements)

References 0 publications

“…Another close body of research taps into weak sources of supervision like regular expressions, keywords, and knowledge base alignment (Mintz et al., 2009; Augenstein et al., 2016; Ratner et al., 2017). Researchers have incorporated these weak supervision signals into self-training procedures like ours (Karamanolakis et al., 2021), as well as constructing procedural generators for boosting weak supervision signals (Zhang et al., 2021a) and interactive pipelines for machine-assisted rule construction (Zhang et al., 2022; Galhotra et al., 2021; Maheshwari et al., 2020). There is also research on automatically generating weak labeling functions (Varma and Ré, 2018; Maheshwari et al., 2021) which shares our bag-of-words featurization and regression scoring mechanism.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Rule Induction for Efficient and Interpretable Semi-Supervised Learning

Pryzant, Yang, Yi‐chong et al. 2022

Preprint

Semi-supervised learning has shown promise in allowing NLP models to generalize from small amounts of labeled data. Meanwhile, pretrained transformer models act as black-box correlation engines that are difficult to explain and sometimes behave unreliably. In this paper, we propose tackling both of these challenges via Automatic Rule Induction (ARI), a simple and general-purpose framework for the automatic discovery and integration of symbolic rules into pretrained transformer models. First, we extract weak symbolic rules from low-capacity machine learning models trained on small amounts of labeled data. Next, we use an attention mechanism to integrate these rules into high-capacity pretrained transformer models. Last, the rule-augmented system becomes part of a self-training framework to boost supervision signal on unlabeled data. These steps can be layered beneath a variety of existing weak supervision and semi-supervised NLP algorithms in order to improve performance and interpretability. Experiments across nine sequence classification and relation extraction tasks suggest that ARI can improve state-of-the-art methods with no manual effort and minimal computational overhead.

“…While the classical subset selection problem is NP-hard, we can leverage the diminishing gains property of submodular functions (Fujishige, 2005) and frame subset selection as a submodular maximization problem. Several recent works (Wei et al., 2015; Mirzasoleiman et al., 2020; Kothawade et al., 2021; Karanam et al., 2022; Maheshwari et al., 2020) have formulated the subset selection problem as that of maximizing a submodular objective. However, applying existing subset selection frameworks to PTLMs is nontrivial given the scale of corpora typically used for pre-training (e.g., Wikipedia and Common Crawl consisting of hundreds of millions of sequences and billions of tokens).…”

Section: Introduction (mentioning)

confidence: 99%

INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models

Renduchintala, Killamsetty, Bhatia et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance. Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to ∼99% of the performance of the fully-trained models.
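The submodular subset selection mentioned in the citation statement and abstract above can be illustrated with a small sketch. It assumes a facility-location objective, f(S) = Σ_i max_{j∈S} sim[i][j], which is one common submodular function and is not confirmed here as the objective used by any of the cited papers. The function names (`facility_location`, `greedy_select`, `rbf_similarity`) and the RBF similarity are hypothetical choices made for illustration only.

```python
import math


def rbf_similarity(points):
    """Pairwise similarity exp(-squared distance); an illustrative assumption."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def facility_location(sim, subset):
    """f(S) = sum over items i of the best similarity to any j in S; f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Greedily pick up to k indices, each time taking the largest marginal gain."""
    n = len(sim)
    selected = []
    best = [0.0] * n  # best[i]: similarity of item i to its closest selected item
    for _ in range(min(k, n)):
        top_gain, top_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of j: total improvement in every item's best similarity.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > top_gain:
                top_gain, top_j = gain, j
        selected.append(top_j)
        best = [max(best[i], sim[i][top_j]) for i in range(n)]
    return selected


# Two well-separated clusters of three points each (toy data).
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
sim = rbf_similarity(points)
picked = greedy_select(sim, 2)  # one representative per cluster
```

The marginal gain of a candidate shrinks as the selected set grows, which is the diminishing gains property the citation statement refers to. For monotone submodular objectives under a cardinality constraint, this greedy procedure is known to achieve at least a (1 − 1/e) fraction of the optimal value, which is why it is a common baseline even though exact subset selection is NP-hard.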
