2020
DOI: 10.48550/arxiv.2008.09887
Preprint
Semi-Supervised Data Programming with Subset Selection

Cited by 2 publications (2 citation statements, both of type "mentioning"; published 2022 and 2023) | References: 0 publications
“…Another close body of research taps into weak sources of supervision like regular expressions, keywords, and knowledge base alignment (Mintz et al., 2009; Augenstein et al., 2016; Ratner et al., 2017). Researchers have incorporated these weak supervision signals into self-training procedures like ours (Karamanolakis et al., 2021), as well as constructing procedural generators for boosting weak supervision signals (Zhang et al., 2021a) and interactive pipelines for machine-assisted rule construction (Zhang et al., 2022; Galhotra et al., 2021; Maheshwari et al., 2020). There is also research on automatically generating weak labeling functions (Varma and Ré, 2018; Maheshwari et al., 2021), which shares our bag-of-words featurization and regression scoring mechanism.…”
Section: Related Work | Citation type: mentioning
confidence: 99%
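
To make the quoted passage concrete: a weak labeling function is a small program that votes on a label or abstains, using signals like keywords or regular expressions. The sketch below is a hypothetical illustration in the spirit of this line of work; the task, label constants, and patterns are invented for the example and are not drawn from the cited papers.

```python
import re

# Hypothetical label constants for a binary sentiment task (illustrative only).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword_positive(text: str) -> int:
    """Weak labeling function: vote POSITIVE if a positive keyword appears."""
    return POSITIVE if re.search(r"\b(great|excellent|love)\b", text.lower()) else ABSTAIN

def lf_regex_negative(text: str) -> int:
    """Weak labeling function: vote NEGATIVE on negation patterns like 'not good'."""
    return NEGATIVE if re.search(r"\bnot\s+(good|great|worth)\b", text.lower()) else ABSTAIN

# Applying the rules to a corpus yields a noisy, incomplete label matrix,
# which a label model (or simple majority vote) then aggregates.
corpus = ["This movie was excellent!", "Honestly not worth the time.", "It was fine."]
label_matrix = [[lf(t) for lf in (lf_keyword_positive, lf_regex_negative)] for t in corpus]
print(label_matrix)  # [[1, -1], [-1, 0], [-1, -1]]
```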
“…While the classical subset selection problem is NP-hard, we can leverage the diminishing-gains property of submodular functions (Fujishige, 2005) and frame subset selection as a submodular maximization problem. Several recent works (Wei et al., 2015; Mirzasoleiman et al., 2020; Kothawade et al., 2021; Karanam et al., 2022; Maheshwari et al., 2020) have formulated the subset selection problem as that of maximizing a submodular objective. However, applying existing subset selection frameworks to PTLMs is nontrivial given the scale of corpora typically used for pre-training (e.g., Wikipedia and Common Crawl, consisting of hundreds of millions of sequences and billions of tokens).…”
Section: Introduction | Citation type: mentioning
confidence: 99%
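
The quoted passage relies on the diminishing-gains property: a set function f is submodular if f(A ∪ {v}) - f(A) >= f(B ∪ {v}) - f(B) whenever A ⊆ B and v ∉ B, and for a monotone submodular objective the simple greedy algorithm achieves a (1 - 1/e) approximation to the budget-constrained maximum (Nemhauser et al., 1978). Below is a minimal sketch assuming a facility-location objective over a precomputed similarity matrix, one standard choice for data subset selection and not necessarily the exact objective used in the cited works.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, budget: int) -> list[int]:
    """Greedily maximize the facility-location function
    f(S) = sum_i max_{j in S} sim[i, j],
    a monotone submodular objective often used for data subset selection.

    sim    : (n, n) nonnegative pairwise similarity matrix.
    budget : number of points to select.
    """
    n = sim.shape[0]
    selected: list[int] = []
    # best_cover[i] = similarity of point i to its closest selected point so far.
    best_cover = np.zeros(n)
    for _ in range(budget):
        # Marginal gain of adding candidate j: total improvement in coverage.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-select a chosen point
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy usage on random unit-norm features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = np.clip(X @ X.T, 0.0, None)  # nonnegative cosine similarities
print(greedy_facility_location(sim, budget=10))
```

Each greedy step above scans all candidates, which is where the passage's concern about corpus scale bites: practical implementations use lazy (priority-queue) gain evaluation or work on sampled/partitioned data to make the loop tractable on pre-training-scale corpora.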